Tags: LangChain, LangGraph, FastAPI, Ollama, SSE, AI Agents, Production

Building a Production-Ready LangChain DeepAgent with SSE Streaming

Published April 13, 2026 · 7 min read

AI agents are everywhere — but most demos stop short of the hard parts: observable reasoning, real-time output, stateful conversations, and production-grade rate limiting. I recently built LangChain DeepAgent, a production-ready agent API that tackles all of these head-on.

In this post I'll walk through the architecture, the four specialized skills, how SSE streaming works in practice, and the production patterns that make it reliable.

The Problem

Most LLM integrations are blocking request/response calls: send a query, wait 10–30 seconds, get a response. That's fine for simple Q&A, but it breaks down for autonomous agents that:

Need to reason, plan, search, and write across multiple steps
Should stream output in real-time so users see progress, not a spinner
Must remember context across a multi-turn conversation
Need abuse protection to be safely exposed to users

LangGraph, FastAPI, and SSE, with Redis behind the rate limiter, cover all four cleanly.

Architecture Overview

[Figure: LangChain DeepAgent architecture. FastAPI fronts a LangGraph agent with four skills, an Ollama LLM, and Redis for rate limiting + session memory.]

At a glance:

Client → FastAPI → DeepAgent (LangGraph) → 4 Skills
                         ↕
                       Ollama (llama3.2:3b)
                         ↕
                       Redis (rate limiting + session memory)

Core components:

Layer	Technology	Role
API	FastAPI	Async HTTP, SSE streaming, rate limiting middleware
Agent	LangGraph	Orchestrates agent graph (up to 25 iterations)
LLM	Ollama llama3.2:3b	Local inference, ~2GB model, fully private
Skills	4 custom tools	think, plan, web_search, write_report
Memory	LangGraph MemorySaver	Multi-turn session persistence
Rate Limiting	Redis + fastapi-limiter	10 req/60s per IP

The Four Skills

Each skill is a Python @tool async function paired with a SKILL.md file containing YAML frontmatter. The DeepAgent framework discovers skills at runtime by scanning the skills directory — no hardcoded registration needed.
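To make the discovery step concrete, here is a minimal sketch of scanning a skills directory and reading each SKILL.md's frontmatter. It assumes one folder per skill and flat `key: value` frontmatter; `parse_frontmatter` and `discover_skills` are hypothetical names, not the framework's own API, and real YAML would need a proper YAML parser.

```python
from pathlib import Path


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading '---' fences.

    Only simple scalar values are handled (an assumption of this sketch).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing fence ends the frontmatter
            return meta
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return {}  # no closing fence: treat as missing frontmatter


def discover_skills(skills_dir: Path) -> dict[str, dict[str, str]]:
    """Map each skill name to its metadata by scanning for SKILL.md files."""
    skills: dict[str, dict[str, str]] = {}
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        meta = parse_frontmatter(skill_md.read_text(encoding="utf-8"))
        # Fall back to the folder name when the frontmatter has no `name`
        name = meta.get("name", skill_md.parent.name)
        skills[name] = meta
    return skills
```

Calling `discover_skills(Path("skills"))` at startup would return one metadata entry per skill folder, which is what lets a new skill appear without touching a registry.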

1. Think — Chain-of-Thought Reasoning

The most fundamental skill. It structures the agent's internal reasoning into a predictable format:

[Question] → [Reasoning steps] → [Conclusion]

Useful for complex analysis where you want the agent to show its work rather than jump straight to an answer.
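As a rough illustration of that format, the sketch below renders the three sections as text. The function name and the exact layout are assumptions; the post only specifies the Question, Reasoning steps, Conclusion order.

```python
def format_thought(question: str, steps: list[str], conclusion: str) -> str:
    """Render reasoning as [Question] -> [Reasoning steps] -> [Conclusion]."""
    lines = [f"[Question] {question}", "[Reasoning steps]"]
    # Number the steps so the reader can follow the chain in order
    lines += [f"  {i}. {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"[Conclusion] {conclusion}")
    return "\n".join(lines)
```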

2. Plan — Task Decomposition

Breaks high-level goals into ordered subtasks with tool assignments and success criteria. Given a goal and optional context, it returns a numbered markdown plan:

1. Search for recent papers on the topic (web_search)
2. Identify key themes and findings (think)
3. Structure a professional summary (write_report)
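A plan like the one above could be produced from a small data structure. This is a sketch only: `Subtask` and `render_plan` are hypothetical names, and success criteria are left out to match the example output.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    description: str  # what to do
    tool: str         # which skill should do it


def render_plan(subtasks: list[Subtask]) -> str:
    """Render ordered subtasks as a numbered plan with tool assignments."""
    return "\n".join(
        f"{i}. {task.description} ({task.tool})"
        for i, task in enumerate(subtasks, start=1)
    )
```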

3. Web Search — Real-Time DuckDuckGo

Executes live searches without requiring an API key. The blocking search call runs via asyncio.to_thread() so it does not stall the async event loop, a critical detail when you have concurrent requests.

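import asyncio
from langchain_core.tools import tool

@tool
async def web_search(query: str) -> str:
    """Search the web using DuckDuckGo."""
    # _ddg_search (blocking) and format_results are helpers defined elsewhere in the module
    results = await asyncio.to_thread(_ddg_search, query, max_results=10)
    return format_results(results)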

4. Write Report — Structured Output

A pure formatting skill that wraps research into professional markdown templates: executive summary, key findings, analysis, conclusion — with an automatic timestamp appended.
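A template like that might look roughly as follows. The section names come from the description above; the function name, the placeholder text, and the timestamp format are assumptions of this sketch.

```python
from datetime import datetime, timezone
from typing import Optional

SECTIONS = ("Executive Summary", "Key Findings", "Analysis", "Conclusion")


def render_report(title: str, sections: dict[str, str],
                  now: Optional[datetime] = None) -> str:
    """Wrap section text in a fixed markdown template and append a timestamp."""
    now = now or datetime.now(timezone.utc)
    parts = [f"# {title}", ""]
    for heading in SECTIONS:
        # Missing or empty sections get a visible placeholder instead of vanishing
        body = sections.get(heading, "").strip() or "_Not provided._"
        parts += [f"## {heading}", "", body, ""]
    parts.append(f"Generated: {now.strftime('%Y-%m-%d %H:%M UTC')}")
    return "\n".join(parts)
```

Passing `now` explicitly keeps the output deterministic in tests; in normal use it defaults to the current UTC time.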

SSE Streaming: Watch the Agent Think

Here's the end-to-end shape of a single request and the SSE event timeline the client observes:

[Figure: DeepAgent request + SSE streaming flow. Sequence diagram from client through FastAPI, LangGraph, Ollama, and Redis.]

The /api/v1/agent/stream endpoint streams events as the agent works. Instead of waiting for a complete answer, the client receives a live event stream:

event: token
data: "Let me search for recent developments..."

event: tool_start
data: web_search

event: tool_end
data: "Found 8 relevant results about..."

event: token
data: "Based on the research, here are the key findings..."

event: done
data: {"session_id": "550e8400-e29b-41d4-a716-446655440000"}

Five event types cover the full lifecycle:

token — Individual LLM tokens as they generate
tool_start — Name of the skill being invoked
tool_end — Truncated skill output (max 500 chars)
done — Final JSON payload with session_id
error — Error message if execution fails

The client can render all of this progressively — showing the agent's reasoning, tool calls, and final answer as they happen rather than after a long wait.
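To show the wire format from both sides, here is a small sketch of encoding and decoding these events. One assumption to note: every payload is JSON-encoded so it always fits on a single data line, which differs slightly from the sample above where the tool_start name is sent bare. All function names are hypothetical.

```python
import json
from typing import Iterable, Iterator

MAX_TOOL_OUTPUT = 500  # the 500-char cap described for tool_end


def sse_frame(event: str, data: object) -> str:
    """Encode one Server-Sent Event; a blank line terminates the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def tool_end_frame(output: str) -> str:
    """Truncate skill output before sending it to the client."""
    return sse_frame("tool_end", output[:MAX_TOOL_OUTPUT])


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, object]]:
    """Yield (event, decoded data) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":  # blank line: dispatch the buffered event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

On the server, frames like these would be yielded from an async generator behind a streaming response; on the client, `parse_sse` can consume the response line by line and dispatch on the event name.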

Multi-Turn Session Memory

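LangGraph's MemorySaver checkpointer keeps conversation state between requests, keyed by session ID. Because MemorySaver holds checkpoints in process memory, sessions last only as long as the app process. The pattern is simple: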

# First request — no session_id needed
POST /api/v1/agent/run
{"query": "Research the latest trends in vector databases"}
# Response includes: {"session_id": "abc-123", "answer": "..."}

# Follow-up request — pass the session_id back
POST /api/v1/agent/run
{"query": "Now write a report based on what you found", "session_id": "abc-123"}
# Agent has full context of the previous research turn

The agent maintains the full message history, so follow-ups like "summarise that" or "go deeper on point 3" work naturally without re-sending context.
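Under the hood, a session ID like this typically maps onto LangGraph's `thread_id`, which is the key a checkpointer uses to look up saved state. A minimal sketch of that mapping, with `resolve_session` as a hypothetical helper name:

```python
import uuid
from typing import Optional


def resolve_session(session_id: Optional[str] = None) -> tuple[str, dict]:
    """Return the session ID to use plus the matching LangGraph run config."""
    # First turn: mint a new ID. Follow-up turns: reuse the one the client sent.
    sid = session_id or str(uuid.uuid4())
    # LangGraph checkpointers key saved state by configurable.thread_id
    return sid, {"configurable": {"thread_id": sid}}
```

The returned config would then be passed along with the new user message when invoking the compiled graph, so the checkpointer can restore earlier turns for that thread.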

Redis Rate Limiting

fastapi-limiter wraps Redis to enforce per-IP limits through a route dependency that runs before the handler:

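from fastapi import Depends
from fastapi_limiter.depends import RateLimiter

# router (an APIRouter) and AgentRequest (the request model) are defined elsewhere in the app
@router.post("/run")
async def run_agent(
    request: AgentRequest,
    # Allow at most 10 calls per 60-second window for each client
    _: None = Depends(RateLimiter(times=10, seconds=60)),
):
    ...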

Default: 10 requests per 60-second window per IP
Fully configurable via RATE_LIMIT_REQUESTS / RATE_LIMIT_SECONDS env vars
Returns HTTP 429 when exceeded
Counters auto-expire in Redis — no cleanup needed
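The counting behaviour described in that list can be sketched as a fixed-window counter. This is an in-memory stand-in written for illustration, not fastapi-limiter's actual implementation: the dictionary plays the role of Redis keys with a TTL, and the class name is hypothetical.

```python
import time
from typing import Callable


class FixedWindowLimiter:
    """Allow at most `times` hits per `seconds` window for each key."""

    def __init__(self, times: int = 10, seconds: float = 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.times, self.seconds, self.clock = times, seconds, clock
        self._windows: dict[str, tuple[int, float]] = {}  # key -> (count, expires_at)

    def allow(self, key: str) -> bool:
        now = self.clock()
        count, expires_at = self._windows.get(key, (0, 0.0))
        if now >= expires_at:  # window expired (like a Redis TTL): start fresh
            count, expires_at = 0, now + self.seconds
        count += 1
        self._windows[key] = (count, expires_at)
        return count <= self.times  # False is where the API would answer 429
```

Keying on the client IP gives the per-IP behaviour; injecting `clock` makes the window easy to test without sleeping.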

API Endpoints

POST /api/v1/agent/run — Synchronous execution

curl -X POST http://localhost:8000/api/v1/agent/run \
  -H "Content-Type: application/json" \
  -d '{"query": "Research AI agents and write a report"}'

# Response:
{
  "answer": "...",
  "tool_calls": [
    {"tool": "web_search", "input": "AI agent frameworks 2026", "output": "..."},
    {"tool": "write_report", "input": "...", "output": "..."}
  ],
  "iterations": 4,
  "elapsed_seconds": 8.3,
  "session_id": "550e8400-..."
}

GET /api/v1/agent/stream?query=...&session_id=... — SSE streaming

curl "http://localhost:8000/api/v1/agent/stream?query=Research+vector+databases"
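From Python, the same URL can be built safely with the standard library, which takes care of encoding the query text. A small sketch, where `stream_url` is a hypothetical helper and the base URL is the local default used above:

```python
from typing import Optional
from urllib.parse import urlencode


def stream_url(base_url: str, query: str, session_id: Optional[str] = None) -> str:
    """Build the SSE endpoint URL, adding session_id only for follow-up turns."""
    params = {"query": query}
    if session_id:
        params["session_id"] = session_id
    # urlencode escapes reserved characters and encodes spaces as '+'
    return f"{base_url.rstrip('/')}/api/v1/agent/stream?{urlencode(params)}"
```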

GET /health/ready — Readiness check (Ollama + Redis connectivity)

Deployment

Three services, one command:

git clone https://github.com/aiwithvd/langchain_deepagent.git
cd langchain_deepagent
cp .env.example .env
docker compose up --build

Docker Compose orchestrates:

App — FastAPI on port 8000
Ollama — Pulls llama3.2:3b (~2GB) automatically on first run
Redis — Rate limiting and session storage

The multi-stage Dockerfile separates build dependencies from the runtime image, keeping the production container lean.

Production Patterns Worth Stealing

1. Thread-safe agent singleton: The DeepAgent instance is created once at startup via a factory pattern and shared across requests, which avoids repeated model loading overhead.

2. Async-safe web search: asyncio.to_thread() prevents the synchronous DuckDuckGo library from blocking the event loop under concurrent load.

3. Structured logging: structlog outputs colored text in debug mode and machine-readable JSON in production. Same code, different config.

4. Graceful readiness degradation: /health/ready returns 200 even when Ollama or Redis is unreachable (with status: degraded), letting your orchestrator decide whether to route traffic rather than killing the container.
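Two of these patterns are easy to sketch with the standard library alone: the singleton factory and the degraded readiness check. The names and the response shape below are assumptions for illustration (only the 200 status and `degraded` come from the description above), and the dependency checks are passed in as plain async callables.

```python
import asyncio
import threading
from typing import Awaitable, Callable

_agent = None
_agent_lock = threading.Lock()


def get_agent(factory: Callable[[], object]) -> object:
    """Build the agent once; later calls return the same instance."""
    global _agent
    if _agent is None:
        with _agent_lock:
            if _agent is None:  # re-check: another thread may have won the race
                _agent = factory()
    return _agent


async def readiness(
    checks: dict[str, Callable[[], Awaitable[None]]]
) -> tuple[int, dict]:
    """Probe every dependency concurrently; report problems without failing."""

    async def probe(check: Callable[[], Awaitable[None]]) -> str:
        try:
            await check()
            return "ok"
        except Exception:
            return "unreachable"

    results = await asyncio.gather(*(probe(c) for c in checks.values()))
    status = "ready" if all(r == "ok" for r in results) else "degraded"
    # Always 200: the body, not the status code, carries the health signal
    return 200, {"status": status, "dependencies": dict(zip(checks, results))}
```

In the app, the checks would be small coroutines that ping Ollama and Redis, and the handler would return the body with the given status code.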

Key Takeaways

LangGraph is the right abstraction — Graph-based agent orchestration handles retries, branching, and iteration limits cleanly vs hand-rolled loops
SSE > polling for agents — Streaming tokens and tool events gives users observable progress; polling forces them to wait blind
Session memory is table stakes — Multi-turn context makes agents genuinely useful for research and writing workflows
Local LLMs are production-viable — llama3.2:3b on a single machine handles structured tasks reliably with zero API costs
Rate limit at the middleware layer — Redis + fastapi-limiter is simpler and more reliable than application-level throttling

Code & Resources

Full source code: github.com/aiwithvd/langchain_deepagent
Docker setup: One-command deployment with docker compose up
Skill system: Add new capabilities by dropping a .py + SKILL.md into the skills directory

Want to build production AI agents? Let's connect. I advise teams on agent architecture, LLM integration, and production deployment.