Building a Production-Ready LangChain DeepAgent with SSE Streaming
Building a Production-Ready LangChain DeepAgent with SSE Streaming
AI agents are everywhere — but most demos stop short of the hard parts: observable reasoning, real-time output, stateful conversations, and production-grade rate limiting. I recently built LangChain DeepAgent, a production-ready agent API that tackles all of these head-on.
In this post I'll walk through the architecture, the four specialized skills, how SSE streaming works in practice, and the production patterns that make it reliable.
The Problem
Most LLM integrations are fire-and-forget: send a query, wait 10–30 seconds, get a response. That's fine for simple Q&A, but it breaks down for autonomous agents that:
- Need to reason, plan, search, and write across multiple steps
- Should stream output in real-time so users see progress, not a spinner
- Must remember context across a multi-turn conversation
- Need abuse protection to be safely exposed to users
LangGraph + FastAPI + SSE solves all of this cleanly.
Architecture Overview
At a glance:
Client → FastAPI → DeepAgent (LangGraph) → 4 Skills
↕
Ollama (llama3.2:3b)
↕
Redis (rate limiting + session memory)
Core components:
| Layer | Technology | Role |
|---|---|---|
| API | FastAPI | Async HTTP, SSE streaming, rate limiting middleware |
| Agent | LangGraph | Orchestrates agent graph (up to 25 iterations) |
| LLM | Ollama llama3.2:3b | Local inference, ~2GB model, fully private |
| Skills | 4 custom tools | think, plan, web_search, write_report |
| Memory | LangGraph MemorySaver | Multi-turn session persistence |
| Rate Limiting | Redis + fastapi-limiter | 10 req/60s per IP |
The Four Skills
Each skill is a Python @tool async function paired with a SKILL.md file containing YAML frontmatter. The DeepAgent framework discovers skills at runtime by scanning the skills directory — no hardcoded registration needed.
1. Think — Chain-of-Thought Reasoning
The most fundamental skill. It structures the agent's internal reasoning into a predictable format:
[Question] → [Reasoning steps] → [Conclusion]
Useful for complex analysis where you want the agent to show its work rather than jump straight to an answer.
2. Plan — Task Decomposition
Breaks high-level goals into ordered subtasks with tool assignments and success criteria. Given a goal and optional context, it returns a numbered markdown plan:
1. Search for recent papers on the topic (web_search)
2. Identify key themes and findings (think)
3. Structure a professional summary (write_report)
3. Web Search — Real-Time DuckDuckGo
Executes live searches without requiring an API key. The implementation runs in asyncio.to_thread() to avoid blocking the async event loop — a critical detail when you have concurrent requests.
@tool async def web_search(query: str) -> str: """Search the web using DuckDuckGo.""" results = await asyncio.to_thread(_ddg_search, query, max_results=10) return format_results(results)
4. Write Report — Structured Output
A pure formatting skill that wraps research into professional markdown templates: executive summary, key findings, analysis, conclusion — with an automatic timestamp appended.
SSE Streaming: Watch the Agent Think
Here's the end-to-end shape of a single request and the SSE event timeline the client observes:
The /api/v1/agent/stream endpoint streams events as the agent works. Instead of waiting for a complete answer, the client receives a live event stream:
event: token
data: "Let me search for recent developments..."
event: tool_start
data: web_search
event: tool_end
data: "Found 8 relevant results about..."
event: token
data: "Based on the research, here are the key findings..."
event: done
data: {"session_id": "550e8400-e29b-41d4-a716-446655440000"}
Five event types cover the full lifecycle:
token— Individual LLM tokens as they generatetool_start— Name of the skill being invokedtool_end— Truncated skill output (max 500 chars)done— Final JSON payload withsession_iderror— Error message if execution fails
The client can render all of this progressively — showing the agent's reasoning, tool calls, and final answer as they happen rather than after a long wait.
Multi-Turn Session Memory
LangGraph's MemorySaver persists conversation state across requests. The pattern is simple:
# First request — no session_id needed POST /api/v1/agent/run {"query": "Research the latest trends in vector databases"} # Response includes: {"session_id": "abc-123", "answer": "..."} # Follow-up request — pass the session_id back POST /api/v1/agent/run {"query": "Now write a report based on what you found", "session_id": "abc-123"} # Agent has full context of the previous research turn
The agent maintains the full message history, so follow-ups like "summarise that" or "go deeper on point 3" work naturally without re-sending context.
Redis Rate Limiting
fastapi-limiter wraps Redis to enforce per-IP limits at the middleware layer:
from fastapi_limiter.depends import RateLimiter @router.post("/run") async def run_agent( request: AgentRequest, _: None = Depends(RateLimiter(times=10, seconds=60)) ): ...
- Default: 10 requests per 60-second window per IP
- Fully configurable via
RATE_LIMIT_REQUESTS/RATE_LIMIT_SECONDSenv vars - Returns HTTP 429 when exceeded
- Counters auto-expire in Redis — no cleanup needed
API Endpoints
POST /api/v1/agent/run — Synchronous execution
curl -X POST http://localhost:8000/api/v1/agent/run \ -H "Content-Type: application/json" \ -d '{"query": "Research AI agents and write a report"}' # Response: { "answer": "...", "tool_calls": [ {"tool": "web_search", "input": "AI agent frameworks 2026", "output": "..."}, {"tool": "write_report", "input": "...", "output": "..."} ], "iterations": 4, "elapsed_seconds": 8.3, "session_id": "550e8400-..." }
GET /api/v1/agent/stream?query=...&session_id=... — SSE streaming
curl "http://localhost:8000/api/v1/agent/stream?query=Research+vector+databases"
GET /health/ready — Readiness check (Ollama + Redis connectivity)
Deployment
Three services, one command:
git clone https://github.com/aiwithvd/langchain_deepagent.git cd langchain_deepagent cp .env.example .env docker compose up --build
Docker Compose orchestrates:
- App — FastAPI on port 8000
- Ollama — Pulls
llama3.2:3b(~2GB) automatically on first run - Redis — Rate limiting and session storage
The multi-stage Dockerfile separates build dependencies from the runtime image, keeping the production container lean.
Production Patterns Worth Stealing
1. Thread-safe agent singleton
The DeepAgent instance is created once at startup via a factory pattern and shared across requests — avoids repeated model loading overhead.
2. Async-safe web search
asyncio.to_thread() prevents the synchronous DuckDuckGo library from blocking the event loop under concurrent load.
3. Structured logging
structlog outputs colored text in debug mode and machine-readable JSON in production — same code, different config.
4. Graceful readiness degradation
/health/ready returns 200 even when Ollama or Redis is unreachable (with status: degraded), letting your orchestrator decide whether to route traffic rather than killing the container.
Key Takeaways
- LangGraph is the right abstraction — Graph-based agent orchestration handles retries, branching, and iteration limits cleanly vs hand-rolled loops
- SSE > polling for agents — Streaming tokens and tool events gives users observable progress; polling forces them to wait blind
- Session memory is table stakes — Multi-turn context makes agents genuinely useful for research and writing workflows
- Local LLMs are production-viable — llama3.2:3b on a single machine handles structured tasks reliably with zero API costs
- Rate limit at the middleware layer — Redis + fastapi-limiter is simpler and more reliable than application-level throttling
Code & Resources
- Full source code: github.com/aiwithvd/langchain_deepagent
- Docker setup: One-command deployment with
docker compose up - Skill system: Add new capabilities by dropping a
.py+SKILL.mdinto the skills directory
Want to build production AI agents? Let's connect. I advise teams on agent architecture, LLM integration, and production deployment.