Memory Management for AI Agents: Context Window Optimization Without Token Bankruptcy
I got the invoice at the end of March: $2,847 for LLM API calls. For a side project. The culprit? I was appending every conversation turn to the context window. After 50 multi-turn sessions, I was sending 50,000 tokens per request.
You cannot just append everything to the prompt forever. Context windows have limits (128k tokens for GPT-4 Turbo), and cost scales linearly with token count. In this post, I'll show you three memory management strategies, plus a hybrid of all three, that cut my costs by up to 70% while improving response quality.
Understanding the Problem
The math:
- GPT-4 (8k context): ~$0.03 per 1K input tokens, $0.06 per 1K output tokens
- 4 characters ≈ 1 token (rough estimate)
- A 10-turn conversation history: ~3,000 tokens
- Cost per request: 3,000 × $0.03/1K = $0.09
- At 1,000 requests/day: $90/day = $2,700/month
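The arithmetic above can be sketched as a small calculator so you can plug in your own numbers. The function names and the 30-day month are assumptions made for illustration; the prices are the figures quoted in the list.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token count using the ~4 characters per token rule of thumb."""
    return len(text) // chars_per_token


def input_cost_usd(tokens: int, price_per_1k: float = 0.03) -> float:
    """Input-side cost of a single request at a per-1K-token price."""
    return tokens / 1000 * price_per_1k


def monthly_cost_usd(tokens_per_request: int, requests_per_day: int,
                     price_per_1k: float = 0.03, days: int = 30) -> float:
    """Input-side cost over a month, assuming constant daily traffic."""
    return input_cost_usd(tokens_per_request, price_per_1k) * requests_per_day * days


per_request = input_cost_usd(3_000)         # 10-turn history, ~3,000 tokens
per_month = monthly_cost_usd(3_000, 1_000)  # 1,000 requests/day
print(f"${per_request:.2f} per request, ${per_month:,.0f} per month")
# prints: $0.09 per request, $2,700 per month
```

Output tokens are left out on purpose: history only inflates the input side of the bill.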
The constraint:
- GPT-4 Turbo context window: 128k tokens
- Claude 3: 200k tokens
- But costs don't care about limits—you pay for every token
In the DeepAgent project, I implemented multi-turn conversation support with LangGraph. Without memory management, sessions with 20+ turns would cost $0.50+ per message.
Memory Strategy #1: Sliding Window
Keep only the last N messages. Simple, predictable, effective for short sessions.
Implementation:
import time
from dataclasses import dataclass
from typing import Dict, List

from openai import OpenAI

# Assumption: llm_client is an OpenAI v1 client (reads OPENAI_API_KEY from the environment)
llm_client = OpenAI()


@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str
    timestamp: float


class SlidingWindowMemory:
    def __init__(self, max_messages: int = 10):
        self.max_messages = max_messages
        self.messages: List[Message] = []

    def add_message(self, role: str, content: str) -> None:
        """Add message and maintain window size."""
        message = Message(role=role, content=content, timestamp=time.time())
        self.messages.append(message)
        # Drop the oldest messages once over the limit
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> List[Dict[str, str]]:
        """Return messages formatted for LLM API."""
        return [
            {"role": msg.role, "content": msg.content}
            for msg in self.messages
        ]

    def estimate_tokens(self) -> int:
        """Rough token estimation."""
        total_chars = sum(len(msg.content) for msg in self.messages)
        return total_chars // 4  # ~4 chars per token


# Usage in a chat handler (call this from your LangGraph node)
class AgentState:
    def __init__(self):
        self.memory = SlidingWindowMemory(max_messages=10)
        self.session_id: str = ""


def chat_node(state: AgentState, user_input: str):
    # Add user message
    state.memory.add_message("user", user_input)
    # Get sliding window context
    messages = state.memory.get_context()
    # Call LLM with limited context
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
    )
    # Store assistant response
    state.memory.add_message("assistant", response.choices[0].message.content)
    return response
Trade-offs:
- ✅ Simple to implement
- ✅ Predictable token costs
- ✅ Low latency (no retrieval overhead)
- ❌ Loses older context
- ❌ Can't reference earlier conversation facts
- ❌ Breaks long-running sessions (>20 turns)
When to use:
- Short sessions (<10 turns)
- Real-time latency requirements
- Stateless conversation patterns
- Cost-sensitive applications
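A variant worth considering: a count-based window keeps costs predictable only if messages are roughly the same size. The sketch below trims by an estimated token budget instead of by message count. It is not the implementation used above; `approx_tokens` and `trim_to_token_budget` are hypothetical helper names, and the 4-characters-per-token estimate is the same rough heuristic as before.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """~4 characters per token; at least 1 so empty messages still count."""
    return max(1, len(text) // 4)


def trim_to_token_budget(messages: List[Dict[str, str]],
                         max_tokens: int) -> List[Dict[str, str]]:
    """Keep the newest messages whose combined estimate fits max_tokens."""
    kept: List[Dict[str, str]] = []
    used = 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = approx_tokens(msg["content"])
        if used + cost > max_tokens:
            break                       # stop at the first message that does not fit
        kept.append(msg)
        used += cost
    kept.reverse()                      # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 800},  # ~200 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
window = trim_to_token_budget(history, max_tokens=300)
# window holds the last two messages; the oldest one no longer fits
```

Stopping at the first message that does not fit keeps the window contiguous, so the model never sees a reply without the message that preceded it.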
Memory Strategy #2: Summarization
Once context reaches a threshold, summarize older messages and replace them with the summary.
Implementation:
import time
from typing import Dict, List

# Message is the dataclass from the sliding-window example;
# llm_client is assumed to be an OpenAI-style chat client.


class SummarizationMemory:
    def __init__(
        self,
        max_messages: int = 10,  # not used by the logic below
        token_threshold: int = 3000,
        summary_model: str = "gpt-3.5-turbo",  # Cheaper for summaries
    ):
        self.max_messages = max_messages
        self.token_threshold = token_threshold
        self.summary_model = summary_model
        self.messages: List[Message] = []
        self.summary: str = ""

    def add_message(self, role: str, content: str) -> None:
        message = Message(role=role, content=content, timestamp=time.time())
        self.messages.append(message)
        # Check if we need to summarize
        if self.estimate_tokens() > self.token_threshold:
            self._summarize_older_messages()

    def _summarize_older_messages(self) -> None:
        """Summarize oldest half of messages."""
        if len(self.messages) < 4:
            return
        # Take oldest 50% for summarization
        midpoint = len(self.messages) // 2
        to_summarize = self.messages[:midpoint]
        to_keep = self.messages[midpoint:]

        # Create summary prompt
        conversation_text = "\n".join(
            f"{msg.role}: {msg.content}" for msg in to_summarize
        )
        summary_prompt = (
            "Summarize this conversation concisely, "
            "preserving key facts and context:\n\n"
            f"{conversation_text}\n\nSummary:"
        )

        # Generate summary with cheaper model
        summary_response = llm_client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=200,
        )
        new_summary = summary_response.choices[0].message.content

        # Append to the existing summary (flat, so labels do not nest on each pass)
        if self.summary:
            self.summary = f"{self.summary}\n{new_summary}"
        else:
            self.summary = new_summary

        # Replace summarized messages
        self.messages = to_keep

    def get_context(self) -> List[Dict[str, str]]:
        """Return summary + recent messages."""
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}",
            })
        context.extend(
            {"role": msg.role, "content": msg.content}
            for msg in self.messages
        )
        return context

    def estimate_tokens(self) -> int:
        summary_tokens = len(self.summary) // 4 if self.summary else 0
        message_tokens = sum(len(msg.content) // 4 for msg in self.messages)
        return summary_tokens + message_tokens
Cost analysis:
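- Summarization call: ~200 tokens in, 150 tokens out (input grows with the amount of history being summarized)
- Cost per summary: $0.006 + $0.009 = $0.015 at the GPT-4 rates above; less with the cheaper summary model
- Savings: Reduces 4,000 tokens to 2,000 tokens (50%)
- Break-even: Each later request saves ~$0.06 (2,000 tokens × $0.03/1K), so one summary pays for itself on the next request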
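To redo this cost analysis with your own traffic and prices, a minimal sketch follows. The function names are hypothetical; the inputs in the example are the figures from the list above, priced at the GPT-4 rates quoted earlier.

```python
import math


def summary_cost_usd(tokens_in: int, tokens_out: int,
                     in_price_per_1k: float, out_price_per_1k: float) -> float:
    """One-off cost of a single summarization call."""
    return tokens_in / 1000 * in_price_per_1k + tokens_out / 1000 * out_price_per_1k


def requests_to_break_even(summary_cost: float, tokens_saved_per_request: int,
                           main_in_price_per_1k: float) -> int:
    """Number of later requests needed before saved input tokens cover the summary."""
    saved_per_request = tokens_saved_per_request / 1000 * main_in_price_per_1k
    return math.ceil(summary_cost / saved_per_request)


cost = summary_cost_usd(200, 150, in_price_per_1k=0.03, out_price_per_1k=0.06)
n = requests_to_break_even(cost, tokens_saved_per_request=2_000,
                           main_in_price_per_1k=0.03)
print(f"summary costs ${cost:.3f}, break-even after {n} request(s)")
```

Swap in the summary model's own prices for the first call if you use a cheaper model there; the savings side is always priced at the main model's input rate.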
Trade-offs:
- ✅ Retains key facts from long conversations
- ✅ Reduces token costs significantly
- ✅ Works for medium-length sessions
- ❌ Adds latency (an extra LLM call, which blocks the turn unless you move it to a background task)
- ❌ Summary can lose nuance
- ❌ Additional cost for summary generation
When to use:
- Medium sessions (10-30 turns)
- Need to preserve key facts
- Can tolerate ~500ms latency hit
- Long-form conversations (interviews, consultations)
Memory Strategy #3: Vectorized Memory (The Smart Approach)
Store conversation embeddings in a vector database. Retrieve only semantically relevant past messages for each turn.
Architecture:
import time
import uuid
from typing import Dict, List

from sentence_transformers import SentenceTransformer

# Message is the dataclass from the sliding-window example.


class VectorizedMemory:
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        vector_db_client=None,  # Pinecone-style index; Milvus or Chroma need a thin adapter
        top_k: int = 5,
    ):
        if vector_db_client is None:
            raise ValueError("vector_db_client is required")
        self.embedding_model = SentenceTransformer(embedding_model)
        self.vector_db = vector_db_client
        self.top_k = top_k
        self.recent_messages: List[Message] = []  # Keep last 3 for recency
        self.max_recent = 3

    def add_message(self, role: str, content: str, session_id: str) -> None:
        """Store message in vector DB with embedding."""
        message = Message(role=role, content=content, timestamp=time.time())
        # Generate embedding
        embedding = self.embedding_model.encode(content)
        # Store in vector DB (uuid avoids ID collisions within the same millisecond)
        message_id = f"{session_id}_{uuid.uuid4().hex}"
        self.vector_db.upsert(
            vectors=[{
                "id": message_id,
                "values": embedding.tolist(),
                "metadata": {
                    "role": role,
                    "content": content,
                    "timestamp": message.timestamp,
                    "session_id": session_id,
                },
            }]
        )
        # Maintain recent messages
        self.recent_messages.append(message)
        if len(self.recent_messages) > self.max_recent:
            self.recent_messages = self.recent_messages[-self.max_recent:]

    def get_relevant_context(
        self, current_query: str, session_id: str
    ) -> List[Dict[str, str]]:
        """Retrieve semantically relevant past messages."""
        # Get query embedding
        query_embedding = self.embedding_model.encode(current_query)
        # Search vector DB
        results = self.vector_db.query(
            vector=query_embedding.tolist(),
            top_k=self.top_k,
            filter={"session_id": session_id},  # Only this session
            include_metadata=True,  # Pinecone omits metadata unless asked
        )
        # Format results
        relevant_messages = [
            {"role": match.metadata["role"], "content": match.metadata["content"]}
            for match in results.matches
        ]
        # Combine with recent messages (recency bias)
        recent_context = [
            {"role": msg.role, "content": msg.content}
            for msg in self.recent_messages
        ]
        # Deduplicate while preserving order
        seen = set()
        combined = []
        for msg in recent_context + relevant_messages:
            key = (msg["role"], msg["content"])
            if key not in seen:
                seen.add(key)
                combined.append(msg)
        return combined

    def estimate_tokens(self, context: List[Dict[str, str]]) -> int:
        """Estimate tokens for context."""
        total_chars = sum(len(msg["content"]) for msg in context)
        return total_chars // 4


# Usage with LangGraph
class LangGraphVectorMemory:
    """Vectorized memory for LangGraph state management."""

    def __init__(self, session_id: str, vector_db_client):
        self.session_id = session_id
        self.memory = VectorizedMemory(vector_db_client=vector_db_client)
        self.short_term: List[Dict[str, str]] = []  # Last 2 turns for immediate context

    def add_turn(self, user_input: str, assistant_response: str) -> None:
        """Add conversation turn to memory."""
        self.memory.add_message("user", user_input, self.session_id)
        self.memory.add_message("assistant", assistant_response, self.session_id)
        # Update short-term memory
        self.short_term.append({"role": "user", "content": user_input})
        self.short_term.append({"role": "assistant", "content": assistant_response})
        if len(self.short_term) > 4:  # Keep last 2 exchanges
            self.short_term = self.short_term[-4:]

    def get_context_for_query(self, query: str) -> List[Dict[str, str]]:
        """Get context combining short-term + relevant historical."""
        # Get semantically relevant historical messages
        relevant = self.memory.get_relevant_context(query, self.session_id)
        # Short-term is always included; skip retrieved messages that repeat it
        seen = {(msg["role"], msg["content"]) for msg in self.short_term}
        older = [
            msg for msg in relevant
            if (msg["role"], msg["content"]) not in seen
        ]
        return self.short_term + older
Cost analysis:
- Vector DB: ~$0.10/1M vectors (Pinecone) or free (Chroma local)
- Embedding model: Free (local) or $0.0001/1K tokens (OpenAI)
- Typical retrieval: Top 5 messages × 200 tokens = 1,000 tokens
- Savings: 60-70% vs full conversation history
Trade-offs:
- ✅ Most token-efficient for long sessions
- ✅ Retrieves relevant context, ignores noise
- ✅ Scales to 100+ turn sessions
- ❌ Adds complexity (vector DB infrastructure)
- ❌ Retrieval latency (~50-100ms)
- ❌ Requires embedding model
When to use:
- Long sessions (20+ turns)
- Need to reference specific past facts
- Domain-specific conversations (support, consultations)
- When you have vector DB infrastructure
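If you don't have vector DB infrastructure yet, the retrieval step itself is small enough to prototype in memory. The sketch below shows cosine-similarity top-k search with a session filter, which is roughly what the vector DB does on each query. `InMemoryVectorStore` is a hypothetical class written for this illustration, not a drop-in replacement for any real client, and the 2-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Brute-force store: fine for prototypes, linear scan per query."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], Dict[str, str]]] = []

    def add(self, vector: Sequence[float], metadata: Dict[str, str]) -> None:
        self._items.append((list(vector), dict(metadata)))

    def query(self, vector: Sequence[float], top_k: int = 5,
              session_id: Optional[str] = None) -> List[Dict[str, str]]:
        """Return metadata of the top_k most similar items, best first."""
        scored = [
            (cosine_similarity(vector, stored), meta)
            for stored, meta in self._items
            if session_id is None or meta.get("session_id") == session_id
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [meta for _, meta in scored[:top_k]]


store = InMemoryVectorStore()
store.add([1.0, 0.0], {"session_id": "s1", "content": "billing question"})
store.add([0.0, 1.0], {"session_id": "s1", "content": "shipping question"})
store.add([1.0, 0.1], {"session_id": "s2", "content": "other user's billing"})

hits = store.query([0.9, 0.1], top_k=1, session_id="s1")
```

Without the `session_id` filter, the closest match in this example belongs to a different session, which is the failure mode the metadata filter exists to prevent.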
The Hybrid Approach: What Actually Works in Production
After testing all three in production, here's the winning combination:
from typing import Dict, List, Optional

# SlidingWindowMemory, VectorizedMemory, and llm_client come from the earlier examples.


class HybridMemoryManager:
    """
    Production memory strategy:
    - Always keep last 3 turns (immediate context)
    - Summarize turns 4-10 (medium-term context)
    - Vector retrieval for turns 11+ (long-term context)
    """

    def __init__(self, session_id: str, vector_db_client=None):
        self.session_id = session_id
        # Short-term: Sliding window (always included)
        self.short_term = SlidingWindowMemory(max_messages=6)  # 3 turns
        # Medium-term: Summarization
        self.summary = ""
        self.messages_since_summary = 0
        # Long-term: Vectorized storage (pass your vector DB client through)
        self.vector_memory = VectorizedMemory(vector_db_client=vector_db_client)

    def add_message(self, role: str, content: str) -> None:
        # Always add to short-term
        self.short_term.add_message(role, content)
        # Add to vector DB for long-term retrieval
        self.vector_memory.add_message(role, content, self.session_id)
        self.messages_since_summary += 1
        # Trigger summarization every 4 messages (2 turns)
        if self.messages_since_summary >= 4 and len(self.short_term.messages) >= 6:
            self._update_summary()

    def _update_summary(self) -> None:
        """Summarize older messages."""
        # Get messages to summarize (oldest 2 turns still in the window)
        to_summarize = self.short_term.messages[:4]
        conversation = "\n".join(
            f"{msg.role}: {msg.content}" for msg in to_summarize
        )
        summary_prompt = f"Summarize key facts from this conversation:\n\n{conversation}"
        # Use cheaper model for summary
        response = llm_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=150,
        )
        new_summary = response.choices[0].message.content
        if self.summary:
            self.summary = f"{self.summary} | {new_summary}"
        else:
            self.summary = new_summary
        self.messages_since_summary = 0

    def get_context(self, current_query: Optional[str] = None) -> List[Dict[str, str]]:
        """Build context from all three memory layers."""
        context = []
        # Layer 1: Summary (if exists)
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation context: {self.summary}",
            })
        # Layer 2: Short-term (recent 3 turns)
        context.extend(self.short_term.get_context())
        # Layer 3: Vector retrieval (if query provided)
        if current_query:
            relevant = self.vector_memory.get_relevant_context(
                current_query, self.session_id
            )
            # Filter out messages already in short-term
            short_term_contents = {msg.content for msg in self.short_term.messages}
            for msg in relevant:
                if msg["content"] not in short_term_contents:
                    context.append(msg)
        return context

    def estimate_cost_savings(self) -> Dict[str, float]:
        """Calculate cost vs naive approach."""
        # Assume a 50-message history at ~300 tokens per message
        naive_tokens = 50 * 300
        # Hybrid approach
        summary_tokens = len(self.summary) // 4 if self.summary else 0
        short_term_tokens = self.short_term.estimate_tokens()
        vector_retrieval_tokens = 5 * 200  # Top 5 × 200 tokens
        total_tokens = summary_tokens + short_term_tokens + vector_retrieval_tokens
        savings = naive_tokens - total_tokens
        cost_savings = (savings / 1000) * 0.03  # $0.03 per 1K tokens
        return {
            "naive_tokens": naive_tokens,
            "hybrid_tokens": total_tokens,
            "savings_tokens": savings,
            "cost_savings_per_request": cost_savings,
            "percent_reduction": (savings / naive_tokens) * 100,
        }
Real Results from DeepAgent
Before (no memory management):
- 20-turn session: ~6,000 tokens per request
- Cost per request: $0.18
- At 500 such requests/day: $90/day = $2,700/month
After (hybrid approach):
- 20-turn session: ~2,500 tokens per request
- Cost per request: $0.075
- At 500 such requests/day: $37.50/day = $1,125/month
- Savings: $1,575/month (58% reduction)
Latency impact:
- Vector retrieval: +80ms
- Summary generation: +400ms (every 4 messages)
- Net impact: +~100ms average (acceptable for chat)
Decision Framework: Which Strategy to Use?
| Criteria | Sliding Window | Summarization | Vectorized | Hybrid |
|---|---|---|---|---|
| Session length | < 10 turns | 10-30 turns | 20+ turns | Any length |
| Latency requirement | < 200ms | < 500ms | < 300ms | < 400ms |
| Cost sensitivity | High | Medium | High | High |
| Infrastructure | Minimal | Minimal | Vector DB | Vector DB |
| Context quality | Low | Medium | High | Highest |
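One way to read the table is as a small selection function. The sketch below is an interpretation, not a rule taken from DeepAgent: the function name is hypothetical, the turn thresholds come from the "Session length" row, and the overlap between the 10-30 and 20+ ranges is resolved in favour of the hybrid when a vector DB is available.

```python
def choose_strategy(expected_turns: int, has_vector_db: bool) -> str:
    """Pick a memory strategy from expected session length and infrastructure."""
    if expected_turns < 10:
        return "sliding_window"   # short sessions: simplest and fastest
    if not has_vector_db or expected_turns < 20:
        return "summarization"    # no vector DB needed
    return "hybrid"               # long sessions with a vector DB available


for turns, has_db in [(5, False), (15, True), (40, True), (40, False)]:
    print(turns, has_db, choose_strategy(turns, has_db))
```

Latency and context-quality requirements are deliberately left out; add them as extra parameters if they drive the decision in your application.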
My recommendation:
- Start with Sliding Window (simple, effective)
- Add Summarization at month 2 (when sessions get longer)
- Migrate to Hybrid at month 6 (when you have vector DB infrastructure)
Implementation with LangGraph
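LangGraph's built-in checkpointing takes care of persisting each thread's message history. It stores state but does not trim it, so one of the strategies above still has to run inside your nodes: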
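from langgraph.graph import StateGraph, MessagesState, START
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import tools_condition
from langchain_core.messages import HumanMessage

# agent_node and tool_node are assumed to be defined elsewhere.
# The state schema needs a `messages` key; MessagesState provides one
# (the plain AgentState class from the sliding-window example will not work here).

# Configure memory
memory = MemorySaver()

# Build graph with checkpointing
workflow = StateGraph(MessagesState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", tools_condition)  # routes to "tools" or END
workflow.add_edge("tools", "agent")

# Add memory checkpointing
app = workflow.compile(checkpointer=memory)

# Each thread gets isolated memory
thread_id = "user_123_session_456"
config = {"configurable": {"thread_id": thread_id}}

# State is checkpointed per thread on every run
result = app.invoke(
    {"messages": [HumanMessage(content="Hello")]},
    config=config,
)

# Memory persists across invocations
result2 = app.invoke(
    {"messages": [HumanMessage(content="What did I ask earlier?")]},
    config=config,  # Same thread_id = same memory
)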
The "Don't" List
Learn from my mistakes:
- ❌ Don't append every message forever (bankruptcy)
- ❌ Don't use sliding window for long-form consultations (lose critical context)
- ❌ Don't summarize with GPT-4 (use GPT-3.5 or local models)
- ❌ Don't store embeddings without metadata filtering (retrieval quality degrades)
- ❌ Don't forget to clear memory between sessions (privacy, cost)
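The last item is easy to forget because nothing breaks when you skip it. A minimal sketch of per-session memory with an idle timeout follows. `SessionStore` is a hypothetical helper, the 30-minute default is an arbitrary assumption, and it only drops in-process memory: records already written to a vector DB need their own delete-by-session call.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

T = TypeVar("T")


class SessionStore(Generic[T]):
    """Holds one memory object per session and evicts idle sessions."""

    def __init__(self, factory: Callable[[], T], ttl_seconds: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._factory = factory
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the timeout can be tested
        self._sessions: Dict[str, Tuple[float, T]] = {}

    def get(self, session_id: str) -> T:
        """Return the session's memory, creating it if missing or expired."""
        self.evict_expired()
        entry = self._sessions.get(session_id)
        memory = entry[1] if entry else self._factory()
        self._sessions[session_id] = (self._clock(), memory)  # refresh last-seen time
        return memory

    def clear(self, session_id: str) -> None:
        """Call on logout."""
        self._sessions.pop(session_id, None)

    def evict_expired(self) -> int:
        """Drop sessions idle longer than the TTL; returns how many were dropped."""
        now = self._clock()
        expired = [sid for sid, (last_seen, _) in self._sessions.items()
                   if now - last_seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)


# Example with plain lists standing in for a memory class
sessions: SessionStore[list] = SessionStore(factory=list, ttl_seconds=1800)
sessions.get("user_123").append({"role": "user", "content": "Hello"})
sessions.clear("user_123")  # logout: the next get() starts from an empty memory
```

In the examples above, the factory would be something like `lambda: SlidingWindowMemory(max_messages=10)`.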
Production Checklist
Before shipping multi-turn agents:
- Memory strategy selected based on use case
- Token usage monitored per session
- Cost alerts set (> $0.10 per session)
- Session memory cleared on logout/timeout
- Vector DB indexed with session_id filter
- Fallback to sliding window if vector DB fails
- Embedding model cached (don't reload per request)
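The fallback item deserves a concrete shape, since a vector DB outage should degrade answers, not fail requests. Below is a minimal sketch, assuming retrieval is exposed as a callable that takes the query and returns message dicts. `build_context` and its parameters are hypothetical names, and catching every `Exception` is a deliberate simplification you may want to narrow.

```python
from typing import Callable, Dict, List

Messages = List[Dict[str, str]]


def build_context(retrieve: Callable[[str], Messages], recent: Messages,
                  query: str, max_recent: int = 10) -> Messages:
    """Retrieved history followed by the recent window; window only if retrieval fails."""
    window = list(recent[-max_recent:])
    try:
        retrieved = retrieve(query)
    except Exception:
        # Vector DB unavailable: degrade to a plain sliding window
        return window
    # Skip retrieved messages that the window already contains
    seen = {(m["role"], m["content"]) for m in window}
    older: Messages = []
    for m in retrieved:
        key = (m["role"], m["content"])
        if key not in seen:
            seen.add(key)
            older.append(m)
    return older + window


def broken_retrieve(query: str) -> Messages:
    raise ConnectionError("vector DB down")


recent = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
fallback = build_context(broken_retrieve, recent, "hi")
# fallback equals the recent window, so the request still goes through
```

Placing retrieved messages before the window keeps the latest exchange at the end of the prompt. Log the exception in a real system so the degraded mode shows up in monitoring.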
Next Steps
Memory management keeps costs down. But how do you know if your agent is actually good? In the next post, I'll cover evaluation frameworks—LLM-as-judge, deterministic checks, and A/B testing for generative AI.
Code examples: DeepAgent GitHub