Tags: AI Agent Memory, Context Window, LangGraph, Vectorized Memory, Token Optimization, Multi-Turn Conversations

Memory Management for AI Agents: Context Window Optimization Without Token Bankruptcy

Published April 23, 2026 · 11 min read

I got the invoice at the end of March: $2,847 for LLM API calls. For a side project. The culprit? I was appending every conversation turn to the context window. After 50 multi-turn sessions, I was sending 50,000 tokens per request.

You cannot just append everything to the prompt forever. Context windows have limits (128k tokens for GPT-4 Turbo) and costs scale linearly with token count. In this post, I'll show you three memory management strategies that cut my costs by roughly 58% while improving response quality.

Understanding the Problem

The math:

  • GPT-4: ~$0.03 per 1K input tokens, $0.06 per 1K output tokens
  • 4 characters ≈ 1 token (rough estimate)
  • A 10-turn conversation history: ~3,000 tokens
  • Cost per request: 3,000 × $0.03/1K = $0.09
  • At 1,000 requests/day: $90/day = $2,700/month
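The arithmetic above is easy to wrap in a helper so you can plug in your own numbers. This is a minimal sketch: the function names are made up for illustration, the default price is the $0.03 per 1K input tokens quoted above, and a 30-day month is assumed.

```python
def request_cost(input_tokens: int, price_per_1k: float = 0.03) -> float:
    """Input-token cost of a single request, in dollars."""
    return input_tokens / 1000 * price_per_1k


def monthly_cost(input_tokens: int, requests_per_day: int,
                 price_per_1k: float = 0.03, days: int = 30) -> float:
    """Input-token cost of a month of identical requests (assumes 30 days)."""
    return request_cost(input_tokens, price_per_1k) * requests_per_day * days


if __name__ == "__main__":
    per_request = request_cost(3_000)
    per_month = monthly_cost(3_000, requests_per_day=1_000)
    print(f"per request: ${per_request:.2f}, per month: ${per_month:,.0f}")
```

Output tokens are left out on purpose; they are billed separately and do not grow with the conversation history.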

The constraint:

  • GPT-4 Turbo context window: 128k tokens
  • Claude 3: 200k tokens
  • But costs don't care about limits—you pay for every token
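Because every request resends the whole history, the total tokens billed over a session grow roughly quadratically with the number of turns when nothing is trimmed. Here is a rough sketch of that effect, assuming a flat 300 tokens per turn (an illustrative figure, not a measurement) and hypothetical function names:

```python
def tokens_per_request(turn: int, tokens_per_turn: int = 300) -> int:
    """Input tokens sent on a given turn when the full history is appended."""
    return turn * tokens_per_turn


def session_input_tokens(turns: int, tokens_per_turn: int = 300) -> int:
    """Total input tokens billed across a session of `turns` requests."""
    return sum(tokens_per_request(t, tokens_per_turn) for t in range(1, turns + 1))


def capped_session_input_tokens(turns: int, window: int,
                                tokens_per_turn: int = 300) -> int:
    """Same total when each request carries at most `window` turns."""
    return sum(min(t, window) * tokens_per_turn for t in range(1, turns + 1))


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, session_input_tokens(n), capped_session_input_tokens(n, window=5))
```

Doubling the session length roughly quadruples the uncapped total, while the capped total grows only linearly once the window is full.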

In the DeepAgent project, I implemented multi-turn conversation support with LangGraph. Without memory management, sessions with 20+ turns would cost $0.50+ per message.

Memory Strategy #1: Sliding Window

Keep only the last N messages. Simple, predictable, effective for short sessions.

[Figure: Sliding Window Memory]

Implementation:

import time
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str
    timestamp: float

class SlidingWindowMemory:
    def __init__(self, max_messages: int = 10):
        self.max_messages = max_messages
        self.messages: List[Message] = []
    
    def add_message(self, role: str, content: str):
        """Add message and maintain window size."""
        message = Message(role=role, content=content, timestamp=time.time())
        self.messages.append(message)
        
        # Drop the oldest messages once the window is full
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]
    
    def get_context(self) -> List[Dict[str, str]]:
        """Return messages formatted for LLM API."""
        return [
            {"role": msg.role, "content": msg.content}
            for msg in self.messages
        ]
    
    def estimate_tokens(self) -> int:
        """Rough token estimation."""
        total_chars = sum(len(msg.content) for msg in self.messages)
        return total_chars // 4  # ~4 chars per token

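# Usage in an agent step (in LangGraph, chat_node would be wrapped as a graph node)
from openai import OpenAI

llm_client = OpenAI()  # assumed client; reads OPENAI_API_KEY from the environment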

class AgentState:
    def __init__(self):
        self.memory = SlidingWindowMemory(max_messages=10)
        self.session_id: str = ""

def chat_node(state: AgentState, user_input: str):
    # Add user message
    state.memory.add_message("user", user_input)
    
    # Get sliding window context
    messages = state.memory.get_context()
    
    # Call LLM with limited context
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=messages
    )
    
    # Store assistant response
    state.memory.add_message("assistant", response.choices[0].message.content)
    
    return response

Trade-offs:

  • ✅ Simple to implement
  • ✅ Predictable token costs
  • ✅ Low latency (no retrieval overhead)
  • ❌ Loses older context
  • ❌ Can't reference earlier conversation facts
  • ❌ Breaks long-running sessions (>20 turns)
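A count-based window keeps the number of messages predictable, but token cost still varies with message length. One variation, sketched here under the same 4-characters-per-token estimate, trims by token budget instead. The function names and the plain-dictionary message format are assumptions for illustration.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)


def trim_to_token_budget(messages: List[Dict[str, str]],
                         max_tokens: int) -> List[Dict[str, str]]:
    """Keep the newest messages whose estimated tokens fit in max_tokens.

    The newest message is always kept, even if it alone exceeds the budget,
    so the model still sees the current user input.
    """
    kept: List[Dict[str, str]] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if kept and used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

For exact counts you would swap `estimate_tokens` for a real tokenizer; the trimming logic stays the same.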

When to use:

  • Short sessions (<10 turns)
  • Real-time latency requirements
  • Stateless conversation patterns
  • Cost-sensitive applications

Memory Strategy #2: Summarization

Once context reaches a threshold, summarize older messages and replace them with the summary.

[Figure: Summarization Memory]

Implementation:

class SummarizationMemory:
    def __init__(
        self,
        max_messages: int = 10,
        token_threshold: int = 3000,
        summary_model: str = "gpt-3.5-turbo"  # Cheaper for summaries
    ):
        self.max_messages = max_messages
        self.token_threshold = token_threshold
        self.summary_model = summary_model
        self.messages: List[Message] = []
        self.summary: str = ""
    
    def add_message(self, role: str, content: str):
        import time
        message = Message(role=role, content=content, timestamp=time.time())
        self.messages.append(message)
        
        # Check if we need to summarize
        if self.estimate_tokens() > self.token_threshold:
            self._summarize_older_messages()
    
    def _summarize_older_messages(self):
        """Summarize oldest half of messages."""
        if len(self.messages) < 4:
            return
        
        # Take oldest 50% for summarization
        to_summarize = self.messages[:len(self.messages)//2]
        to_keep = self.messages[len(self.messages)//2:]
        
        # Create summary prompt
        conversation_text = "\n".join([
            f"{msg.role}: {msg.content}" 
            for msg in to_summarize
        ])
        
        summary_prompt = f"""Summarize this conversation concisely, preserving key facts and context:

{conversation_text}

Summary:"""
        
        # Generate summary with cheaper model
        summary_response = llm_client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=200
        )
        
        new_summary = summary_response.choices[0].message.content
        
        # Append to the existing summary (kept flat, so prefixes don't nest on every pass)
        if self.summary:
            self.summary = f"{self.summary}\n{new_summary}"
        else:
            self.summary = new_summary
        
        # Replace summarized messages
        self.messages = to_keep
    
    def get_context(self) -> List[Dict[str, str]]:
        """Return summary + recent messages."""
        context = []
        
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}"
            })
        
        context.extend([
            {"role": msg.role, "content": msg.content}
            for msg in self.messages
        ])
        
        return context
    
    def estimate_tokens(self) -> int:
        summary_tokens = len(self.summary) // 4 if self.summary else 0
        message_tokens = sum(len(msg.content) // 4 for msg in self.messages)
        return summary_tokens + message_tokens

Cost analysis:

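  • Summarization call: ~2,000 tokens in (the prompt contains the messages being summarized), up to 200 tokens out
  • Cost per summary: at most $0.06 + $0.012 = $0.072 at the GPT-4 rates above, and far less with a cheaper summary model such as gpt-3.5-turbo
  • Savings: Reduces ~4,000 tokens of context to ~2,200 tokens (2,000 recent + summary), about 45%
  • Break-even: Each later request sends ~1,800 fewer tokens, saving ~$0.054, so even at GPT-4 rates the summary pays for itself within 2 requests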
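The break-even reasoning can be checked with a small calculator. This is a sketch with hypothetical function names; the token counts and prices in the example are illustrative assumptions (a summary prompt that contains about 2,000 tokens of history, a 200-token summary, and the GPT-4 rates quoted earlier as a worst case).

```python
import math


def summary_cost(tokens_in: int, tokens_out: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """One-off cost of generating a summary, in dollars."""
    return tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k


def saving_per_request(tokens_replaced: int, summary_tokens: int,
                       main_price_in_per_1k: float) -> float:
    """Input cost saved on each later request once the summary is in place."""
    return (tokens_replaced - summary_tokens) / 1000 * main_price_in_per_1k


def break_even_requests(cost: float, saving: float) -> int:
    """Smallest number of later requests after which the summary has paid off."""
    if saving <= 0:
        raise ValueError("the summary does not shrink the context")
    return math.ceil(cost / saving)


if __name__ == "__main__":
    cost = summary_cost(2_000, 200, price_in_per_1k=0.03, price_out_per_1k=0.06)
    saving = saving_per_request(2_000, 200, main_price_in_per_1k=0.03)
    print(f"cost ${cost:.3f}, saving ${saving:.3f}/request, "
          f"break-even after {break_even_requests(cost, saving)} request(s)")
```

With a cheaper summary model the one-off cost shrinks while the per-request saving stays the same, so the break-even point only moves earlier.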

Trade-offs:

  • ✅ Retains key facts from long conversations
  • ✅ Reduces token costs significantly
  • ✅ Works for medium-length sessions
  • ❌ Adds latency (an extra LLM call, made synchronously in the code above)
  • ❌ Summary can lose nuance
  • ❌ Additional cost for summary generation

When to use:

  • Medium sessions (10-30 turns)
  • Need to preserve key facts
  • Can tolerate ~500ms latency hit
  • Long-form conversations (interviews, consultations)
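To experiment with the summarization flow without paying for API calls, the summarizer can be injected as a plain function. The sketch below is not the class above; `RollingSummary`, `keep_recent`, and `batch` are hypothetical names, and it folds the previous summary into each new one so the summary stays a single flat string.

```python
from typing import Callable, Dict, List


class RollingSummary:
    """Keeps recent messages verbatim and folds older ones into one summary."""

    def __init__(self, summarize: Callable[[str], str],
                 keep_recent: int = 4, batch: int = 4):
        self.summarize = summarize      # any function: text -> shorter text
        self.keep_recent = keep_recent  # messages always kept verbatim
        self.batch = batch              # messages folded in per summary call
        self.summary = ""
        self.messages: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) >= self.keep_recent + self.batch:
            old = self.messages[:self.batch]
            self.messages = self.messages[self.batch:]
            text = "\n".join(f"{m['role']}: {m['content']}" for m in old)
            if self.summary:
                text = f"Summary so far: {self.summary}\n{text}"
            # One call re-summarizes the old summary plus the evicted batch
            self.summary = self.summarize(text)

    def context(self) -> List[Dict[str, str]]:
        head = []
        if self.summary:
            head.append({"role": "system",
                         "content": f"Previous conversation summary: {self.summary}"})
        return head + self.messages
```

In production, `summarize` would wrap the cheaper-model call; in tests, a stub such as `lambda text: text[:100]` is enough to exercise the eviction logic.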

Memory Strategy #3: Vectorized Memory (The Smart Approach)

Store conversation embeddings in a vector database. Retrieve only semantically relevant past messages for each turn.

[Figure: Vectorized Memory]

Architecture:

from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Dict, Tuple

class VectorizedMemory:
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        vector_db_client=None,  # Required: a Pinecone-style index exposing upsert() and query()
        top_k: int = 5
    ):
        self.embedding_model = SentenceTransformer(embedding_model)
        self.vector_db = vector_db_client
        self.top_k = top_k
        self.recent_messages: List[Message] = []  # Keep last 3 for recency
        self.max_recent = 3
    
    def add_message(self, role: str, content: str, session_id: str):
        """Store message in vector DB with embedding."""
        import time
        import uuid
        
        message = Message(role=role, content=content, timestamp=time.time())
        
        # Generate embedding
        embedding = self.embedding_model.encode(content)
        
        # Store in vector DB; the uuid suffix avoids ID collisions when two
        # messages are stored within the same millisecond
        message_id = f"{session_id}_{uuid.uuid4().hex}"
        self.vector_db.upsert(
            vectors=[{
                "id": message_id,
                "values": embedding.tolist(),
                "metadata": {
                    "role": role,
                    "content": content,
                    "timestamp": message.timestamp,
                    "session_id": session_id
                }
            }]
        )
        
        # Maintain recent messages
        self.recent_messages.append(message)
        if len(self.recent_messages) > self.max_recent:
            self.recent_messages = self.recent_messages[-self.max_recent:]
    
    def get_relevant_context(
        self,
        current_query: str,
        session_id: str
    ) -> List[Dict[str, str]]:
        """Retrieve semantically relevant past messages."""
        
        # Get query embedding
        query_embedding = self.embedding_model.encode(current_query)
        
        # Search vector DB (Pinecone only returns metadata when asked to)
        results = self.vector_db.query(
            vector=query_embedding.tolist(),
            top_k=self.top_k,
            filter={"session_id": session_id},  # Only this session
            include_metadata=True
        )
        
        # Format results, oldest first so the context reads chronologically
        matches = sorted(results.matches, key=lambda m: m.metadata["timestamp"])
        relevant_messages = [
            {"role": m.metadata["role"], "content": m.metadata["content"]}
            for m in matches
        ]
        
        # Recent messages always go last (recency bias)
        recent_context = [
            {"role": msg.role, "content": msg.content}
            for msg in self.recent_messages
        ]
        
        # Deduplicate: skip retrieved messages already in the recent window
        seen = {(msg["role"], msg["content"]) for msg in recent_context}
        older = []
        for msg in relevant_messages:
            key = (msg["role"], msg["content"])
            if key not in seen:
                seen.add(key)
                older.append(msg)
        
        return older + recent_context
    
    def estimate_tokens(self, context: List[Dict[str, str]]) -> int:
        """Estimate tokens for context."""
        total_chars = sum(len(msg["content"]) for msg in context)
        return total_chars // 4

# Usage with LangGraph
class LangGraphVectorMemory:
    """Vectorized memory for LangGraph state management."""
    
    def __init__(self, session_id: str, vector_db_client):
        self.session_id = session_id
        self.memory = VectorizedMemory(vector_db_client=vector_db_client)
        self.short_term = []  # Last 2 turns for immediate context
    
    def add_turn(self, user_input: str, assistant_response: str):
        """Add conversation turn to memory."""
        self.memory.add_message("user", user_input, self.session_id)
        self.memory.add_message("assistant", assistant_response, self.session_id)
        
        # Update short-term memory
        self.short_term.append({"role": "user", "content": user_input})
        self.short_term.append({"role": "assistant", "content": assistant_response})
        if len(self.short_term) > 4:  # Keep last 2 exchanges
            self.short_term = self.short_term[-4:]
    
    def get_context_for_query(self, query: str) -> List[Dict[str, str]]:
        """Get context combining short-term + relevant historical."""
        # Get semantically relevant historical messages
        relevant = self.memory.get_relevant_context(query, self.session_id)
        
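        # Older retrieved messages go first and the short-term turns last;
        # skip anything already present in short-term to avoid duplicates
        short_keys = {(m["role"], m["content"]) for m in self.short_term}
        older = [m for m in relevant if (m["role"], m["content"]) not in short_keys]
        
        return older + self.short_term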

Cost analysis:

  • Vector DB: ~$0.10/1M vectors (Pinecone) or free (Chroma local)
  • Embedding model: Free (local) or $0.0001/1K tokens (OpenAI)
  • Typical retrieval: Top 5 messages × 200 tokens = 1,000 tokens
  • Savings: 60-70% vs full conversation history

Trade-offs:

  • ✅ Most token-efficient for long sessions
  • ✅ Retrieves relevant context, ignores noise
  • ✅ Scales to 100+ turn sessions
  • ❌ Adds complexity (vector DB infrastructure)
  • ❌ Retrieval latency (~50-100ms)
  • ❌ Requires embedding model

When to use:

  • Long sessions (20+ turns)
  • Need to reference specific past facts
  • Domain-specific conversations (support, consultations)
  • When you have vector DB infrastructure
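If you don't have vector DB infrastructure yet, a brute-force in-memory index is enough to try the retrieval idea locally. The sketch below is an assumption-laden stand-in, not any vendor's API: `InMemoryIndex`, `Match`, and the `upsert`/`query` signatures are hypothetical and deliberately simpler than a real client. It ranks by cosine similarity and applies an exact-match metadata filter such as `session_id`.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Match:
    id: str
    score: float
    metadata: Dict[str, object]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Brute-force vector index for local experiments and tests."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[List[float], Dict[str, object]]] = {}

    def upsert(self, id: str, values: List[float],
               metadata: Dict[str, object]) -> None:
        self._rows[id] = (list(values), dict(metadata))

    def query(self, vector: List[float], top_k: int = 5,
              filter: Optional[Dict[str, object]] = None) -> List[Match]:
        wanted = filter or {}
        hits = [
            Match(id=row_id, score=cosine(vector, values), metadata=meta)
            for row_id, (values, meta) in self._rows.items()
            if all(meta.get(k) == v for k, v in wanted.items())
        ]
        hits.sort(key=lambda m: m.score, reverse=True)  # best match first
        return hits[:top_k]
```

A linear scan like this is fine for a few thousand messages; beyond that, an approximate-nearest-neighbor index is the usual next step.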

The Hybrid Approach: What Actually Works in Production

After testing all three in production, here's the winning combination:

[Figure: Hybrid Memory Architecture]

class HybridMemoryManager:
    """
    Production memory strategy:
    - Always keep the last 3 turns verbatim (immediate context)
    - Fold older turns into a running summary (medium-term context)
    - Store every message in the vector DB and retrieve by relevance (long-term context)
    """
    
    def __init__(self, session_id: str, vector_db_client):
        self.session_id = session_id
        
        # Short-term: Sliding window (always included)
        self.short_term = SlidingWindowMemory(max_messages=6)  # 3 turns
        
        # Medium-term: Summarization
        self.summary = ""
        self.messages_since_summary = 0
        
        # Long-term: Vectorized storage (needs a real vector DB client)
        self.vector_memory = VectorizedMemory(vector_db_client=vector_db_client)
    
    def add_message(self, role: str, content: str):
        # Always add to short-term
        self.short_term.add_message(role, content)
        
        # Add to vector DB for long-term retrieval
        self.vector_memory.add_message(role, content, self.session_id)
        
        self.messages_since_summary += 1
        
        # Trigger summarization every 4 messages (2 turns)
        if self.messages_since_summary >= 4 and len(self.short_term.messages) >= 6:
            self._update_summary()
    
    def _update_summary(self):
        """Summarize older messages."""
        # Get messages to summarize (oldest 2 turns)
        to_summarize = self.short_term.messages[:4]
        
        conversation = "\n".join([
            f"{msg.role}: {msg.content}" for msg in to_summarize
        ])
        
        summary_prompt = f"Summarize key facts from this conversation:\n\n{conversation}"
        
        # Use cheaper model for summary
        response = llm_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=150
        )
        
        new_summary = response.choices[0].message.content
        
        if self.summary:
            self.summary = f"{self.summary} | {new_summary}"
        else:
            self.summary = new_summary
        
        self.messages_since_summary = 0
    
    def get_context(self, current_query: str = None) -> List[Dict[str, str]]:
        """Build context from all three memory layers."""
        context = []
        
        # Layer 1: Summary (if exists)
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation context: {self.summary}"
            })
        
        # Layer 2: Vector retrieval (if query provided). Older messages go
        # before the recent turns so the newest user message stays last.
        if current_query:
            relevant = self.vector_memory.get_relevant_context(
                current_query, self.session_id
            )
            
            # Filter out messages already in short-term
            short_term_contents = {msg.content for msg in self.short_term.messages}
            for msg in relevant:
                if msg["content"] not in short_term_contents:
                    context.append(msg)
        
        # Layer 3: Short-term (recent 3 turns)
        context.extend(self.short_term.get_context())
        
        return context
    
    def estimate_cost_savings(self) -> Dict[str, float]:
        """Calculate cost vs naive approach."""
        # Assume a 50-message history at ~300 tokens per message
        naive_tokens = 50 * 300
        
        # Hybrid approach
        summary_tokens = len(self.summary) // 4 if self.summary else 0
        short_term_tokens = self.short_term.estimate_tokens()
        vector_retrieval_tokens = 5 * 200  # Top 5 × 200 tokens
        
        total_tokens = summary_tokens + short_term_tokens + vector_retrieval_tokens
        
        savings = naive_tokens - total_tokens
        cost_savings = (savings / 1000) * 0.03  # $0.03 per 1K tokens
        
        return {
            "naive_tokens": naive_tokens,
            "hybrid_tokens": total_tokens,
            "savings_tokens": savings,
            "cost_savings_per_request": cost_savings,
            "percent_reduction": (savings / naive_tokens) * 100
        }
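The layering is easier to test when context assembly is a pure function. The sketch below is one possible arrangement, not the class above: `build_context`, `est_tokens`, and the `max_tokens` budget are hypothetical, and token counts use the 4-characters-per-token estimate. It puts the summary first, retrieved older messages next, and the recent turns last so the newest user message is always the final entry.

```python
from typing import Dict, List

Message = Dict[str, str]


def est_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)


def build_context(summary: str, retrieved: List[Message], recent: List[Message],
                  max_tokens: int) -> List[Message]:
    """Assemble summary + retrieved + recent under an estimated token budget."""
    context: List[Message] = []
    used = sum(est_tokens(m["content"]) for m in recent)  # recent is always kept
    if summary:
        text = f"Previous conversation context: {summary}"
        context.append({"role": "system", "content": text})
        used += est_tokens(text)
    seen = {(m["role"], m["content"]) for m in recent}
    for msg in retrieved:  # assumed sorted by relevance, best first
        key = (msg["role"], msg["content"])
        cost = est_tokens(msg["content"])
        if key in seen or used + cost > max_tokens:
            continue  # skip duplicates and anything that would bust the budget
        seen.add(key)
        context.append(msg)
        used += cost
    return context + recent
```

Retrieved messages are the only layer dropped under pressure here; whether the summary should also be droppable is a design choice this sketch leaves out.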

Real Results from DeepAgent

Before (no memory management):

  • 20-turn session: ~6,000 tokens per request
  • Cost per request at that length: 6,000 × $0.03/1K = $0.18
  • At 500 such requests/day: $90/day = $2,700/month

After (hybrid approach):

  • 20-turn session: ~2,500 tokens per request
  • Cost per request at that length: 2,500 × $0.03/1K = $0.075
  • At 500 such requests/day: $37.50/day = $1,125/month
  • Savings: $1,575/month (58% reduction)
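The before/after figures follow directly from the token counts and the $0.03 per 1K input rate. A small sketch for rerunning the comparison with your own numbers; `compare_monthly` is a hypothetical helper, `volume_per_day` is whatever unit the 500/day figure counts, and a 30-day month is assumed.

```python
from typing import Dict


def compare_monthly(before_tokens: int, after_tokens: int, volume_per_day: int,
                    price_per_1k: float = 0.03, days: int = 30) -> Dict[str, float]:
    """Monthly input-token cost before and after, plus the saving."""
    before = before_tokens / 1000 * price_per_1k * volume_per_day * days
    after = after_tokens / 1000 * price_per_1k * volume_per_day * days
    return {
        "before": before,
        "after": after,
        "savings": before - after,
        "percent_reduction": (before - after) / before * 100 if before else 0.0,
    }


if __name__ == "__main__":
    result = compare_monthly(6_000, 2_500, volume_per_day=500)
    print({k: round(v, 2) for k, v in result.items()})
```

Note that the percentage depends only on the token counts, so it holds at any price or volume.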

Latency impact:

  • Vector retrieval: +80ms
  • Summary generation: +400ms (every 4 messages)
  • Net impact: roughly +180ms per message on average (80ms retrieval plus 400ms of summarization amortized over 4 messages), acceptable for chat

Decision Framework: Which Strategy to Use?

Criteria            | Sliding Window | Summarization | Vectorized | Hybrid
Session length      | < 10 turns     | 10-30 turns   | 20+ turns  | Any length
Latency requirement | < 200ms        | < 500ms       | < 300ms    | < 400ms
Cost sensitivity    | High           | Medium        | High       | High
Infrastructure      | Minimal        | Minimal       | Vector DB  | Vector DB
Context quality     | Low            | Medium        | High       | Highest

My recommendation:

  • Start with Sliding Window (simple, effective)
  • Add Summarization at month 2 (when sessions get longer)
  • Migrate to Hybrid at month 6 (when you have vector DB infrastructure)
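The decision table can also be encoded as a tiny function, which is handy as a default in configuration code. This is one possible reading of the table, not a rule: `choose_strategy` and its string labels are hypothetical, and only session length and vector DB availability are considered.

```python
def choose_strategy(expected_turns: int, has_vector_db: bool) -> str:
    """Pick a memory strategy from expected session length and infrastructure."""
    if expected_turns < 10:
        return "sliding_window"   # short sessions: simplest option
    if expected_turns >= 20 and has_vector_db:
        return "hybrid"           # long sessions with retrieval available
    return "summarization"        # medium sessions, or long ones without a vector DB


if __name__ == "__main__":
    for turns, has_db in [(5, False), (15, True), (40, True), (40, False)]:
        print(turns, has_db, choose_strategy(turns, has_db))
```

Latency and cost sensitivity are left out here; a real selector would likely weigh them too.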

Implementation with LangGraph

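LangGraph's built-in checkpointing handles per-thread persistence. Note that the checkpointer stores the full message history for a thread; it does not trim it, so you still apply one of the strategies above before calling the model: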

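from langgraph.graph import StateGraph, MessagesState, START
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import tools_condition
from langchain_core.messages import HumanMessage, AIMessage

# Configure memory
memory = MemorySaver()

# Build graph with checkpointing. MessagesState carries a "messages" list;
# agent_node and tool_node are assumed to be defined elsewhere.
workflow = StateGraph(MessagesState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", tools_condition)  # routes to "tools" or END
workflow.add_edge("tools", "agent")

# Add memory checkpointing
app = workflow.compile(checkpointer=memory)

# Each thread gets isolated memory
thread_id = "user_123_session_456"
config = {"configurable": {"thread_id": thread_id}}

# Run; the checkpointer saves this thread's state after each step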
result = app.invoke(
    {"messages": [HumanMessage(content="Hello")]},
    config=config
)

# Memory persists across invocations
result2 = app.invoke(
    {"messages": [HumanMessage(content="What did I ask earlier?")]},
    config=config  # Same thread_id = same memory
)

The "Don't" List

Learn from my mistakes:

  • Don't append every message forever (bankruptcy)
  • Don't use sliding window for long-form consultations (lose critical context)
  • Don't summarize with GPT-4 (use GPT-3.5 or local models)
  • Don't store embeddings without metadata filtering (retrieval quality degrades)
  • Don't forget to clear memory between sessions (privacy, cost)
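Clearing memory between sessions is simple to get wrong when it depends on users logging out. A minimal sketch of a per-session store with explicit clearing and idle-timeout eviction follows; `SessionMemoryStore`, the 30-minute default, and the injectable `clock` are assumptions for illustration. Vectors stored in a vector DB would need a matching delete by `session_id`, which is not shown because the call differs per database.

```python
import time
from typing import Callable, Dict, List, Tuple


class SessionMemoryStore:
    """In-process message store keyed by session, with idle-timeout eviction."""

    def __init__(self, ttl_seconds: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._sessions: Dict[str, Tuple[float, List[Dict[str, str]]]] = {}

    def append(self, session_id: str, role: str, content: str) -> None:
        self.evict_expired()
        _, messages = self._sessions.get(session_id, (0.0, []))
        messages.append({"role": role, "content": content})
        self._sessions[session_id] = (self._clock(), messages)  # refresh last-seen

    def get(self, session_id: str) -> List[Dict[str, str]]:
        self.evict_expired()
        return list(self._sessions.get(session_id, (0.0, []))[1])

    def clear(self, session_id: str) -> None:
        """Call on logout; safe to call for unknown sessions."""
        self._sessions.pop(session_id, None)

    def evict_expired(self) -> int:
        """Drop sessions idle for longer than the TTL; returns how many."""
        now = self._clock()
        expired = [sid for sid, (last_seen, _) in self._sessions.items()
                   if now - last_seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

Eviction runs lazily on each access here; a background sweep would be the alternative if sessions can sit untouched for long periods.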

Production Checklist

Before shipping multi-turn agents:

  • Memory strategy selected based on use case
  • Token usage monitored per session
  • Cost alerts set (> $0.10 per session)
  • Session memory cleared on logout/timeout
  • Vector DB indexed with session_id filter
  • Fallback to sliding window if vector DB fails
  • Embedding model cached (don't reload per request)
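Two of the checklist items, the sliding-window fallback and the per-session cost alert, fit in a few lines. This is a sketch under stated assumptions: `context_with_fallback` and `session_cost_exceeded` are hypothetical helpers, `retrieve` is any zero-argument function that returns retrieved messages or raises, and the $0.03 per 1K price and $0.10 limit are the figures used in this post.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("agent_memory")

Message = Dict[str, str]


def context_with_fallback(retrieve: Callable[[], List[Message]],
                          recent: List[Message],
                          max_recent: int = 6) -> List[Message]:
    """Use retrieved + recent messages; fall back to a sliding window on failure."""
    try:
        retrieved = retrieve()
    except Exception:
        # Vector DB is down or timed out: degrade to the recent window only
        logger.warning("vector retrieval failed; using sliding window", exc_info=True)
        return recent[-max_recent:]
    seen = {(m["role"], m["content"]) for m in recent}
    older = [m for m in retrieved if (m["role"], m["content"]) not in seen]
    return older + recent  # recent turns stay last


def session_cost_exceeded(total_input_tokens: int, price_per_1k: float = 0.03,
                          limit: float = 0.10) -> bool:
    """True once a session's estimated input cost passes the alert limit."""
    return total_input_tokens / 1000 * price_per_1k > limit
```

Catching a broad `Exception` is deliberate in this sketch, since client libraries raise different error types; narrowing it to your client's exceptions is the safer production choice.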

Next Steps

Memory management keeps costs down. But how do you know if your agent is actually good? In the next post, I'll cover evaluation frameworks—LLM-as-judge, deterministic checks, and A/B testing for generative AI.

Code examples: DeepAgent GitHub


Questions? Email me or connect on LinkedIn.