Prompt Engineering vs RAG vs Fine-Tuning: The Decision Framework I Actually Use

Published April 25, 2026 · 14 min read

The product manager asked for "an AI feature." Vague, exciting, and terrifying all at once.

I've sat in that meeting too many times. The team wants to ship fast, the PM wants magic, and you're staring at three paths: prompt engineering (cheap, fast), RAG (adds complexity), or fine-tuning (expensive, slow).

Choose wrong and you waste $50K and three months. This is the decision framework I've refined across the Document Extraction Pipeline, DeepAgent, and a half-dozen production systems.

The Decision Matrix

Start with this 2×2 matrix. It narrows your options in 30 seconds.

|                 | Static Knowledge   | Dynamic Knowledge |
|-----------------|--------------------|-------------------|
| Generic Output  | Prompt Engineering | RAG               |
| Specific Output | Fine-Tuning        | RAG + Fine-Tuning |

Definitions:

  • Static Knowledge: Fixed facts (product docs, legal clauses)
  • Dynamic Knowledge: Frequently updated (stock prices, user data)
  • Generic Output: Standard tone, common formats
  • Specific Output: Brand voice, specialized formatting, unique style
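
As a rough illustration, the matrix can be written as a plain lookup table. The enum names and the `recommend` function are hypothetical, introduced here only to make the four cells explicit; they are not part of any library.

```python
from enum import Enum


class Knowledge(Enum):
    STATIC = "static"    # Fixed facts (product docs, legal clauses)
    DYNAMIC = "dynamic"  # Frequently updated (stock prices, user data)


class Output(Enum):
    GENERIC = "generic"    # Standard tone, common formats
    SPECIFIC = "specific"  # Brand voice, specialized formatting


# The four cells of the 2x2 matrix, keyed by (output, knowledge).
DECISION_MATRIX = {
    (Output.GENERIC, Knowledge.STATIC): "Prompt Engineering",
    (Output.GENERIC, Knowledge.DYNAMIC): "RAG",
    (Output.SPECIFIC, Knowledge.STATIC): "Fine-Tuning",
    (Output.SPECIFIC, Knowledge.DYNAMIC): "RAG + Fine-Tuning",
}


def recommend(output: Output, knowledge: Knowledge) -> str:
    """Return the starting approach for one cell of the matrix."""
    return DECISION_MATRIX[(output, knowledge)]
```

Treat the result as a first guess only; the phases below still apply, and prompting remains the baseline in every cell.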

Phase 1: Prompt Engineering (Always Start Here)

Cost: $0.001–0.02 per 1K tokens
Time: Hours to days
Break-even: Immediate

You should always start here. It's the cheapest way to establish a baseline and understand your problem.

What You Can Achieve with Prompting

# Example: Structured extraction with just prompting
EXTRACTION_PROMPT = """Extract invoice information from the text below.

## Text
{text}

## Instructions
1. Identify the invoice number, date, total amount, and vendor
2. Return ONLY valid JSON in this exact format:
{{
    "invoice_number": "...",
    "date": "YYYY-MM-DD",
    "total_amount": 0.00,
    "vendor": "..."
}}

3. Use null for missing fields
4. Ensure date is ISO format (YYYY-MM-DD)
5. Amount should be a number, not string

## Output"""

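from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# invoice_text holds the raw invoice text to extract from
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode needs a model that supports response_format
    messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=invoice_text)}],
    response_format={"type": "json_object"}  # Enforce JSON output
)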

Advanced prompt engineering techniques:

  1. Chain-of-thought: Ask the model to reason step-by-step
  2. Few-shot examples: Include 2-3 examples in the prompt
  3. Structured outputs: Use JSON schema or function calling
  4. System prompts: Set behavior in the system message
  5. Prompt chaining: Break complex tasks into multiple calls
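
Two of these techniques, system prompts and few-shot examples, come down to how the message list is assembled. The sketch below shows one way to do it; `build_messages` and the sample invoices are made up for illustration, and the resulting list is what would be passed as `messages` to a chat completion call.

```python
from typing import Dict, List, Tuple


def build_messages(
    system_prompt: str,
    examples: List[Tuple[str, str]],
    user_input: str,
) -> List[Dict[str, str]]:
    """Build a chat message list: system prompt, few-shot pairs, then the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        # Each example is replayed as a user turn followed by the ideal assistant turn.
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    system_prompt="You extract invoice data and reply with JSON only.",
    examples=[
        ("Invoice A-1 from Acme, total 10.00",
         '{"invoice_number": "A-1", "vendor": "Acme", "total_amount": 10.0}'),
        ("Invoice B-2 from Globex, total 25.50",
         '{"invoice_number": "B-2", "vendor": "Globex", "total_amount": 25.5}'),
    ],
    user_input="Invoice C-3 from Initech, total 99.00",
)
```

Keeping the examples as separate turns, instead of pasting them into one long prompt string, makes it easy to add or drop examples while measuring their effect on accuracy.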

When to Move On

You know you've hit the prompt engineering ceiling when:

  • Accuracy plateaus (not improving with better prompts)
  • Latency is too high (prompt is 5K+ tokens)
  • Context window overflow (prompt + context > 100K tokens)
  • Consistency issues (outputs vary significantly)
  • Cost per request exceeds budget
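
The "accuracy plateaus" signal is easier to act on if it is computed instead of eyeballed. A minimal sketch, assuming you record one accuracy score per prompt iteration on a fixed evaluation set; the function name and the default `window` and `min_gain` values are arbitrary assumptions, not thresholds from the projects described here.

```python
from typing import Sequence


def accuracy_has_plateaued(
    history: Sequence[float],
    window: int = 3,
    min_gain: float = 0.01,
) -> bool:
    """Return True if the last `window` prompt iterations gained less than `min_gain` in total."""
    if len(history) <= window:
        return False  # Not enough iterations to judge yet
    return history[-1] - history[-1 - window] < min_gain
```

For example, `accuracy_has_plateaued([0.72, 0.80, 0.85, 0.85, 0.855, 0.852])` is true, because the last three iterations moved accuracy by only 0.002.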

From the Document Extraction Pipeline:

  • Started with basic prompting: 72% extraction accuracy
  • Advanced prompting (few-shot, schemas): 85% accuracy
  • Ceiling: Couldn't reach 95% without external knowledge
  • Decision: Move to RAG at month 2

Phase 2: RAG (Dynamic Knowledge)

Cost: 20% latency increase + vector DB costs ($0.10/1M vectors)
Time: 1–2 weeks
Break-even: When you need proprietary or frequently updated data

RAG (Retrieval-Augmented Generation) adds a knowledge base alongside the LLM: at request time, the most relevant passages are retrieved and injected into the prompt as context.

When RAG is the Right Choice

Use RAG when:

  • Knowledge changes frequently (product docs, policies)
  • You have proprietary data not in training corpus
  • You need source attribution ("according to doc X...")
  • You want to reduce hallucinations with grounded context
  • You need to control information access (user-specific docs)

Don't use RAG when:

  • Knowledge is static and fits in context window
  • You need specific tone/style (use fine-tuning)
  • You're doing code generation (use fine-tuned code models)
  • Retrieval adds unacceptable latency (>200ms)

RAG Architecture

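import os
from typing import List, Optional

from openai import OpenAI
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

class RAGSystem:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Assumes an existing index whose vectors store the chunk text in metadata["text"]
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        self.vector_db = pc.Index("document-embeddings")
    
    def retrieve(self, question: str, top_k: int = 5, filter: Optional[dict] = None) -> List[str]:
        """Return the text of the top_k chunks most similar to the question."""
        
        # 1. Embed the question
        query_embedding = self.embedding_model.encode(question)
        
        # 2. Retrieve relevant chunks (optionally restricted by a metadata filter)
        results = self.vector_db.query(
            vector=query_embedding.tolist(),
            top_k=top_k,
            include_metadata=True,
            filter=filter
        )
        return [match.metadata["text"] for match in results.matches]
    
    def query(self, question: str, top_k: int = 5) -> str:
        """Answer question using RAG."""
        
        # 3. Format context
        contexts = self.retrieve(question, top_k=top_k)
        context_str = "\n\n---\n\n".join(contexts)
        
        # 4. Generate with context
        prompt = f"""Answer the question using the provided context.
If the answer isn't in the context, say "I don't have that information."

## Context
{context_str}

## Question
{question}

## Answer"""
        
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content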

Chunking Strategy

The #1 mistake in RAG: wrong chunk size.

from typing import List

class DocumentChunker:
    """Chunk documents for optimal retrieval."""
    
    def __init__(
        self,
        chunk_size: int = 500,      # Words per chunk (a rough proxy for tokens)
        overlap: int = 100           # Words shared between consecutive chunks
    ):
        # Guard against a zero or negative step, which would loop forever
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap
    
    def chunk_text(self, text: str) -> List[dict]:
        """Create overlapping chunks."""
        
        # Simple word-based chunking (use tiktoken for token-based)
        words = text.split()
        chunks = []
        
        start = 0
        while start < len(words):
            end = min(start + self.chunk_size, len(words))
            chunk_words = words[start:end]
            
            chunks.append({
                "text": " ".join(chunk_words),
                "start_idx": start,
                "end_idx": end,
                "metadata": {
                    "chunk_index": len(chunks),
                    "total_chunks": None  # Set later
                }
            })
            
            if end == len(words):
                break  # Last chunk reached; avoid a redundant tail made only of overlap
            start += self.chunk_size - self.overlap
        
        # Update total
        for chunk in chunks:
            chunk["metadata"]["total_chunks"] = len(chunks)
        
        return chunks

# Optimal chunk sizes by use case:
CHUNK_CONFIGS = {
    "general_qa": {"size": 500, "overlap": 100},
    "legal_documents": {"size": 1000, "overlap": 200},
    "code": {"size": 300, "overlap": 50},
    "technical_docs": {"size": 400, "overlap": 80}
}

When RAG Wasn't Enough

Document Extraction Pipeline case study:

  • RAG improved accuracy to 89%
  • But specialized invoice formats needed specific extraction logic
  • Prompt + RAG reached 92%—still below 95% SLA
  • Decision: Fine-tuned schema-specific extractors at month 4

Phase 3: Fine-Tuning (The Nuclear Option)

Cost: $500–5,000 training + ongoing inference
Time: 2–4 weeks
Break-even: When you need specific tone/format or want to distil a big model

Fine-tuning adapts a base model to your specific task. It's powerful but expensive.

When Fine-Tuning is the Right Choice

Use fine-tuning when:

  • RAG + prompting can't achieve required accuracy
  • You need specific tone, style, or format consistency
  • You want to reduce latency (distil GPT-4 → smaller model)
  • You want to reduce cost (fine-tuned 7B model vs GPT-4)
  • You have 1,000+ high-quality training examples

Don't fine-tune when:

  • You have <500 training examples (use few-shot prompting)
  • Knowledge is dynamic (use RAG)
  • You need general capabilities (use base model)
  • Budget is constrained (training + serving costs)

Fine-Tuning Example

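# 1. Prepare training data
# training_data.jsonl holds one JSON object per line (wrapped here for readability).
# The assistant content is itself a JSON string, so its inner quotes must be escaped:
#
# {"messages": [
#     {"role": "system", "content": "You extract invoice data from text."},
#     {"role": "user", "content": "Invoice #12345 dated 2024-01-15..."},
#     {"role": "assistant", "content": "{\"invoice_number\": \"12345\", \"date\": \"2024-01-15\", \"total\": 1500.00}"}
# ]}

# 2. Upload and train (OpenAI example)
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Upload training file
with open("training_data.jsonl", "rb") as f:
    file = openai_client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = openai_client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo",
    suffix="invoice-extractor"
)

# 3. Use fine-tuned model once the job has finished
# (the actual model name is reported in the job's fine_tuned_model field on completion)
response = openai_client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:invoice-extractor:12345",  # Fine-tuned model
    messages=[
        {"role": "system", "content": "You extract invoice data from text."},  # Same as in training
        {"role": "user", "content": invoice_text}
    ]
)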

Cost Comparison: Fine-Tuned vs GPT-4

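| Metric                        | GPT-4 | Fine-Tuned 7B Model |
|-------------------------------|-------|---------------------|
| Training cost                 | $0    | $2,000–5,000        |
| Inference cost/1K tokens      | $0.03 | $0.002              |
| Latency                       | 2–5s  | 0.5–1s              |
| Accuracy (invoice extraction) | 92%   | 96%                 |
| Break-even volume             | N/A   | ~200K requests      |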

From Document Extraction Pipeline:

  • Fine-tuned on 5,000 labeled invoices
  • Training cost: $3,200
  • Inference cost reduction: 93% vs GPT-4
  • Accuracy improvement: 92% → 96%
  • Break-even: Month 3 at current volume
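
The break-even point is just the training cost divided by the per-request saving. A small sketch of that arithmetic follows; `break_even_requests` is a hypothetical helper, and the example assumes roughly 1K tokens per request so the per-1K prices from the table can stand in for per-request costs. Real request sizes and serving overhead will shift the result, which is presumably why the table quotes a more conservative figure.

```python
import math


def break_even_requests(
    training_cost: float,
    baseline_cost_per_request: float,
    tuned_cost_per_request: float,
) -> int:
    """Number of requests after which the fine-tuned model has paid for its training."""
    savings = baseline_cost_per_request - tuned_cost_per_request
    if savings <= 0:
        raise ValueError("fine-tuned model must be cheaper per request to break even")
    return math.ceil(training_cost / savings)


# Assumption: ~1K tokens per request, so $0.03 vs $0.002 per request.
requests_needed = break_even_requests(3200, 0.03, 0.002)
```

Dividing `requests_needed` by your monthly request volume gives the payback period in months.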

Distillation: Big → Small

Use GPT-4 to generate training data, then fine-tune a small model:

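from typing import List

class DistillationPipeline:
    """Distil GPT-4 into smaller model."""
    
    def generate_training_data(
        self,
        inputs: List[str],
        teacher_model: str = "gpt-4"
    ) -> List[dict]:
        """Generate high-quality training data from teacher model."""
        
        training_examples = []
        
        for input_text in inputs:
            # Get teacher response
            teacher_response = openai_client.chat.completions.create(
                model=teacher_model,
                messages=[{"role": "user", "content": input_text}],
                temperature=0.0  # Minimize sampling randomness
            )
            
            training_examples.append({
                "messages": [
                    {"role": "user", "content": input_text},
                    {"role": "assistant", "content": teacher_response.choices[0].message.content}
                ]
            })
        
        return training_examples
    
    def fine_tune_student(
        self,
        training_data: List[dict],
        student_base: str = "meta-llama/Llama-2-7b-hf"  # Transformers-format checkpoint
    ):
        """Fine-tune student model on teacher-generated data."""
        
        # Use HuggingFace or custom training pipeline
        from datasets import Dataset
        from transformers import (
            AutoModelForCausalLM, AutoTokenizer,
            DataCollatorForLanguageModeling, Trainer, TrainingArguments
        )
        
        tokenizer = AutoTokenizer.from_pretrained(student_base)
        tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
        model = AutoModelForCausalLM.from_pretrained(student_base)
        
        # Trainer needs tokenized examples, not raw message dicts.
        # Simplified formatting: prompt and teacher answer joined into one sequence.
        def tokenize(example):
            user, assistant = example["messages"]
            text = f"{user['content']}\n{assistant['content']}{tokenizer.eos_token}"
            return tokenizer(text, truncation=True, max_length=1024)
        
        dataset = Dataset.from_list(training_data).map(tokenize, remove_columns=["messages"])
        
        training_args = TrainingArguments(
            output_dir="./distilled_model",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            learning_rate=2e-5,
            warmup_steps=100
        )
        
        # Train...
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
        )
        
        trainer.train()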

The 10-Question Decision Framework

Use this flowchart to decide:

[Figure: Decision Flowchart]

The 10 questions:

  1. Is your knowledge static or dynamic?

    • Static → Consider fine-tuning
    • Dynamic → Use RAG
  2. Do you need specific tone/format/style?

    • Yes → Fine-tuning may be needed
    • No → Start with prompting
  3. How many training examples do you have?

    • <100 → Prompt engineering only
    • 100–500 → Few-shot prompting
    • 500–1000 → RAG + prompting
    • >1000 → Can consider fine-tuning

  4. What accuracy do you need?

    • <85% → Prompt engineering
    • 85–92% → RAG + prompting
    • >92% → Fine-tuning or hybrid

  5. Is latency critical?

    • Yes (<500ms) → Fine-tuned small model
    • No → GPT-4 + RAG
  6. Is cost per request critical?

    • Yes (<$0.01) → Fine-tuned model
    • No → Use best available model
  7. Do you need source attribution?

    • Yes → RAG
    • No → Any approach works
  8. How often does knowledge change?

    • Daily/weekly → RAG
    • Monthly/yearly → Fine-tuning viable
    • Never → Prompt engineering
  9. Do you have engineering resources for infrastructure?

    • Yes → RAG (vector DB)
    • Limited → Prompt engineering or fine-tune via API
  10. What's your timeline?

    • Days → Prompt engineering
    • Weeks → RAG
    • Months → Fine-tuning
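
The questions with numeric thresholds translate directly into code. The sketch below covers questions 3 and 4 only; the function names are hypothetical, and how the exact boundary values (500, 1000, 85%, 92%) are assigned is an assumption, since the ranges above overlap at their endpoints.

```python
def approach_by_examples(n_examples: int) -> str:
    """Question 3: how many training examples do you have?"""
    if n_examples < 100:
        return "Prompt engineering only"
    if n_examples < 500:
        return "Few-shot prompting"
    if n_examples <= 1000:
        return "RAG + prompting"
    return "Can consider fine-tuning"


def approach_by_accuracy(target_accuracy: float) -> str:
    """Question 4: what accuracy do you need? (expressed as a fraction, e.g. 0.95)"""
    if target_accuracy < 0.85:
        return "Prompt engineering"
    if target_accuracy <= 0.92:
        return "RAG + prompting"
    return "Fine-tuning or hybrid"


def first_pass(n_examples: int, target_accuracy: float) -> dict:
    """Collect the per-question answers side by side; resolving conflicts is left to you."""
    return {
        "by_examples": approach_by_examples(n_examples),
        "by_accuracy": approach_by_accuracy(target_accuracy),
    }
```

The answers will often disagree, for example a 95% target with only 200 labeled examples. That disagreement is useful: it tells you which constraint to fix first.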

Hybrid Approaches (What Actually Works)

The best systems combine all three:

Pattern 1: Fine-Tuned Base + RAG Context

class HybridSystem:
    """Fine-tuned model with RAG context injection."""
    
    def __init__(self):
        self.fine_tuned_model = "ft:gpt-3.5-turbo:custom:123"
        self.rag = RAGSystem()
    
    def generate(self, query: str) -> str:
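        # 1. Retrieve relevant context
        # (assumes RAGSystem exposes retrieve() returning a list of text chunks)
        contexts = self.rag.retrieve(query, top_k=3)
        context_str = "\n\n---\n\n".join(contexts)
        
        # 2. Format prompt with context
        prompt = f"""Answer using your training AND the following context:

## Context
{context_str}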

## Query
{query}

## Answer"""
        
        # 3. Use fine-tuned model (knows task + format)
        response = openai_client.chat.completions.create(
            model=self.fine_tuned_model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content

When to use:

  • Need specific format/style (fine-tuning)
  • Knowledge is dynamic (RAG)
  • DeepAgent uses this pattern for skill-specific responses

Pattern 2: Prompt Templates + Vector Retrieval

class TemplateRAGSystem:
    """Prompt templates with dynamic context injection."""
    
    def __init__(self):
        self.templates = {
            "support": "You are a helpful support agent...",
            "sales": "You are a sales consultant...",
            "technical": "You are a technical expert..."
        }
        self.rag = RAGSystem()
    
    def generate(self, query: str, persona: str) -> str:
        # 1. Select template
        system_prompt = self.templates.get(persona, self.templates["support"])
        
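        # 2. Retrieve persona-specific context
        # (assumes RAGSystem.retrieve() accepts a metadata filter and returns a list of text chunks)
        contexts = self.rag.retrieve(
            query, 
            top_k=5,
            filter={"category": persona}
        )
        context_str = "\n\n---\n\n".join(contexts)
        
        # 3. Generate with template + context
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion: {query}"}
        ]
        
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=messages
        )
        
        # Return the generated text, matching the declared str return type
        return response.choices[0].message.content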

When to use:

  • Multiple personas/use cases
  • Shared knowledge base
  • Need consistent tone per persona

Pattern 3: Skills-Based (DeepAgent Pattern)

class SkillsBasedSystem:
    """DeepAgent-style skill routing."""
    
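    def __init__(self, rag_system, fine_tuned_model):
        # Dependencies are passed in explicitly.
        # WebSearchSkill and CalculatorSkill are placeholders not defined in this sketch.
        self.skills = {
            "web_search": WebSearchSkill(),
            "calculator": CalculatorSkill(),
            "document_qa": DocumentQASkill(rag_system),
            "creative_writing": CreativeWritingSkill(fine_tuned_model)
        }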
    
    def route_and_execute(self, query: str) -> str:
        # 1. Classify intent
        intent = self.classify_intent(query)
        
        # 2. Route to appropriate skill
        skill = self.skills.get(intent)
        
        if not skill:
            # Fallback to general prompt
            return self.general_prompt(query)
        
        # 3. Execute with skill-specific approach
        return skill.execute(query)

class DocumentQASkill:
    """Uses RAG for document Q&A."""
    
    def __init__(self, rag_system):
        self.rag = rag_system
    
    def execute(self, query: str) -> str:
        contexts = self.rag.retrieve(query)
        # ... generate with context

class CreativeWritingSkill:
    """Uses fine-tuned model for creative tasks."""
    
    def __init__(self, fine_tuned_model):
        self.model = fine_tuned_model
    
    def execute(self, query: str) -> str:
        # Use fine-tuned model
        return generate_with_model(self.model, query)

When to use:

  • Multiple distinct tasks
  • Different approaches optimal for each
  • Complex agent systems

ROI Analysis by Approach

Cost Per Request

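| Approach         | Input Tokens | Output Tokens | Cost/Request | Monthly (10K req) |
|------------------|--------------|---------------|--------------|-------------------|
| Basic Prompt     | 500          | 300           | $0.033       | $330              |
| Advanced Prompt  | 1500         | 300           | $0.063       | $630              |
| RAG (avg)        | 2000         | 300           | $0.078       | $780              |
| Fine-tuned (7B)  | 500          | 300           | $0.003       | $30               |
| Hybrid (FT+RAG)  | 2000         | 300           | $0.063       | $630              |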
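
The first three rows are consistent with a price of $0.03 per 1K input tokens and $0.06 per 1K output tokens. That pricing is inferred from the table's arithmetic and is not stated anywhere in this post, so treat it as an assumption and substitute your provider's current rates. A minimal sketch of the calculation, with hypothetical helper names:

```python
# Assumed prices in USD per 1K tokens, inferred from the table above.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06


def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float = INPUT_PRICE_PER_1K,
    output_price_per_1k: float = OUTPUT_PRICE_PER_1K,
) -> float:
    """Cost of one request given token counts and per-1K-token prices."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


def monthly_cost(per_request: float, requests_per_month: int = 10_000) -> float:
    """Monthly spend at a given request volume."""
    return per_request * requests_per_month


basic = cost_per_request(500, 300)      # Basic Prompt row
advanced = cost_per_request(1500, 300)  # Advanced Prompt row
rag = cost_per_request(2000, 300)       # RAG (avg) row
```

Input tokens dominate as context grows, which is why the RAG row costs more than twice the basic prompt even though the output length is unchanged.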

Accuracy vs Cost Trade-off

[Figure: Cost vs Accuracy Trade-off]

Sweet spots:

  • Budget constrained: Fine-tuned 7B model (high accuracy, low cost)
  • Accuracy critical: GPT-4 + RAG (best results, higher cost)
  • Speed critical: Fine-tuned small model (<1s latency)

Real Case Studies

Case 1: Document Extraction Pipeline

Evolution:

  1. Month 1: Basic prompting → 72% accuracy
  2. Month 2: Advanced prompting (few-shot, schemas) → 85% accuracy
  3. Month 3: RAG (retrieve similar extractions) → 89% accuracy
  4. Month 4: Fine-tuned schema extractors → 96% accuracy

Final architecture:

  • Fine-tuned model for extraction logic
  • RAG for similar-document context
  • Total cost: $0.005/request (vs $0.03 for GPT-4)

Case 2: DeepAgent

Evolution:

  1. Month 1: System prompts only → Functional but generic
  2. Month 2: RAG for skill documentation → Better context
  3. Month 3: Fine-tuned reasoning model → Better reasoning chains

Final architecture:

  • Skills-based routing
  • RAG for dynamic knowledge
  • Fine-tuned models for specific skills
  • SSE streaming for real-time responses

Case 3: Customer Support Chatbot (Consulting Project)

Decision:

  • Started with RAG (product docs change frequently)
  • Added fine-tuning for tone consistency
  • Kept GPT-4 for complex escalations

Results:

  • 87% resolution rate without human intervention
  • $0.08/request average cost
  • 3-month payback period

The "Don't" List

  • Don't fine-tune for static knowledge (use RAG or prompt)
  • Don't use RAG for code generation (use fine-tuned code models)
  • Don't start with fine-tuning (expensive experiment)
  • Don't over-engineer early (start simple, measure, iterate)
  • Don't ignore latency (RAG adds 100-200ms)
  • Don't skip evaluation (know when to upgrade)

Production Decision Checklist

Before choosing your approach:

  • Measured baseline with prompt engineering
  • Defined accuracy requirements (target number)
  • Calculated cost budget per request
  • Assessed latency requirements
  • Estimated training data availability
  • Evaluated infrastructure resources
  • Considered knowledge update frequency
  • Planned evaluation framework
  • Documented rollback strategy
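
For the "planned evaluation framework" item, even a very small harness is enough to compare approaches on the same labeled set. The sketch below computes field-level exact-match accuracy for an extraction task; `field_accuracy` and the sample records are hypothetical, and exact match is only one possible metric (amounts or dates may need normalizing before comparison).

```python
from typing import Any, Dict, List


def field_accuracy(
    predictions: List[Dict[str, Any]],
    labels: List[Dict[str, Any]],
    fields: List[str],
) -> float:
    """Fraction of (document, field) pairs where the prediction exactly matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    total = len(labels) * len(fields)
    if total == 0:
        return 0.0
    correct = sum(
        pred.get(field) == gold.get(field)
        for pred, gold in zip(predictions, labels)
        for field in fields
    )
    return correct / total


labels = [
    {"invoice_number": "12345", "total_amount": 1500.0},
    {"invoice_number": "98", "total_amount": None},
]
predictions = [
    {"invoice_number": "12345", "total_amount": 1500.0},
    {"invoice_number": "99", "total_amount": None},
]
score = field_accuracy(predictions, labels, ["invoice_number", "total_amount"])
```

Here three of the four fields match, so `score` is 0.75. Running the same function against the prompting, RAG, and fine-tuned variants gives comparable numbers for the accuracy targets discussed above.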

Summary: The Framework

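| Phase | Approach           | Time      | Cost   | When to Move On                             |
|-------|--------------------|-----------|--------|---------------------------------------------|
| 1     | Prompt Engineering | Days      | Low    | Accuracy plateau, high latency              |
| 2     | RAG                | 1–2 weeks | Medium | Still need higher accuracy or style control |
| 3     | Fine-Tuning        | 2–4 weeks | High   | Maximize accuracy, minimize latency/cost    |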

My rule: Start with prompting, add RAG at month 2 if needed, fine-tune at month 6 if still necessary.

Questions to Ask Yourself

  1. What's my accuracy target? (Quantify it)
  2. What's my cost per request budget? (Cents matter at scale)
  3. How often does my knowledge change? (Daily = RAG, Yearly = Fine-tuning)
  4. Do I have 1000+ training examples? (Required for fine-tuning)
  5. Is latency critical? (<500ms = Fine-tuned small model)
  6. Do I need source attribution? (Yes = RAG)

Still unsure? Email me your specific scenario—I'll help you decide.


Related Resources

If you're building production LLM systems, these resources complement the decision framework above and cover the implementation details that sit beneath each technique.

Evaluation & Cost Tracking

  • Langfuse — Open-source LLM observability with tracing, scoring, and cost attribution.
  • Weights & Biases — Experiment tracking for fine-tuning runs and prompt versioning.

Questions? Email me or connect on LinkedIn.