Prompt Engineering vs RAG vs Fine-Tuning: The Decision Framework I Actually Use

Published April 25, 2026 · 14 min read

The product manager asked for "an AI feature." Vague, exciting, and terrifying all at once.

I've sat in that meeting too many times. The team wants to ship fast, the PM wants magic, and you're staring at three paths: prompt engineering (cheap, fast), RAG (adds complexity), or fine-tuning (expensive, slow).

Choose wrong and you waste $50K and three months. This is the decision framework I've refined across the Document Extraction Pipeline, DeepAgent, and a half-dozen production systems.

The Decision Matrix

Start with this 2×2 matrix. It narrows your options in 30 seconds.

	Static Knowledge	Dynamic Knowledge
Generic Output	Prompt Engineering	RAG
Specific Output	Fine-Tuning	RAG + Fine-Tuning

Definitions:

Static Knowledge: Facts that rarely change (e.g. legal clauses)
Dynamic Knowledge: Frequently updated (stock prices, user data)
Generic Output: Standard tone, common formats
Specific Output: Brand voice, specialized formatting, unique style
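
In code, the matrix is a two-key lookup. Here's a minimal sketch; the function name `recommend_approach` and the lowercase string labels are illustrative choices of mine, not part of any library.

```python
# Illustrative lookup for the 2x2 matrix above; all names are hypothetical.
MATRIX = {
    ("static", "generic"): "Prompt Engineering",
    ("dynamic", "generic"): "RAG",
    ("static", "specific"): "Fine-Tuning",
    ("dynamic", "specific"): "RAG + Fine-Tuning",
}


def recommend_approach(knowledge: str, output: str) -> str:
    """Return the matrix cell for a knowledge type and an output type."""
    key = (knowledge.lower(), output.lower())
    if key not in MATRIX:
        raise ValueError(f"Unknown combination: {key}")
    return MATRIX[key]


print(recommend_approach("dynamic", "specific"))
```

Treat the result as a starting hypothesis. The phases below still apply, so even a "Fine-Tuning" cell starts with a prompting baseline.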

Phase 1: Prompt Engineering (Always Start Here)

Cost: $0.001–0.02 per 1K tokens
Time: Hours to days
Break-even: Immediate

You should always start here. It's the cheapest way to establish a baseline and understand your problem.

What You Can Achieve with Prompting

# Example: Structured extraction with just prompting
EXTRACTION_PROMPT = """Extract invoice information from the text below.

## Text
{text}

## Instructions
1. Identify the invoice number, date, total amount, and vendor
2. Return ONLY valid JSON in this exact format:
{{
    "invoice_number": "...",
    "date": "YYYY-MM-DD",
    "total_amount": 0.00,
    "vendor": "..."
}}

3. Use null for missing fields
4. Ensure date is ISO format (YYYY-MM-DD)
5. Amount should be a number, not string

## Output"""

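# Assumes the OpenAI Python SDK v1+; the client reads OPENAI_API_KEY from the environment
from openai import OpenAI

openai_client = OpenAI()

response = openai_client.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode needs gpt-4-turbo or newer; the original gpt-4 rejects response_format
    messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=invoice_text)}],
    response_format={"type": "json_object"}  # Guarantees syntactically valid JSON, not schema conformance
)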

Advanced prompt engineering techniques:

Chain-of-thought: Ask the model to reason step-by-step
Few-shot examples: Include 2-3 examples in the prompt
Structured outputs: Use JSON schema or function calling
System prompts: Set behavior in the system message
Prompt chaining: Break complex tasks into multiple calls
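
Few-shot examples and system prompts combine naturally in the chat message format. Here's a minimal sketch of a message builder; the helper name `build_few_shot_messages` and the sample invoices are made up for illustration.

```python
import json
from typing import Dict, List, Tuple


def build_few_shot_messages(
    system_prompt: str,
    examples: List[Tuple[str, dict]],
    new_input: str,
) -> List[Dict[str, str]]:
    """Build a chat message list: system prompt, worked examples, then the new input."""
    messages = [{"role": "system", "content": system_prompt}]
    for source_text, expected in examples:
        # Each example is a user turn followed by the ideal assistant reply
        messages.append({"role": "user", "content": source_text})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": new_input})
    return messages


# Illustrative sample data, not taken from a real dataset
messages = build_few_shot_messages(
    system_prompt="You extract invoice data from text and reply with JSON only.",
    examples=[
        (
            "Invoice #12345 dated 2024-01-15, total $1,500.00, from Acme Corp",
            {"invoice_number": "12345", "date": "2024-01-15",
             "total_amount": 1500.00, "vendor": "Acme Corp"},
        ),
    ],
    new_input="Invoice #777 dated 2024-03-02, total $89.90, from Globex",
)
print(len(messages))
```

The resulting list can be passed as the `messages` argument of a chat completion call. Each example adds tokens to every request, which is part of why long prompts eventually hit the ceiling described in the next section.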

When to Move On

You know you've hit the prompt engineering ceiling when:

Accuracy plateaus (not improving with better prompts)
Latency is too high (prompt is 5K+ tokens)
Context window overflow (prompt + context > 100K tokens)
Consistency issues (outputs vary significantly)
Cost per request exceeds budget

From the Document Extraction Pipeline:

Started with basic prompting: 72% extraction accuracy
Advanced prompting (few-shot, schemas): 85% accuracy
Ceiling: Couldn't reach 95% without external knowledge
Decision: Move to RAG at month 2
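
Numbers like these only compare across stages if accuracy is measured the same way every time. I haven't spelled out a scoring method above, so as an assumption this sketch uses exact-match accuracy per field over a labeled set. The function name and sample records are illustrative.

```python
from typing import Dict, List


def field_accuracy(predictions: List[Dict], gold: List[Dict], fields: List[str]) -> float:
    """Fraction of (document, field) pairs where the prediction equals the label exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    total = len(gold) * len(fields)
    if total == 0:
        return 0.0
    correct = sum(
        1
        for pred, truth in zip(predictions, gold)
        for field in fields
        if pred.get(field) == truth.get(field)
    )
    return correct / total


FIELDS = ["invoice_number", "date", "total_amount", "vendor"]

# Illustrative labels and model outputs; the second prediction has a wrong date
gold = [
    {"invoice_number": "12345", "date": "2024-01-15", "total_amount": 1500.0, "vendor": "Acme Corp"},
    {"invoice_number": "777", "date": "2024-03-02", "total_amount": 89.9, "vendor": "Globex"},
]
predictions = [
    {"invoice_number": "12345", "date": "2024-01-15", "total_amount": 1500.0, "vendor": "Acme Corp"},
    {"invoice_number": "777", "date": "2024-02-03", "total_amount": 89.9, "vendor": "Globex"},
]

accuracy = field_accuracy(predictions, gold, FIELDS)
print(f"{accuracy:.1%}")
```

Rerunning the same labeled set after each prompt change is what makes a plateau visible.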

Phase 2: RAG (Dynamic Knowledge)

Cost: ~20% latency increase + vector DB costs (~$0.10/1M vectors)
Time: 1–2 weeks
Break-even: When you need proprietary or frequently updated data

RAG (Retrieval-Augmented Generation) adds a knowledge base that the LLM can query.

When RAG is the Right Choice

✅ Use RAG when:

Knowledge changes frequently (product docs, policies)
You have proprietary data not in training corpus
You need source attribution ("according to doc X...")
You want to reduce hallucinations with grounded context
You need to control information access (user-specific docs)

❌ Don't use RAG when:

Knowledge is static and fits in context window
You need specific tone/style (use fine-tuning)
You're doing code generation (use fine-tuned code models)
Retrieval adds unacceptable latency (>200ms)

RAG Architecture

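import os

from openai import OpenAI
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

class RAGSystem:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Pinecone client v3+ style; assumes the index already exists with dimension 384
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        self.vector_db = pc.Index("document-embeddings")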
    
    def retrieve(self, question: str, top_k: int = 5, filter: dict = None) -> list:
        """Return the text of the top_k chunks most similar to the question."""
        
        # 1. Embed the question
        query_embedding = self.embedding_model.encode(question)
        
        # 2. Retrieve relevant chunks (filter is an optional metadata filter)
        results = self.vector_db.query(
            vector=query_embedding.tolist(),
            top_k=top_k,
            include_metadata=True,
            filter=filter
        )
        return [match.metadata["text"] for match in results.matches]
    
    def query(self, question: str, top_k: int = 5) -> str:
        """Answer question using RAG."""
        
        # 3. Format context
        context_str = "\n\n---\n\n".join(self.retrieve(question, top_k))
        
        # 4. Generate with context
        prompt = f"""Answer the question using the provided context.
If the answer isn't in the context, say "I don't have that information."

## Context
{context_str}

## Question
{question}

## Answer"""
        
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content
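
If the vector DB feels like a black box, it helps to see what the retrieval step does conceptually: rank stored chunks by similarity to the query embedding. Here's a dependency-free sketch with toy 3-dimensional vectors. Real systems use hundreds of dimensions and approximate nearest-neighbor indexes, so this is for intuition only, and all names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Brute-force search: score every chunk, return the k most similar texts."""
    scored = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Toy embeddings, hand-written for the example
chunks = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("returns window", [0.9, 0.1, 0.0]),
]
results = top_k_chunks([1.0, 0.05, 0.0], chunks, k=2)
print(results)
```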

Chunking Strategy

The #1 mistake in RAG: wrong chunk size.

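from typing import List

class DocumentChunker:
    """Chunk documents for optimal retrieval."""
    
    def __init__(
        self,
        chunk_size: int = 500,      # Words per chunk (a rough proxy for tokens)
        overlap: int = 100           # Words shared between consecutive chunks
    ):
        # overlap >= chunk_size would stop the window from advancing (infinite loop)
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap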
    
    def chunk_text(self, text: str) -> List[dict]:
        """Create overlapping chunks."""
        
        # Simple word-based chunking (use tiktoken for token-based)
        words = text.split()
        chunks = []
        
        start = 0
        while start < len(words):
            end = min(start + self.chunk_size, len(words))
            chunk_words = words[start:end]
            
            chunks.append({
                "text": " ".join(chunk_words),
                "start_idx": start,
                "end_idx": end,
                "metadata": {
                    "chunk_index": len(chunks),
                    "total_chunks": None  # Set later
                }
            })
            
            if end == len(words):
                break  # Final chunk emitted; stepping again would only repeat its tail
            start += self.chunk_size - self.overlap
        
        # Update total
        for chunk in chunks:
            chunk["metadata"]["total_chunks"] = len(chunks)
        
        return chunks

# Starting-point chunk sizes by use case (tune against your own retrieval evals):
CHUNK_CONFIGS = {
    "general_qa": {"size": 500, "overlap": 100},
    "legal_documents": {"size": 1000, "overlap": 200},
    "code": {"size": 300, "overlap": 50},
    "technical_docs": {"size": 400, "overlap": 80}
}
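
The chunker above counts words, while model limits are counted in tokens. Here's a minimal sketch of a token-based variant that accepts any encode/decode pair. The function name and the whitespace stand-in tokenizer are my own; with tiktoken you would pass the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")` instead.

```python
from typing import Callable, List


def chunk_by_tokens(
    text: str,
    encode: Callable[[str], List],
    decode: Callable[[List], str],
    chunk_size: int = 500,
    overlap: int = 100,
) -> List[str]:
    """Split text into overlapping chunks measured in tokens rather than words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(decode(tokens[start:end]))
        if end == len(tokens):
            break  # Last chunk reached; avoid emitting a redundant tail
        start += chunk_size - overlap
    return chunks


# Stand-in tokenizer so the example runs without extra dependencies
def whitespace_encode(s: str) -> List[str]:
    return s.split()


def whitespace_decode(tokens: List[str]) -> str:
    return " ".join(tokens)


sample = " ".join(f"w{i}" for i in range(10))
token_chunks = chunk_by_tokens(sample, whitespace_encode, whitespace_decode, chunk_size=4, overlap=1)
print(token_chunks)
```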

When RAG Wasn't Enough

Document Extraction Pipeline case study:

RAG improved accuracy to 89%
But specialized invoice formats needed specific extraction logic
Prompt + RAG reached 92%—still below 95% SLA
Decision: Fine-tuned schema-specific extractors at month 4

Phase 3: Fine-Tuning (The Nuclear Option)

Cost: $500–5,000 training + ongoing inference
Time: 2–4 weeks
Break-even: When you need specific tone/format or want to distil a big model

Fine-tuning adapts a base model to your specific task. It's powerful but expensive.

When Fine-Tuning is the Right Choice

✅ Use fine-tuning when:

RAG + prompting can't achieve required accuracy
You need specific tone, style, or format consistency
You want to reduce latency (distil GPT-4 → smaller model)
You want to reduce cost (fine-tuned 7B model vs GPT-4)
You have 1,000+ high-quality training examples

❌ Don't fine-tune when:

You have <500 training examples (use few-shot prompting)
Knowledge is dynamic (use RAG)
You need general capabilities (use base model)
Budget is constrained (training + serving costs)

Fine-Tuning Example

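# 1. Prepare training data
# training_data.jsonl: one complete JSON object per line, with inner quotes escaped
{"messages": [{"role": "system", "content": "You extract invoice data from text."}, {"role": "user", "content": "Invoice #12345 dated 2024-01-15..."}, {"role": "assistant", "content": "{\"invoice_number\": \"12345\", \"date\": \"2024-01-15\", \"total\": 1500.00}"}]}

# 2. Upload and train (OpenAI example)
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Upload training file
with open("training_data.jsonl", "rb") as f:
    file = openai_client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job (runs asynchronously; poll it until it reports success)
job = openai_client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo",
    suffix="invoice-extractor"
)

# 3. Use fine-tuned model
# The ID below is a placeholder; retrieve the finished job and read its fine_tuned_model field
response = openai_client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:invoice-extractor:12345",  # Fine-tuned model
    messages=[
        {"role": "system", "content": "You extract invoice data from text."},  # Same system prompt as training
        {"role": "user", "content": invoice_text}
    ]
)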

Cost Comparison: Fine-Tuned vs GPT-4

Metric	GPT-4	Fine-Tuned 7B Model
Training cost	$0	$2,000–5,000
Inference cost/1K tokens	$0.03	$0.002
Latency	2–5s	0.5–1s
Accuracy (invoice extraction)	92%	96%
Break-even volume	N/A	~200K requests

From Document Extraction Pipeline:

Fine-tuned on 5,000 labeled invoices
Training cost: $3,200
Inference cost reduction: 93% vs GPT-4
Accuracy improvement: 92% → 96%
Break-even: Month 3 at current volume
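
The break-even arithmetic is worth doing explicitly before committing to training. Here's a minimal sketch, assuming the per-request costs from the ROI table later in this post ($0.033 for a basic GPT-4 prompt, $0.003 for the fine-tuned 7B model) and the $3,200 training cost above. The function name is illustrative.

```python
import math


def break_even_requests(training_cost: float, baseline_cost: float, tuned_cost: float) -> int:
    """Requests needed before per-request savings repay the one-off training cost."""
    savings = baseline_cost - tuned_cost
    if savings <= 0:
        raise ValueError("the fine-tuned model must be cheaper per request to break even")
    return math.ceil(training_cost / savings)


requests_needed = break_even_requests(training_cost=3200, baseline_cost=0.033, tuned_cost=0.003)
print(requests_needed)
```

Under those assumptions the training cost is repaid after roughly 107K requests. That is about 11 months at 10K requests per month; recovering it within three months would take around 36K requests per month. Serving infrastructure for a self-hosted model is not included here and would push the break-even point further out.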

Distillation: Big → Small

Use GPT-4 to generate training data, then fine-tune a small model:

from typing import List

class DistillationPipeline:
    """Distil GPT-4 into smaller model."""
    
    def generate_training_data(
        self,
        inputs: List[str],
        teacher_model: str = "gpt-4"
    ) -> List[dict]:
        """Generate high-quality training data from teacher model."""
        
        training_examples = []
        
        for input_text in inputs:
            # Get teacher response
            teacher_response = openai_client.chat.completions.create(
                model=teacher_model,
                messages=[{"role": "user", "content": input_text}],
                temperature=0.0  # Greedy decoding; reduces, but does not fully remove, run-to-run variation
            )
            
            training_examples.append({
                "messages": [
                    {"role": "user", "content": input_text},
                    {"role": "assistant", "content": teacher_response.choices[0].message.content}
                ]
            })
        
        return training_examples
    
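    def fine_tune_student(
        self,
        training_data: List[dict],
        student_base: str = "meta-llama/Llama-2-7b-hf"  # Transformers-format checkpoint
    ):
        """Fine-tune student model on teacher-generated data."""
        
        # Use HuggingFace or custom training pipeline
        from datasets import Dataset
        from transformers import (
            AutoModelForCausalLM, AutoTokenizer,
            DataCollatorForLanguageModeling, Trainer, TrainingArguments
        )
        
        tokenizer = AutoTokenizer.from_pretrained(student_base)
        tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
        model = AutoModelForCausalLM.from_pretrained(student_base)
        
        # Flatten each (user, assistant) pair into one training string, then tokenize
        def to_features(example):
            user, assistant = example["messages"]
            text = f"{user['content']}\n\n{assistant['content']}{tokenizer.eos_token}"
            return tokenizer(text, truncation=True, max_length=2048)
        
        dataset = Dataset.from_list(training_data).map(
            to_features, remove_columns=["messages"]
        )
        
        # Full fine-tuning of a 7B model is memory-hungry; LoRA/PEFT is a common alternative
        training_args = TrainingArguments(
            output_dir="./distilled_model",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            learning_rate=2e-5,
            warmup_steps=100
        )
        
        # Train...
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            # mlm=False gives causal-LM labels copied from the input ids
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
        )
        
        trainer.train()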
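
Between generating teacher data and training on it, the examples usually need to be written to disk in the JSONL layout shown in the fine-tuning example. Here's a minimal sketch of that step. The helper name `write_jsonl` is hypothetical, and skipping blank teacher replies is my own assumption about basic data hygiene.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_jsonl(examples: List[Dict], path: str) -> int:
    """Write one JSON object per line, skipping examples with an empty assistant reply."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            reply = example["messages"][-1]["content"]
            if not reply or not reply.strip():
                continue  # A blank teacher output gives the student nothing to learn
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
            written += 1
    return written


# Illustrative teacher outputs; the second one is blank and should be dropped
examples = [
    {"messages": [
        {"role": "user", "content": "Invoice #12345 dated 2024-01-15..."},
        {"role": "assistant", "content": "{\"invoice_number\": \"12345\"}"},
    ]},
    {"messages": [
        {"role": "user", "content": "Unreadable scan"},
        {"role": "assistant", "content": "   "},
    ]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.jsonl")
    count = write_jsonl(examples, path)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(count, len(lines))
```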

The 10-Question Decision Framework

Use this flowchart to decide:

[Figure: Decision Flowchart]

The 10 questions:

1. Is your knowledge static or dynamic?
- Static → Consider fine-tuning
- Dynamic → Use RAG
2. Do you need specific tone/format/style?
- Yes → Fine-tuning may be needed
- No → Start with prompting
3. How many training examples do you have?
- <100 → Prompt engineering only
- 100–500 → Few-shot prompting
- 500–1000 → RAG + prompting
- >1000 → Can consider fine-tuning
4. What accuracy do you need?
- <85% → Prompt engineering
- 85–92% → RAG + prompting
- >92% → Fine-tuning or hybrid
5. Is latency critical?
- Yes (<500ms) → Fine-tuned small model
- No → GPT-4 + RAG
6. Is cost per request critical?
- Yes (<$0.01) → Fine-tuned model
- No → Use best available model
7. Do you need source attribution?
- Yes → RAG
- No → Any approach works
8. How often does knowledge change?
- Daily/weekly → RAG
- Monthly/yearly → Fine-tuning viable
- Never → Prompt engineering
9. Do you have engineering resources for infrastructure?
- Yes → RAG (vector DB)
- Limited → Prompt engineering or fine-tune via API
10. What's your timeline?
- Days → Prompt engineering
- Weeks → RAG
- Months → Fine-tuning
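
Some of these questions are mechanical enough to encode. Here's a rough sketch covering five of the ten (update frequency, style, training examples, accuracy, attribution), using the thresholds listed above. The dataclass, the function name, and the way conflicting answers are combined are my own simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    knowledge_changes: str       # "daily", "weekly", "monthly", "yearly", or "never"
    needs_specific_style: bool
    training_examples: int
    target_accuracy: float       # e.g. 0.95 for 95%
    needs_attribution: bool


def recommend(req: Requirements) -> List[str]:
    """Return the techniques suggested by a subset of the ten questions."""
    techniques = ["prompt engineering"]  # Always the baseline

    # Frequently changing knowledge or a need for citations points to RAG
    if req.knowledge_changes in ("daily", "weekly") or req.needs_attribution:
        techniques.append("RAG")
    elif 0.85 <= req.target_accuracy <= 0.92:
        techniques.append("RAG")

    # Fine-tuning needs both a reason (style or >92% accuracy) and enough data
    wants_fine_tuning = req.needs_specific_style or req.target_accuracy > 0.92
    if wants_fine_tuning and req.training_examples > 1000:
        techniques.append("fine-tuning")

    return techniques


print(recommend(Requirements("daily", True, 5000, 0.95, True)))
```

Latency, cost, infrastructure, and timeline are left out because they depend on numbers only you have. Read the output as a shortlist to test.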

Hybrid Approaches (What Actually Works)

The best systems combine all three:

Pattern 1: Fine-Tuned Base + RAG Context

class HybridSystem:
    """Fine-tuned model with RAG context injection."""
    
    def __init__(self):
        self.fine_tuned_model = "ft:gpt-3.5-turbo:custom:123"
        self.rag = RAGSystem()
    
    def generate(self, query: str) -> str:
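        # 1. Retrieve relevant context
        # (assumes RAGSystem exposes retrieve(), returning a list of text chunks)
        contexts = self.rag.retrieve(query, top_k=3)
        context_str = "\n\n---\n\n".join(contexts)
        
        # 2. Format prompt with context
        prompt = f"""Answer using your training AND the following context:

## Context
{context_str}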

## Query
{query}

## Answer"""
        
        # 3. Use fine-tuned model (knows task + format)
        response = openai_client.chat.completions.create(
            model=self.fine_tuned_model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content

When to use:

Need specific format/style (fine-tuning)
Knowledge is dynamic (RAG)
DeepAgent uses this pattern for skill-specific responses

Pattern 2: Prompt Templates + Vector Retrieval

class TemplateRAGSystem:
    """Prompt templates with dynamic context injection."""
    
    def __init__(self):
        self.templates = {
            "support": "You are a helpful support agent...",
            "sales": "You are a sales consultant...",
            "technical": "You are a technical expert..."
        }
        self.rag = RAGSystem()
    
    def generate(self, query: str, persona: str) -> str:
        # 1. Select template
        system_prompt = self.templates.get(persona, self.templates["support"])
        
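        # 2. Retrieve persona-specific context
        # (assumes retrieve() accepts a metadata filter and returns a list of text chunks)
        contexts = self.rag.retrieve(
            query, 
            top_k=5,
            filter={"category": persona}
        )
        context_str = "\n\n---\n\n".join(contexts)
        
        # 3. Generate with template + context
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context: {context_str}\n\nQuestion: {query}"}
        ]
        
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=messages
        )
        return response.choices[0].message.content  # Return text, matching the -> str annotation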

When to use:

Multiple personas/use cases
Shared knowledge base
Need consistent tone per persona

Pattern 3: Skills-Based (DeepAgent Pattern)

class SkillsBasedSystem:
    """DeepAgent-style skill routing."""
    
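    def __init__(self, rag_system, fine_tuned_model):
        # Dependencies are injected; WebSearchSkill and CalculatorSkill are defined elsewhere
        self.skills = {
            "web_search": WebSearchSkill(),
            "calculator": CalculatorSkill(),
            "document_qa": DocumentQASkill(rag_system),
            "creative_writing": CreativeWritingSkill(fine_tuned_model)
        }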
    
    def route_and_execute(self, query: str) -> str:
        # 1. Classify intent
        intent = self.classify_intent(query)
        
        # 2. Route to appropriate skill
        skill = self.skills.get(intent)
        
        if not skill:
            # Fallback to general prompt
            return self.general_prompt(query)
        
        # 3. Execute with skill-specific approach
        return skill.execute(query)

class DocumentQASkill:
    """Uses RAG for document Q&A."""
    
    def __init__(self, rag_system):
        self.rag = rag_system
    
    def execute(self, query: str) -> str:
        contexts = self.rag.retrieve(query)
        # ... generate with context

class CreativeWritingSkill:
    """Uses fine-tuned model for creative tasks."""
    
    def __init__(self, fine_tuned_model):
        self.model = fine_tuned_model
    
    def execute(self, query: str) -> str:
        # Use fine-tuned model
        return generate_with_model(self.model, query)

When to use:

Multiple distinct tasks
Different approaches optimal for each
Complex agent systems
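
The routing idea can be tried end to end without any model calls. Here's a self-contained sketch with a toy keyword classifier and plain functions standing in for skills. This is not DeepAgent's actual code; a production router would more likely classify intent with a model, and every name here is illustrative.

```python
from typing import Callable, Dict


def classify_intent(query: str) -> str:
    """Toy keyword classifier standing in for a model-based intent classifier."""
    lowered = query.lower()
    has_digit = any(ch.isdigit() for ch in lowered)
    has_operator = any(op in lowered for op in "+-*/")
    if has_digit and has_operator:
        return "calculator"
    if "document" in lowered or "policy" in lowered:
        return "document_qa"
    return "general"


class SkillRouter:
    """Map an intent label to a handler, with a fallback for unknown intents."""

    def __init__(self, skills: Dict[str, Callable[[str], str]], fallback: Callable[[str], str]):
        self.skills = skills
        self.fallback = fallback

    def route_and_execute(self, query: str) -> str:
        handler = self.skills.get(classify_intent(query), self.fallback)
        return handler(query)


# Stub handlers; real skills would call RAG, a fine-tuned model, or a tool
router = SkillRouter(
    skills={
        "calculator": lambda q: "calculator skill handled: " + q,
        "document_qa": lambda q: "document_qa skill handled: " + q,
    },
    fallback=lambda q: "general prompt handled: " + q,
)

print(router.route_and_execute("Summarize the refund policy"))
```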

ROI Analysis by Approach

Cost Per Request

Approach	Input Tokens	Output Tokens	Cost/Request	Monthly (10K req)
Basic Prompt	500	300	$0.033	$330
Advanced Prompt	1500	300	$0.063	$630
RAG (avg)	2000	300	$0.078	$780
Fine-tuned (7B)	500	300	$0.003	$30
Hybrid (FT+RAG)	2000	300	$0.063	$630

Accuracy vs Cost Trade-off

[Figure: Cost vs Accuracy Trade-off]

Sweet spots:

Budget constrained: Fine-tuned 7B model (high accuracy, low cost)
Accuracy critical: GPT-4 + RAG (best results, higher cost)
Speed critical: Fine-tuned small model (<1s latency)

Real Case Studies

Case 1: Document Extraction Pipeline

Evolution:

Month 1: Basic prompting → 72% accuracy
Month 2: Advanced prompting (few-shot, schemas) → 85% accuracy
Month 3: RAG (retrieve similar extractions) → 89% accuracy
Month 4: Fine-tuned schema extractors → 96% accuracy

Final architecture:

Fine-tuned model for extraction logic
RAG for similar-document context
Total cost: $0.005/request (vs $0.03 for GPT-4)

Case 2: DeepAgent

Evolution:

Month 1: System prompts only → Functional but generic
Month 2: RAG for skill documentation → Better context
Month 3: Fine-tuned reasoning model → Better reasoning chains

Final architecture:

Skills-based routing
RAG for dynamic knowledge
Fine-tuned models for specific skills
SSE streaming for real-time responses

Case 3: Customer Support Chatbot (Consulting Project)

Decision:

Started with RAG (product docs change frequently)
Added fine-tuning for tone consistency
Kept GPT-4 for complex escalations

Results:

87% of conversations resolved without a human agent
$0.08/request average cost
3-month payback period

The "Don't" List

❌ Don't fine-tune just to teach the model facts (use RAG or prompting; fine-tune for output style and format)
❌ Don't use RAG for code generation (use fine-tuned code models)
❌ Don't start with fine-tuning (expensive experiment)
❌ Don't over-engineer early (start simple, measure, iterate)
❌ Don't ignore latency (RAG adds 100-200ms)
❌ Don't skip evaluation (know when to upgrade)

Production Decision Checklist

Before choosing your approach:

Measured baseline with prompt engineering
Defined accuracy requirements (target number)
Calculated cost budget per request
Assessed latency requirements
Estimated training data availability
Evaluated infrastructure resources
Considered knowledge update frequency
Planned evaluation framework
Documented rollback strategy

Summary: The Framework

Phase	Approach	Time	Cost	When to Move On
1	Prompt Engineering	Days	Low	Accuracy plateau, high latency
2	RAG	1–2 weeks	Medium	Still need higher accuracy or style control
3	Fine-Tuning	2–4 weeks	High	Maximize accuracy, minimize latency/cost

My rule: Start with prompting, add RAG at month 2 if needed, fine-tune at month 6 if still necessary.

Questions to Ask Yourself

What's my accuracy target? (Quantify it)
What's my cost per request budget? (Cents matter at scale)
How often does my knowledge change? (Daily = RAG, Yearly = Fine-tuning)
Do I have 1000+ training examples? (Required for fine-tuning)
Is latency critical? (<500ms = Fine-tuned small model)
Do I need source attribution? (Yes = RAG)

Still unsure? Email me your specific scenario—I'll help you decide.

Related Resources

If you're building production LLM systems, these resources complement the decision framework above and cover the implementation details that sit beneath each technique.

Prompt Engineering

OpenAI Prompt Engineering Guide — Structured best practices from the provider that defined the field.
Anthropic's Claude Prompt Library — Curated patterns for reasoning, extraction, and classification tasks.

Retrieval-Augmented Generation

LangChain RAG Tutorial — End-to-end walkthrough with vector stores, embedding models, and chunking strategies.
LlamaIndex Documentation — Advanced RAG patterns including recursive retrieval and agentic query engines.

Fine-Tuning

OpenAI Fine-Tuning Guide — Data preparation, hyperparameter selection, and evaluation protocols.
Hugging Face TRL Library — Open-source fine-tuning with PPO, DPO, and ORPO methods.

Evaluation & Cost Tracking

Langfuse — Open-source LLM observability with tracing, scoring, and cost attribution.
Weights & Biases — Experiment tracking for fine-tuning runs and prompt versioning.

Questions? Email me or connect on LinkedIn.