Prompt Engineering vs RAG vs Fine-Tuning: The Decision Framework I Actually Use
The product manager asked for "an AI feature." Vague, exciting, and terrifying all at once.
I've sat in that meeting too many times. The team wants to ship fast, the PM wants magic, and you're staring at three paths: prompt engineering (cheap, fast), RAG (adds complexity), or fine-tuning (expensive, slow).
Choose wrong and you waste $50K and three months. This is the decision framework I've refined across the Document Extraction Pipeline, DeepAgent, and a half-dozen production systems.
The Decision Matrix
Start with this 2×2 matrix. It narrows your options in 30 seconds.
| | Static Knowledge | Dynamic Knowledge |
|---|---|---|
| Generic Output | Prompt Engineering | RAG |
| Specific Output | Fine-Tuning | RAG + Fine-Tuning |
Definitions:
- Static Knowledge: Fixed facts that rarely change (legal clauses, archived specifications)
- Dynamic Knowledge: Frequently updated (stock prices, user data)
- Generic Output: Standard tone, common formats
- Specific Output: Brand voice, specialized formatting, unique style
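The matrix can also be written as a lookup table, which is handy when the first guess needs to be logged alongside a project brief. This is a minimal sketch: the enum and function names (`Knowledge`, `Output`, `first_guess`) are illustrative and not part of any library.

```python
from enum import Enum


class Knowledge(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


class Output(Enum):
    GENERIC = "generic"
    SPECIFIC = "specific"


# One entry per cell of the 2x2 matrix above.
DECISION_MATRIX = {
    (Knowledge.STATIC, Output.GENERIC): "Prompt Engineering",
    (Knowledge.DYNAMIC, Output.GENERIC): "RAG",
    (Knowledge.STATIC, Output.SPECIFIC): "Fine-Tuning",
    (Knowledge.DYNAMIC, Output.SPECIFIC): "RAG + Fine-Tuning",
}


def first_guess(knowledge: Knowledge, output: Output) -> str:
    """Return the matrix cell for a knowledge/output combination."""
    return DECISION_MATRIX[(knowledge, output)]


print(first_guess(Knowledge.DYNAMIC, Output.GENERIC))  # RAG
```

The result is only a starting point; the phases below still apply in order.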
Phase 1: Prompt Engineering (Always Start Here)
Cost: $0.001–0.02 per 1K tokens
Time: Hours to days
Break-even: Immediate
You should always start here. It's the cheapest way to establish a baseline and understand your problem.
What You Can Achieve with Prompting
```python
# Example: Structured extraction with just prompting
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = """Extract invoice information from the text below.

## Text
{text}

## Instructions
1. Identify the invoice number, date, total amount, and vendor
2. Return ONLY valid JSON in this exact format:
{{
    "invoice_number": "...",
    "date": "YYYY-MM-DD",
    "total_amount": 0.00,
    "vendor": "..."
}}
3. Use null for missing fields
4. Ensure date is ISO format (YYYY-MM-DD)
5. Amount should be a number, not string

## Output"""

invoice_text = "Invoice #12345 dated 2024-01-15..."  # Replace with the real document text

response = openai_client.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode needs a model that supports response_format
    messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=invoice_text)}],
    response_format={"type": "json_object"},  # Enforce JSON output
)
print(response.choices[0].message.content)
```
Advanced prompt engineering techniques:
- Chain-of-thought: Ask the model to reason step-by-step
- Few-shot examples: Include 2-3 examples in the prompt
- Structured outputs: Use JSON schema or function calling
- System prompts: Set behavior in the system message
- Prompt chaining: Break complex tasks into multiple calls
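Two of these techniques, system prompts and few-shot examples, come down to how the message list is assembled. The helper below is a sketch of one way to do it; `build_few_shot_messages` is a hypothetical name, and the message format is the standard role/content chat structure used in the example above.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    system_prompt: str,
    examples: List[Tuple[str, str]],
    user_input: str,
) -> List[Dict[str, str]]:
    """Build a chat message list: system prompt, worked examples, then the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        # Each example is replayed as a user turn followed by the ideal assistant turn
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_few_shot_messages(
    system_prompt="You extract invoice data from text.",
    examples=[
        ("Invoice #1 dated 2024-01-15", '{"invoice_number": "1", "date": "2024-01-15"}'),
        ("Invoice #2 dated 2024-02-01", '{"invoice_number": "2", "date": "2024-02-01"}'),
    ],
    user_input="Invoice #3 dated 2024-03-10",
)
```

The resulting list can be passed directly as the `messages` argument of a chat completion call.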
When to Move On
You know you've hit the prompt engineering ceiling when:
- Accuracy plateaus (not improving with better prompts)
- Latency is too high (prompt is 5K+ tokens)
- Context window overflow (prompt + context > 100K tokens)
- Consistency issues (outputs vary significantly)
- Cost per request exceeds budget
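The first signal, an accuracy plateau, is easy to check mechanically if you record accuracy after each prompt revision. The sketch below assumes such a history exists; the function name, the window of three revisions, and the one-point minimum gain are all arbitrary illustrative choices.

```python
from typing import List


def has_plateaued(accuracy_history: List[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """True if the last `window` prompt revisions beat the earlier best by less than `min_gain`."""
    if len(accuracy_history) <= window:
        return False  # Not enough revisions to compare against
    earlier_best = max(accuracy_history[:-window])
    recent_best = max(accuracy_history[-window:])
    return recent_best - earlier_best < min_gain


# Accuracy after each prompt revision (illustrative numbers)
print(has_plateaued([0.72, 0.80, 0.85, 0.85, 0.852, 0.851]))  # True
```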
From the Document Extraction Pipeline:
- Started with basic prompting: 72% extraction accuracy
- Advanced prompting (few-shot, schemas): 85% accuracy
- Ceiling: Couldn't reach 95% without external knowledge
- Decision: Move to RAG at month 2
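Numbers like these only mean something if accuracy is measured the same way at every stage. The article does not say how the figures above were computed, so the following is an assumed definition: field-level exact match against a labeled set. `field_accuracy` is a hypothetical helper.

```python
from typing import Any, Dict, List


def field_accuracy(predictions: List[Dict[str, Any]], labels: List[Dict[str, Any]]) -> float:
    """Fraction of labeled fields whose predicted value matches exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = 0
    total = 0
    for predicted, expected in zip(predictions, labels):
        for field, expected_value in expected.items():
            total += 1
            if predicted.get(field) == expected_value:
                correct += 1
    return correct / total if total else 0.0


labels = [
    {"invoice_number": "12345", "date": "2024-01-15"},
    {"invoice_number": "67890", "date": "2024-02-01"},
]
predictions = [
    {"invoice_number": "12345", "date": "2024-01-15"},
    {"invoice_number": "67890", "date": None},  # Model missed the date
]
print(field_accuracy(predictions, labels))  # 0.75
```

Running the same function on the same labeled set after each change keeps the comparison between prompting, RAG, and fine-tuning honest.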
Phase 2: RAG (Dynamic Knowledge)
Cost: 20% latency increase + vector DB costs ($0.10/1M vectors)
Time: 1–2 weeks
Break-even: When you need proprietary or frequently updated data
RAG (Retrieval-Augmented Generation) adds a knowledge base that the LLM can query.
When RAG is the Right Choice
✅ Use RAG when:
- Knowledge changes frequently (product docs, policies)
- You have proprietary data not in training corpus
- You need source attribution ("according to doc X...")
- You want to reduce hallucinations with grounded context
- You need to control information access (user-specific docs)
❌ Don't use RAG when:
- Knowledge is static and fits in context window
- You need specific tone/style (use fine-tuning)
- You're doing code generation (use fine-tuned code models)
- Retrieval adds unacceptable latency (>200ms)
RAG Architecture
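```python
import os
from typing import List, Optional

from openai import OpenAI
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment


class RAGSystem:
    def __init__(self):
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        # The index must already exist and hold 384-dimensional vectors
        # (the output size of all-MiniLM-L6-v2)
        pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
        self.vector_db = pc.Index("document-embeddings")

    def retrieve(self, question: str, top_k: int = 5, filter: Optional[dict] = None) -> List[str]:
        """Return the text of the top_k chunks most similar to the question."""
        # 1. Embed the question
        query_embedding = self.embedding_model.encode(question)

        # 2. Retrieve relevant chunks (optionally restricted by a metadata filter)
        results = self.vector_db.query(
            vector=query_embedding.tolist(),
            top_k=top_k,
            filter=filter,
            include_metadata=True,
        )
        return [match.metadata["text"] for match in results.matches]

    def query(self, question: str, top_k: int = 5) -> str:
        """Answer question using RAG."""
        # 3. Format context
        context_str = "\n\n---\n\n".join(self.retrieve(question, top_k))

        # 4. Generate with context
        prompt = (
            "Answer the question using the provided context.\n"
            "If the answer isn't in the context, say \"I don't have that information.\"\n\n"
            f"## Context\n{context_str}\n\n"
            f"## Question\n{question}\n\n"
            "## Answer"
        )
        response = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```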
Chunking Strategy
The #1 mistake in RAG: wrong chunk size.
```python
from typing import List


class DocumentChunker:
    """Chunk documents for optimal retrieval."""

    def __init__(
        self,
        chunk_size: int = 500,  # Words per chunk (a rough proxy for tokens)
        overlap: int = 100,     # Words shared between consecutive chunks
    ):
        if not 0 <= overlap < chunk_size:
            # Otherwise the loop below would never advance
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_text(self, text: str) -> List[dict]:
        """Create overlapping chunks."""
        # Simple word-based chunking (use tiktoken for token-based)
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = min(start + self.chunk_size, len(words))
            chunks.append({
                "text": " ".join(words[start:end]),
                "start_idx": start,
                "end_idx": end,
                "metadata": {
                    "chunk_index": len(chunks),
                    "total_chunks": None,  # Set later
                },
            })
            if end == len(words):
                break  # Last chunk reached; skip a trailing chunk that is pure overlap
            start += self.chunk_size - self.overlap

        # Update total
        for chunk in chunks:
            chunk["metadata"]["total_chunks"] = len(chunks)
        return chunks


# Optimal chunk sizes by use case:
CHUNK_CONFIGS = {
    "general_qa": {"size": 500, "overlap": 100},
    "legal_documents": {"size": 1000, "overlap": 200},
    "code": {"size": 300, "overlap": 50},
    "technical_docs": {"size": 400, "overlap": 80},
}
```
When RAG Wasn't Enough
Document Extraction Pipeline case study:
- RAG improved accuracy to 89%
- But specialized invoice formats needed specific extraction logic
- Prompt + RAG reached 92%—still below 95% SLA
- Decision: Fine-tuned schema-specific extractors at month 4
Phase 3: Fine-Tuning (The Nuclear Option)
Cost: $500–5,000 training + ongoing inference
Time: 2–4 weeks
Break-even: When you need specific tone/format or want to distil a big model
Fine-tuning adapts a base model to your specific task. It's powerful but expensive.
When Fine-Tuning is the Right Choice
✅ Use fine-tuning when:
- RAG + prompting can't achieve required accuracy
- You need specific tone, style, or format consistency
- You want to reduce latency (distil GPT-4 → smaller model)
- You want to reduce cost (fine-tuned 7B model vs GPT-4)
- You have 1,000+ high-quality training examples
❌ Don't fine-tune when:
- You have <500 training examples (use few-shot prompting)
- Knowledge is dynamic (use RAG)
- You need general capabilities (use base model)
- Budget is constrained (training + serving costs)
Fine-Tuning Example
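Step 1: prepare the training data in training_data.jsonl, one JSON object per line:

```json
{"messages": [{"role": "system", "content": "You extract invoice data from text."}, {"role": "user", "content": "Invoice #12345 dated 2024-01-15..."}, {"role": "assistant", "content": "{\"invoice_number\": \"12345\", \"date\": \"2024-01-15\", \"total\": 1500.00}"}]}
```

```python
# 2. Upload and train (OpenAI example)
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Upload training file
with open("training_data.jsonl", "rb") as f:
    file = openai_client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = openai_client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo",
    suffix="invoice-extractor",
)

# 3. Use fine-tuned model (once the job has finished)
invoice_text = "Invoice #12345 dated 2024-01-15..."  # Replace with the real document text
response = openai_client.chat.completions.create(
    # Placeholder name; the real one is the finished job's fine_tuned_model field
    model="ft:gpt-3.5-turbo:my-org:invoice-extractor:12345",
    messages=[
        # Send the same system message the model was trained with
        {"role": "system", "content": "You extract invoice data from text."},
        {"role": "user", "content": invoice_text},
    ],
)
```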
Cost Comparison: Fine-Tuned vs GPT-4
| Metric | GPT-4 | Fine-Tuned 7B Model |
|---|---|---|
| Training cost | $0 | $2,000–5,000 |
| Inference cost/1K tokens | $0.03 | $0.002 |
| Latency | 2–5s | 0.5–1s |
| Accuracy (invoice extraction) | 92% | 96% |
| Break-even volume | N/A | ~200K requests |
From the Document Extraction Pipeline:
- Fine-tuned on 5,000 labeled invoices
- Training cost: $3,200
- Inference cost reduction: 93% vs GPT-4
- Accuracy improvement: 92% → 96%
- Break-even: Month 3 at current volume
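The break-even point is just the training cost divided by the saving per request. The sketch below shows the arithmetic; `break_even_requests` is a hypothetical helper, and the per-request costs in the usage line ($0.033 for GPT-4, $0.003 for the fine-tuned model) are taken from the cost-per-request table later in this article. It ignores labeling effort and serving infrastructure, so treat the result as a lower bound.

```python
import math


def break_even_requests(
    training_cost: float,
    baseline_cost_per_request: float,
    tuned_cost_per_request: float,
) -> int:
    """Number of requests after which the training cost is recovered by cheaper inference."""
    saving = baseline_cost_per_request - tuned_cost_per_request
    if saving <= 0:
        raise ValueError("the fine-tuned model must be cheaper per request to break even")
    return math.ceil(training_cost / saving)


# $3,200 training cost, $0.033 vs $0.003 per request
print(break_even_requests(3200, 0.033, 0.003))  # 106667
```

Dividing the result by your monthly request volume gives the break-even time in months.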
Distillation: Big → Small
Use GPT-4 to generate training data, then fine-tune a small model:
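```python
from typing import List

from datasets import Dataset
from openai import OpenAI
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment


class DistillationPipeline:
    """Distil GPT-4 into smaller model."""

    def generate_training_data(
        self,
        inputs: List[str],
        teacher_model: str = "gpt-4",
    ) -> List[dict]:
        """Generate high-quality training data from teacher model."""
        training_examples = []
        for input_text in inputs:
            # Get teacher response
            teacher_response = openai_client.chat.completions.create(
                model=teacher_model,
                messages=[{"role": "user", "content": input_text}],
                temperature=0.0,  # Minimize sampling randomness
            )
            training_examples.append({
                "messages": [
                    {"role": "user", "content": input_text},
                    {"role": "assistant", "content": teacher_response.choices[0].message.content},
                ]
            })
        return training_examples

    def fine_tune_student(
        self,
        training_data: List[dict],
        student_base: str = "meta-llama/Llama-2-7b-hf",  # Transformers-format checkpoint
    ):
        """Fine-tune student model on teacher-generated data."""
        tokenizer = AutoTokenizer.from_pretrained(student_base)
        tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
        model = AutoModelForCausalLM.from_pretrained(student_base)

        # Flatten each user/assistant pair into one training string, then tokenize
        texts = [
            f"{ex['messages'][0]['content']}\n{ex['messages'][1]['content']}{tokenizer.eos_token}"
            for ex in training_data
        ]
        dataset = Dataset.from_dict({"text": texts}).map(
            lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
            batched=True,
            remove_columns=["text"],
        )

        training_args = TrainingArguments(
            output_dir="./distilled_model",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            learning_rate=2e-5,
            warmup_steps=100,
        )
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            # Builds causal-LM labels from input_ids and pads each batch
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        trainer.train()
```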
The 10-Question Decision Framework
Work through these 10 questions to decide:
1. Is your knowledge static or dynamic?
   - Static → Consider fine-tuning
   - Dynamic → Use RAG
2. Do you need specific tone/format/style?
   - Yes → Fine-tuning may be needed
   - No → Start with prompting
3. How many training examples do you have?
   - <100 → Prompt engineering only
   - 100–500 → Few-shot prompting
   - 500–1000 → RAG + prompting
   - >1000 → Can consider fine-tuning
4. What accuracy do you need?
   - <85% → Prompt engineering
   - 85–92% → RAG + prompting
   - >92% → Fine-tuning or hybrid
5. Is latency critical?
   - Yes (<500ms) → Fine-tuned small model
   - No → GPT-4 + RAG
6. Is cost per request critical?
   - Yes (<$0.01) → Fine-tuned model
   - No → Use best available model
7. Do you need source attribution?
   - Yes → RAG
   - No → Any approach works
8. How often does knowledge change?
   - Daily/weekly → RAG
   - Monthly/yearly → Fine-tuning viable
   - Never → Prompt engineering
9. Do you have engineering resources for infrastructure?
   - Yes → RAG (vector DB)
   - Limited → Prompt engineering or fine-tune via API
10. What's your timeline?
   - Days → Prompt engineering
   - Weeks → RAG
   - Months → Fine-tuning
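Some of these questions can be encoded as simple rules. The sketch below covers only four of them (knowledge change, style, training examples, attribution) and is one possible encoding rather than the full framework; `Requirements` and `recommend` are hypothetical names, and the 1,000-example threshold follows the guidance in the fine-tuning section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    knowledge_changes_often: bool   # How often does knowledge change?
    needs_specific_style: bool      # Specific tone/format/style?
    training_examples: int          # Labeled examples available
    needs_source_attribution: bool  # Must answers cite sources?


def recommend(req: Requirements) -> List[str]:
    """Return the approaches suggested by a subset of the questions above."""
    approaches = ["prompt engineering"]  # Always the baseline
    if req.knowledge_changes_often or req.needs_source_attribution:
        approaches.append("RAG")
    if req.needs_specific_style and req.training_examples >= 1000:
        approaches.append("fine-tuning")
    return approaches


print(recommend(Requirements(True, True, 5000, False)))
# ['prompt engineering', 'RAG', 'fine-tuning']
```

Accuracy, latency, cost, and timeline are left out on purpose: they need measured numbers, not yes/no answers.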
Hybrid Approaches (What Actually Works)
The best systems combine all three:
Pattern 1: Fine-Tuned Base + RAG Context
```python
# Reuses openai_client and RAGSystem from the RAG Architecture example


class HybridSystem:
    """Fine-tuned model with RAG context injection."""

    def __init__(self):
        self.fine_tuned_model = "ft:gpt-3.5-turbo:custom:123"
        self.rag = RAGSystem()

    def generate(self, query: str) -> str:
        # 1. Retrieve relevant context
        # (assumes RAGSystem exposes retrieve() returning a list of chunk strings)
        contexts = self.rag.retrieve(query, top_k=3)
        context_str = "\n\n---\n\n".join(contexts)

        # 2. Format prompt with context
        prompt = (
            "Answer using your training AND the following context:\n\n"
            f"## Context\n{context_str}\n\n"
            f"## Query\n{query}\n\n"
            "## Answer"
        )

        # 3. Use fine-tuned model (knows task + format)
        response = openai_client.chat.completions.create(
            model=self.fine_tuned_model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```
When to use:
- Need specific format/style (fine-tuning)
- Knowledge is dynamic (RAG)
- DeepAgent uses this pattern for skill-specific responses
Pattern 2: Prompt Templates + Vector Retrieval
```python
# Reuses openai_client and RAGSystem from the RAG Architecture example


class TemplateRAGSystem:
    """Prompt templates with dynamic context injection."""

    def __init__(self):
        self.templates = {
            "support": "You are a helpful support agent...",
            "sales": "You are a sales consultant...",
            "technical": "You are a technical expert...",
        }
        self.rag = RAGSystem()

    def generate(self, query: str, persona: str) -> str:
        # 1. Select template (fall back to the support persona)
        system_prompt = self.templates.get(persona, self.templates["support"])

        # 2. Retrieve persona-specific context
        # (assumes retrieve() accepts a metadata filter and returns chunk strings)
        contexts = self.rag.retrieve(query, top_k=5, filter={"category": persona})
        context_str = "\n\n---\n\n".join(contexts)

        # 3. Generate with template + context
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context: {context_str}\n\nQuestion: {query}"},
        ]
        response = openai_client.chat.completions.create(model="gpt-4", messages=messages)
        return response.choices[0].message.content  # Return the text, not the response object
```
When to use:
- Multiple personas/use cases
- Shared knowledge base
- Need consistent tone per persona
Pattern 3: Skills-Based (DeepAgent Pattern)
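```python
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def generate_with_model(model: str, prompt: str) -> str:
    """Single-turn completion helper shared by the skills below."""
    response = openai_client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


class DocumentQASkill:
    """Uses RAG for document Q&A."""

    def __init__(self, rag_system):
        self.rag = rag_system

    def execute(self, query: str) -> str:
        # RAGSystem.query() retrieves context and generates the answer
        return self.rag.query(query)


class CreativeWritingSkill:
    """Uses fine-tuned model for creative tasks."""

    def __init__(self, fine_tuned_model: str):
        self.model = fine_tuned_model

    def execute(self, query: str) -> str:
        # Use fine-tuned model
        return generate_with_model(self.model, query)


class SkillsBasedSystem:
    """DeepAgent-style skill routing."""

    def __init__(self, rag_system, fine_tuned_model: str):
        # WebSearchSkill and CalculatorSkill are not shown here; they are
        # assumed to expose the same execute(query) -> str interface.
        self.skills = {
            "web_search": WebSearchSkill(),
            "calculator": CalculatorSkill(),
            "document_qa": DocumentQASkill(rag_system),
            "creative_writing": CreativeWritingSkill(fine_tuned_model),
        }

    def classify_intent(self, query: str) -> str:
        # Placeholder classifier: ask a general model to pick a skill name
        labels = ", ".join(self.skills)
        prompt = f"Classify this request as one of: {labels}. Reply with the label only.\n\n{query}"
        return generate_with_model("gpt-4", prompt).strip()

    def general_prompt(self, query: str) -> str:
        return generate_with_model("gpt-4", query)

    def route_and_execute(self, query: str) -> str:
        # 1. Classify intent
        intent = self.classify_intent(query)

        # 2. Route to appropriate skill
        skill = self.skills.get(intent)
        if not skill:
            # Fallback to general prompt
            return self.general_prompt(query)

        # 3. Execute with skill-specific approach
        return skill.execute(query)
```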
When to use:
- Multiple distinct tasks
- Different approaches optimal for each
- Complex agent systems
ROI Analysis by Approach
Cost Per Request
| Approach | Input Tokens | Output Tokens | Cost/Request | Monthly (10K req) |
|---|---|---|---|---|
| Basic Prompt | 500 | 300 | $0.033 | $330 |
| Advanced Prompt | 1500 | 300 | $0.063 | $630 |
| RAG (avg) | 2000 | 300 | $0.078 | $780 |
| Fine-tuned (7B) | 500 | 300 | $0.003 | $30 |
| Hybrid (FT+RAG) | 2000 | 300 | $0.063 | $630 |
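The three GPT-4 rows can be reproduced from token counts alone. The sketch below assumes $0.03 per 1K input tokens (the rate quoted in the earlier cost comparison) and $0.06 per 1K output tokens; the output rate is an assumption that happens to match the table, not a figure stated in this article. It does not cover the fine-tuned rows, which use different pricing.

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float = 0.03,   # Assumed GPT-4 input rate
    output_price_per_1k: float = 0.06,  # Assumed GPT-4 output rate
) -> float:
    """Cost in dollars of one request, given token counts and per-1K-token prices."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


for name, input_tokens in [("Basic Prompt", 500), ("Advanced Prompt", 1500), ("RAG (avg)", 2000)]:
    per_request = cost_per_request(input_tokens, 300)
    monthly = per_request * 10_000  # 10K requests per month, as in the table
    print(f"{name}: ${per_request:.3f}/request, ${monthly:,.0f}/month")
```

Swapping in your own token counts and current provider prices gives a quick budget check before committing to an approach.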
Accuracy vs Cost Trade-off
Sweet spots:
- Budget constrained: Fine-tuned 7B model (high accuracy, low cost)
- Accuracy critical: GPT-4 + RAG (best results, higher cost)
- Speed critical: Fine-tuned small model (<1s latency)
Real Case Studies
Case 1: Document Extraction Pipeline
Evolution:
- Month 1: Basic prompting → 72% accuracy
- Month 2: Advanced prompting (few-shot, schemas) → 85% accuracy
- Month 3: RAG (retrieve similar extractions) → 89% accuracy
- Month 4: Fine-tuned schema extractors → 96% accuracy
Final architecture:
- Fine-tuned model for extraction logic
- RAG for similar-document context
- Total cost: $0.005/request (vs $0.03 for GPT-4)
Case 2: DeepAgent
Evolution:
- Month 1: System prompts only → Functional but generic
- Month 2: RAG for skill documentation → Better context
- Month 3: Fine-tuned reasoning model → Better reasoning chains
Final architecture:
- Skills-based routing
- RAG for dynamic knowledge
- Fine-tuned models for specific skills
- SSE streaming for real-time responses
Case 3: Customer Support Chatbot (Consulting Project)
Decision:
- Started with RAG (product docs change frequently)
- Added fine-tuning for tone consistency
- Kept GPT-4 for complex escalations
Results:
- 87% resolution rate without human intervention
- $0.08/request average cost
- 3-month payback period
The "Don't" List
- ❌ Don't fine-tune for dynamic knowledge (use RAG)
- ❌ Don't use RAG for code generation (use fine-tuned code models)
- ❌ Don't start with fine-tuning (expensive experiment)
- ❌ Don't over-engineer early (start simple, measure, iterate)
- ❌ Don't ignore latency (RAG adds 100-200ms)
- ❌ Don't skip evaluation (know when to upgrade)
Production Decision Checklist
Before choosing your approach:
- Measured baseline with prompt engineering
- Defined accuracy requirements (target number)
- Calculated cost budget per request
- Assessed latency requirements
- Estimated training data availability
- Evaluated infrastructure resources
- Considered knowledge update frequency
- Planned evaluation framework
- Documented rollback strategy
Summary: The Framework
| Phase | Approach | Time | Cost | When to Move On |
|---|---|---|---|---|
| 1 | Prompt Engineering | Days | Low | Accuracy plateau, high latency |
| 2 | RAG | 1–2 weeks | Medium | Still need higher accuracy or style control |
| 3 | Fine-Tuning | 2–4 weeks | High | Maximize accuracy, minimize latency/cost |
My rule: Start with prompting, add RAG at month 2 if needed, fine-tune at month 6 if still necessary.
Questions to Ask Yourself
- What's my accuracy target? (Quantify it)
- What's my cost per request budget? (Cents matter at scale)
- How often does my knowledge change? (Daily = RAG, Yearly = Fine-tuning)
- Do I have 1000+ training examples? (Required for fine-tuning)
- Is latency critical? (<500ms = Fine-tuned small model)
- Do I need source attribution? (Yes = RAG)
Still unsure? Email me your specific scenario—I'll help you decide.
Related Resources
If you're building production LLM systems, these resources complement the decision framework above and cover the implementation details that sit beneath each technique.
Prompt Engineering
- OpenAI Prompt Engineering Guide — Structured best practices from the provider that defined the field.
- Anthropic's Claude Prompt Library — Curated patterns for reasoning, extraction, and classification tasks.
Retrieval-Augmented Generation
- LangChain RAG Tutorial — End-to-end walkthrough with vector stores, embedding models, and chunking strategies.
- LlamaIndex Documentation — Advanced RAG patterns including recursive retrieval and agentic query engines.
Fine-Tuning
- OpenAI Fine-Tuning Guide — Data preparation, hyperparameter selection, and evaluation protocols.
- Hugging Face TRL Library — Open-source fine-tuning with PPO, DPO, and ORPO methods.
Evaluation & Cost Tracking
- Langfuse — Open-source LLM observability with tracing, scoring, and cost attribution.
- Weights & Biases — Experiment tracking for fine-tuning runs and prompt versioning.