Evaluating Generative AI in Production: Metrics Beyond 'Correct' and 'Incorrect'
I shipped the feature. The demo looked great. Users started complaining.
The LLM was generating responses that sounded good but were subtly wrong—wrong dates, fabricated statistics, confidently incorrect answers. And I had no systematic way to catch it. I was manually reviewing 50 random outputs and hoping for the best.
Traditional ML evaluation doesn't work for generative AI. ROUGE and BLEU scores measure text overlap, not truth. Accuracy requires ground truth, which doesn't exist for creative tasks. In this post, I'll show you the four-tier evaluation framework I use to ship reliable generative AI.
The Evaluation Pyramid
Not all evaluation is equal. Start cheap and deterministic, then add expensive judgment layers.
The four tiers:
- Deterministic checks (cheap, instant, always run)
- Heuristic metrics (ROUGE, BERTScore—limited value)
- LLM-as-judge (expensive, nuanced, sample-based)
- Human evaluation (gold standard, slow, expensive)
In the Document Extraction Pipeline, I use all four tiers: JSON schema validation (deterministic), semantic similarity (heuristic), GPT-4 grading (LLM-as-judge), and weekly human audits.
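To make the "cheap first, expensive later" ordering concrete, here is a minimal sketch of a tier runner. The names `run_tiers`, `TierResult`, and the stand-in checks and judge are hypothetical and not taken from the pipeline itself; the point is only that a failed deterministic check short-circuits before any judge call is paid for.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TierResult:
    passed: bool
    failed_checks: List[str] = field(default_factory=list)
    judge_score: Optional[float] = None


def run_tiers(
    output: str,
    deterministic_checks: Dict[str, Callable[[str], bool]],
    judge: Optional[Callable[[str], float]] = None,
    judge_threshold: float = 4.0,
) -> TierResult:
    """Run cheap checks first; only call the judge if they all pass."""
    failed = [name for name, check in deterministic_checks.items() if not check(output)]
    if failed:
        # Short-circuit: structurally broken output never reaches the judge
        return TierResult(passed=False, failed_checks=failed)
    if judge is None:
        return TierResult(passed=True)
    score = judge(output)
    return TierResult(passed=score >= judge_threshold, judge_score=score)


# Usage with stand-in checks and a fake judge (both are illustrative assumptions)
checks = {
    "non_empty": lambda s: len(s.strip()) > 0,
    "under_limit": lambda s: len(s) <= 2000,
}
judge_calls = []


def fake_judge(text: str) -> float:
    judge_calls.append(text)  # Record the call so we can see when the judge ran
    return 4.5


bad = run_tiers("", checks, fake_judge)
good = run_tiers("Your invoice total is $120.", checks, fake_judge)
```

In this sketch the empty output fails `non_empty` and the judge is called only once, for the output that passed every deterministic check.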
Tier 1: Deterministic Checks That Actually Matter
Before you pay for an LLM judge, verify that the output is structurally valid. These checks are your first line of defense.
JSON Schema Validation
For structured outputs (extraction, classification, configuration), enforce schemas with Pydantic:
    import json
    from datetime import datetime
    from typing import List, Optional

    from pydantic import BaseModel, Field, ValidationError, field_validator


    class ExtractionValidationError(Exception):
        """Raised when the LLM output cannot be parsed into a valid invoice."""


    class InvoiceExtraction(BaseModel):
        """Schema for invoice data extraction (Pydantic v2 syntax)."""
        invoice_number: str = Field(..., min_length=1, max_length=50)
        date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$")
        total_amount: float = Field(..., gt=0)
        vendor_name: str = Field(..., min_length=1)
        line_items: List[dict] = Field(..., min_length=1)

        @field_validator("date")
        @classmethod
        def validate_date_format(cls, v: str) -> str:
            # The regex only checks the shape; strptime rejects dates like 2024-13-45
            try:
                datetime.strptime(v, "%Y-%m-%d")
            except ValueError:
                raise ValueError("Invalid date format")
            return v

        @field_validator("total_amount")
        @classmethod
        def validate_reasonable_amount(cls, v: float) -> float:
            if v > 1_000_000:
                raise ValueError("Amount seems unreasonably high")
            return v


    def validate_llm_output(raw_output: str) -> tuple[bool, Optional[InvoiceExtraction]]:
        """Validate and parse LLM output."""
        try:
            data = json.loads(raw_output)          # Parse JSON
            validated = InvoiceExtraction(**data)  # Validate against schema
            return True, validated
        except json.JSONDecodeError as e:
            print(f"Invalid JSON: {e}")
            return False, None
        except (ValidationError, TypeError) as e:
            # TypeError covers valid JSON that is not an object (e.g. a list)
            print(f"Schema validation failed: {e}")
            return False, None


    # Usage in extraction pipeline
    # (llm_client is assumed to be an already-initialized OpenAI-style client)
    def extract_invoice(document_text: str) -> InvoiceExtraction:
        prompt = f"""Extract invoice data from this text as JSON:

    {document_text}

    Return valid JSON matching this schema:
    - invoice_number: string
    - date: YYYY-MM-DD format
    - total_amount: positive number
    - vendor_name: string
    - line_items: array of objects"""

        response = llm_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        raw_output = response.choices[0].message.content

        # Clean up markdown code blocks if present
        raw_output = raw_output.replace("```json", "").replace("```", "").strip()

        is_valid, parsed = validate_llm_output(raw_output)
        if not is_valid:
            # Retry with stronger prompt or escalate
            raise ExtractionValidationError("Failed to extract valid invoice data")
        return parsed
Results from production:
- 8% of LLM outputs failed schema validation initially
- After prompt engineering: 2% failure rate
- Automatic retry logic catches 95% of remaining failures
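The extraction function above raises on the first invalid output, so the retry step is not shown. A minimal sketch of what such a retry wrapper could look like follows. `extract_with_retry`, `parse_json_object`, and the wording of the stronger instruction are all assumptions for illustration, not the production logic behind the numbers above.

```python
import json
from typing import Callable, Optional


def extract_with_retry(
    generate: Callable[[str], str],
    validate: Callable[[str], Optional[dict]],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call `generate` until `validate` accepts the output or attempts run out."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        parsed = validate(raw)
        if parsed is not None:
            return parsed
        # Strengthen the instruction for the next attempt
        current_prompt = (
            f"{prompt}\n\nYour previous reply was not valid. "
            "Return ONLY a JSON object, with no prose and no code fences."
        )
    raise ValueError(f"No valid output after {max_attempts} attempts")


def parse_json_object(raw: str) -> Optional[dict]:
    """Return the parsed dict, or None if `raw` is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


# Usage with a stand-in generator that fails once, then succeeds
replies = iter(["Sure! Here is the data:", '{"invoice_number": "INV-1"}'])
result = extract_with_retry(lambda p: next(replies), parse_json_object, "Extract the invoice")
```

Passing the generator and validator in as callables keeps the retry policy independent of any particular client or schema, so the same wrapper can sit in front of the Pydantic validation shown earlier.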
Format Compliance Checks
Beyond JSON, check output meets business rules:
    import json
    import re
    from typing import Any, Dict, List


    class OutputValidator:
        """Deterministic checks for LLM outputs."""

        @staticmethod
        def _has_word(text: str, word: str) -> bool:
            """Whole-word, case-insensitive match, so "hi" does not match "this"."""
            return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None

        @staticmethod
        def check_length(output: str, min_len: int = 10, max_len: int = 2000) -> bool:
            """Check response is within length bounds."""
            return min_len <= len(output) <= max_len

        @staticmethod
        def check_forbidden_words(output: str, forbidden: List[str]) -> bool:
            """Check response doesn't contain prohibited terms."""
            return not any(OutputValidator._has_word(output, w) for w in forbidden)

        @staticmethod
        def check_required_sections(output: str, required: List[str]) -> bool:
            """Check ALL required sections are present."""
            output_lower = output.lower()
            return all(section.lower() in output_lower for section in required)

        @staticmethod
        def check_any_phrase(output: str, options: List[str]) -> bool:
            """Check AT LEAST ONE of the acceptable phrases is present."""
            return any(OutputValidator._has_word(output, o) for o in options)

        @staticmethod
        def check_json_validity(output: str) -> bool:
            """Check if output is valid JSON."""
            try:
                json.loads(output)
                return True
            except json.JSONDecodeError:
                return False

        @staticmethod
        def check_language(output: str, allowed_languages: List[str]) -> bool:
            """Check output is in allowed language."""
            # Simplified check - use langdetect library in production
            return True  # Placeholder


    # Usage
    def validate_support_response(response: str) -> Dict[str, Any]:
        """Validate customer support response."""
        checks = {
            "length_ok": OutputValidator.check_length(response, 50, 1000),
            "no_profanity": OutputValidator.check_forbidden_words(
                response, ["stupid", "idiot", "dumb"]
            ),
            # Any one greeting or closing is enough, so use check_any_phrase here
            "has_greeting": OutputValidator.check_any_phrase(response, ["Hello", "Hi"]),
            "has_closing": OutputValidator.check_any_phrase(
                response, ["Best", "Regards", "Thanks"]
            ),
        }
        return {
            "all_passed": all(checks.values()),
            "checks": checks,
            "failed_checks": [k for k, v in checks.items() if not v]
        }
Latency and Performance Checks
    import time
    from dataclasses import dataclass
    from typing import Optional

    MODEL = "gpt-4"


    @dataclass
    class LLMResponse:
        content: str
        latency_ms: float
        tokens_input: int
        tokens_output: int
        model: str


    def call_llm_with_validation(prompt: str) -> tuple[bool, Optional[LLMResponse]]:
        """Call LLM with full validation."""
        start = time.perf_counter()  # Monotonic clock, safe for measuring durations
        response = llm_client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=500
        )
        latency_ms = (time.perf_counter() - start) * 1000

        result = LLMResponse(
            content=response.choices[0].message.content,
            latency_ms=latency_ms,
            tokens_input=response.usage.prompt_tokens,
            tokens_output=response.usage.completion_tokens,
            model=MODEL
        )

        # Performance checks
        if latency_ms > 10000:  # 10 seconds
            print(f"Warning: High latency {latency_ms:.0f}ms")
            return False, None
        if result.tokens_output > 450:  # Near the 500-token limit, probably truncated
            print("Warning: Response may be truncated")
        return True, result
Tier 2: Heuristic Metrics (Limited Value)
ROUGE, BLEU, BERTScore measure text similarity. They're okay for summarization, useless for creative tasks.
    from typing import List

    from rouge import Rouge
    from bert_score import score


    def calculate_rouge(reference: str, hypothesis: str) -> dict:
        """Calculate ROUGE scores."""
        rouge = Rouge()
        scores = rouge.get_scores(hypothesis, reference)[0]
        return {
            "rouge-1": scores["rouge-1"]["f"],
            "rouge-2": scores["rouge-2"]["f"],
            "rouge-l": scores["rouge-l"]["f"]
        }


    def calculate_bertscore(references: List[str], candidates: List[str]) -> dict:
        """Calculate BERTScore (semantic similarity)."""
        P, R, F1 = score(candidates, references, lang="en")
        return {
            "precision": P.mean().item(),
            "recall": R.mean().item(),
            "f1": F1.mean().item()
        }

    # My take: These are trailing indicators. If your ROUGE score drops,
    # something's already broken. Don't rely on them for quality gates.
Tier 3: LLM-as-Judge (The Workhorse)
Use a capable model (GPT-4, Claude 3.5) to evaluate outputs against strict rubrics.
Building the Judge Prompt
    import json

    JUDGE_PROMPT = """You are an expert evaluator of AI-generated responses.
    Evaluate the following response based on the criteria below.

    ## Response to Evaluate
    {response}

    ## Original Query
    {query}

    ## Context (if any)
    {context}

    ## Evaluation Criteria
    Score each dimension 1-5:

    1. **Accuracy** (1-5): Is the information factually correct?
       - 5: Completely accurate, no errors
       - 4: Mostly accurate, minor errors
       - 3: Some accurate info, some errors
       - 2: Mostly incorrect
       - 1: Completely wrong or fabricated

    2. **Relevance** (1-5): Does it directly address the query?
       - 5: Perfectly addresses the query
       - 4: Mostly relevant with minor tangents
       - 3: Partially relevant
       - 2: Barely relevant
       - 1: Completely off-topic

    3. **Completeness** (1-5): Does it cover all aspects of the query?
       - 5: Comprehensive coverage
       - 4: Good coverage, minor gaps
       - 3: Partial coverage
       - 2: Major gaps
       - 1: Incomplete

    4. **Tone** (1-5): Is the tone appropriate?
       - 5: Perfect tone (professional, helpful)
       - 4: Good tone, minor issues
       - 3: Acceptable tone
       - 2: Tone issues
       - 1: Inappropriate tone

    5. **Safety** (1-5): Is the response safe and ethical?
       - 5: Completely safe
       - 4: Minor concerns
       - 3: Some concerns
       - 2: Significant issues
       - 1: Harmful or unsafe

    ## Instructions
    1. Provide scores for each dimension
    2. Explain your reasoning for each score
    3. Identify specific issues if any
    4. Suggest improvements if score < 4

    ## Output Format
    Return JSON only:
    {{
      "accuracy": {{"score": 4, "reasoning": "..."}},
      "relevance": {{"score": 5, "reasoning": "..."}},
      "completeness": {{"score": 3, "reasoning": "..."}},
      "tone": {{"score": 5, "reasoning": "..."}},
      "safety": {{"score": 5, "reasoning": "..."}},
      "overall": 4.4,
      "issues": ["issue1", "issue2"],
      "suggestions": ["suggestion1"]
    }}"""

    DIMENSIONS = ("accuracy", "relevance", "completeness", "tone", "safety")


    class LLMJudge:
        def __init__(self, model: str = "gpt-4"):
            self.model = model

        def evaluate(self, response: str, query: str, context: str = "") -> dict:
            """Evaluate response using LLM-as-judge."""
            prompt = JUDGE_PROMPT.format(response=response, query=query, context=context)
            evaluation = llm_client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,  # Reduces run-to-run variance (not a hard guarantee)
                response_format={"type": "json_object"}  # Needs a model with JSON mode
            )
            result = json.loads(evaluation.choices[0].message.content)

            # Recompute the overall score rather than trusting the model's arithmetic
            result["overall"] = sum(result[d]["score"] for d in DIMENSIONS) / len(DIMENSIONS)

            # Add pass/fail threshold
            result["passed"] = result["overall"] >= 4.0
            return result
Reducing Judge Bias
One judge is biased. Use multiple judges and consensus:
    from typing import List, Optional

    import numpy as np


    class ConsensusJudge:
        def __init__(self, judges: Optional[List[str]] = None):
            # Note: LLMJudge sends every model name through llm_client, so mixing
            # providers assumes a client or gateway that can route both names.
            self.judges = judges or ["gpt-4", "claude-3-5-sonnet"]

        def evaluate_consensus(self, response: str, query: str) -> dict:
            """Get consensus from multiple judges."""
            evaluations = []
            for judge_model in self.judges:
                judge = LLMJudge(model=judge_model)
                evaluations.append(judge.evaluate(response, query))

            # Calculate consensus scores
            consensus = {
                "accuracy": self._consensus_score(
                    [e["accuracy"]["score"] for e in evaluations]
                ),
                "relevance": self._consensus_score(
                    [e["relevance"]["score"] for e in evaluations]
                ),
                "overall": float(np.mean([e["overall"] for e in evaluations])),
                "agreement": self._calculate_agreement(evaluations),
                "individual_evaluations": evaluations
            }
            consensus["passed"] = consensus["overall"] >= 4.0
            return consensus

        def _consensus_score(self, scores: List[int]) -> dict:
            """Calculate consensus metrics for a dimension."""
            return {
                "mean": float(np.mean(scores)),
                "std": float(np.std(scores)),
                "min": min(scores),
                "max": max(scores),
                "agreement": "high" if np.std(scores) < 0.5 else "low"
            }

        def _calculate_agreement(self, evaluations: List[dict]) -> bool:
            """Assumed definition: True if every judge reached the same pass/fail verdict."""
            return len({e["passed"] for e in evaluations}) == 1
Cost Optimization
LLM-as-judge is expensive. Optimize:
    import hashlib
    import math
    from typing import List, Tuple


    class SamplingEvaluator:
        """Evaluate sample of outputs rather than all."""

        def __init__(self, sample_rate: float = 0.1):
            self.sample_rate = sample_rate

        def should_evaluate(self, request_id: str) -> bool:
            """Deterministic sampling based on request ID."""
            hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
            return (hash_val % 1000) / 1000 < self.sample_rate

        def evaluate_batch(self, responses: List[dict], judge: LLMJudge) -> dict:
            """Evaluate sampled batch."""
            to_evaluate = [
                r for r in responses if self.should_evaluate(r["request_id"])
            ]
            if not to_evaluate:
                # Nothing was sampled, so there is no pass rate to report
                return {"sampled_count": 0, "total_count": len(responses), "pass_rate": None}

            results = [judge.evaluate(r["output"], r["input"]) for r in to_evaluate]

            # Extrapolate to full population
            pass_rate = sum(1 for r in results if r["passed"]) / len(results)
            return {
                "sampled_count": len(to_evaluate),
                "total_count": len(responses),
                "pass_rate": pass_rate,
                "confidence_interval": self._calculate_ci(pass_rate, len(to_evaluate)),
                "estimated_passed": int(pass_rate * len(responses))
            }

        @staticmethod
        def _calculate_ci(pass_rate: float, n: int, z: float = 1.96) -> Tuple[float, float]:
            """Normal-approximation 95% confidence interval for the pass rate."""
            margin = z * math.sqrt(pass_rate * (1 - pass_rate) / n)
            return max(0.0, pass_rate - margin), min(1.0, pass_rate + margin)
Cost comparison:
- Evaluating 100% of outputs: ~$500/month for 10K requests
- Evaluating 10% sample: ~$50/month
- 90% cost savings with statistically valid confidence intervals
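How wide is the confidence interval on a 10% sample? As a rough illustration, the sketch below computes a Wilson score interval, which tends to behave better than the plain normal approximation when the pass rate is near 0 or 1. The function name `wilson_interval` and the 850-of-1,000 figures are made up for the example; 1,000 is simply 10% of the 10K requests mentioned above.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one sampled evaluation")
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical sample: 850 of 1,000 sampled outputs passed the judge
low, high = wilson_interval(passed=850, n=1000)
print(f"observed pass rate 85.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

With a sample of this size the interval spans only a few percentage points, which is usually tight enough to detect a meaningful regression. Smaller samples widen it quickly, so it is worth checking the interval before lowering the sample rate further.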
RAG-Specific Evaluation
For retrieval-augmented generation, you need additional metrics:
Faithfulness
Does the generated answer actually use the retrieved context?
    import json
    from typing import List


    def calculate_faithfulness(answer: str, contexts: List[str]) -> float:
        """Check if answer is grounded in retrieved contexts."""
        # Extract claims from answer
        claims = extract_claims(answer)  # Use NER or LLM
        if not claims:
            return 0.0

        # A claim counts as grounded if it is close to at least one context.
        # semantic_similarity is an assumed helper (e.g. embedding cosine similarity).
        grounded_claims = sum(
            1 for claim in claims
            if any(semantic_similarity(claim, context) > 0.8 for context in contexts)
        )
        return grounded_claims / len(claims)


    def extract_claims(text: str) -> List[str]:
        """Extract factual claims from text."""
        # Simplified: use LLM to extract claims. JSON mode returns an object,
        # so ask for the list under an explicit "claims" key.
        prompt = (
            "Extract all factual claims from this text. Return a JSON object "
            f'of the form {{"claims": ["...", "..."]}}:\n\n{text}'
        )
        response = llm_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content).get("claims", [])
Answer Relevance
Is the answer relevant to the question, not just the retrieved context?
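    import numpy as np


    def calculate_answer_relevance(question: str, answer: str, embedding_model) -> float:
        """Calculate semantic similarity between question and answer."""
        # embedding_model is assumed to expose encode(text) -> 1-D vector
        q_embedding = np.asarray(embedding_model.encode(question), dtype=float)
        a_embedding = np.asarray(embedding_model.encode(answer), dtype=float)

        # Cosine similarity, guarding against zero-length vectors
        denom = np.linalg.norm(q_embedding) * np.linalg.norm(a_embedding)
        return float(np.dot(q_embedding, a_embedding) / denom) if denom else 0.0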
Context Retrieval Accuracy
Did we retrieve the right chunks?
    from typing import List


    def evaluate_retrieval(
        query: str,
        retrieved_chunks: List[str],
        ground_truth_chunks: List[str]  # Labeled dataset
    ) -> dict:
        """Evaluate retrieval quality."""
        relevant = set(ground_truth_chunks)
        relevant_retrieved = sum(1 for chunk in retrieved_chunks if chunk in relevant)

        # Precision: % of retrieved that are relevant (0 if nothing was retrieved)
        precision = relevant_retrieved / len(retrieved_chunks) if retrieved_chunks else 0.0

        # Recall: % of relevant that were retrieved (0 if there are no labels)
        recall = relevant_retrieved / len(ground_truth_chunks) if ground_truth_chunks else 0.0

        # F1 score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

        return {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "retrieved_count": len(retrieved_chunks),
            "relevant_count": len(ground_truth_chunks)
        }
Tier 4: A/B Testing in Production
The ultimate evaluation is human judgment at scale: which version do users actually prefer?
Shadow Mode
Test new model without affecting users:
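    import asyncio


    class ShadowModeEvaluator:
        """Run new model in shadow mode."""

        def __init__(self, production_model: str, candidate_model: str):
            self.prod_model = production_model
            self.candidate_model = candidate_model
            self.sample_rate = 0.1
            self._shadow_tasks = set()  # Hold references so tasks aren't garbage-collected

        async def generate_with_shadow(self, prompt: str, request_id: str) -> dict:
            """Generate with production model, shadow test candidate."""
            # Always call production model (return this to user)
            prod_response = await self._call_model(self.prod_model, prompt)

            # Sample for shadow evaluation. Run it as a background task so the
            # candidate call really doesn't block the user-facing response.
            if self._should_shadow(request_id):
                task = asyncio.create_task(
                    self._shadow_compare(request_id, prompt, prod_response)
                )
                self._shadow_tasks.add(task)
                task.add_done_callback(self._shadow_tasks.discard)

            return prod_response

        async def _shadow_compare(self, request_id: str, prompt: str, prod_response: dict) -> None:
            """Call the candidate model and log both responses for comparison."""
            try:
                candidate_response = await asyncio.wait_for(
                    self._call_model(self.candidate_model, prompt),
                    timeout=30.0
                )
                await self._log_comparison(
                    request_id, prompt, prod_response, candidate_response
                )
            except asyncio.TimeoutError:
                print(f"Shadow call timed out for {request_id}")

        # _call_model, _should_shadow and _log_comparison depend on your stack
        # and are not shown here.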
Gradual Rollout
    import hashlib

    from scipy import stats


    class GradualRollout:
        """Gradually shift traffic to new model."""

        def __init__(self):
            self.rollout_percentage = 0  # Start at 0%

        def select_model(self, user_id: str) -> str:
            """Route user to model based on rollout percentage."""
            # Deterministic routing: the same user always lands in the same bucket
            hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
            user_bucket = hash_val % 100
            if user_bucket < self.rollout_percentage:
                return "new_model"
            return "production_model"

        def evaluate_rollout(self, days: int = 7) -> dict:
            """Evaluate metrics during rollout."""
            # _get_metrics (not shown) is assumed to return a dict with a list of
            # per-user scores under "user_satisfaction" and their "mean".
            metrics = {
                "new_model": self._get_metrics("new_model", days),
                "production_model": self._get_metrics("production_model", days)
            }

            # Statistical significance test
            t_stat, p_value = stats.ttest_ind(
                metrics["new_model"]["user_satisfaction"],
                metrics["production_model"]["user_satisfaction"]
            )

            new_model_better = metrics["new_model"]["mean"] > metrics["production_model"]["mean"]
            significant = p_value < 0.05
            return {
                "new_model_better": new_model_better,
                "statistically_significant": significant,
                "p_value": p_value,
                "recommended_action": "increase_rollout" if (significant and new_model_better) else "hold"
            }
User Feedback Loops
    from datetime import datetime, timedelta, timezone


    class FeedbackCollector:
        """Collect explicit user feedback (thumbs up/down)."""

        def __init__(self, db):
            self.db = db  # Assumed to be a pymongo Database handle

        def record_feedback(
            self,
            request_id: str,
            user_id: str,
            feedback: str,  # "positive" or "negative"
            comment: str = ""
        ):
            """Store user feedback."""
            feedback_record = {
                "request_id": request_id,
                "user_id": user_id,
                "feedback": feedback,
                "comment": comment,
                "timestamp": datetime.now(timezone.utc),
                "model_version": self._get_model_for_request(request_id)
            }
            # Store in database
            self.db.feedback.insert_one(feedback_record)

            # Real-time alert if negative feedback spike
            self._check_feedback_spike()

        def calculate_satisfaction_rate(self, model_version: str, days: int = 7) -> float:
            """Calculate satisfaction rate for model version."""
            query = {
                "model_version": model_version,
                "timestamp": {"$gte": datetime.now(timezone.utc) - timedelta(days=days)}
            }
            # count_documents replaces cursor.count(), which PyMongo 4 removed
            total = self.db.feedback.count_documents(query)
            positive = self.db.feedback.count_documents({**query, "feedback": "positive"})
            return positive / total if total > 0 else 0.0

        # _get_model_for_request and _check_feedback_spike are not shown here.
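The collector calls a `_check_feedback_spike` hook that is not shown. One simple way to implement such a check is a rolling window over the most recent feedback events, sketched below. The class name `FeedbackSpikeDetector`, the window size, and the thresholds are assumptions chosen for illustration, not values from a production system.

```python
from collections import deque


class FeedbackSpikeDetector:
    """Flag when the negative share of the most recent feedback gets too high."""

    def __init__(self, window: int = 50, threshold: float = 0.3, min_samples: int = 20):
        self.recent = deque(maxlen=window)  # True means negative feedback
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, feedback: str) -> bool:
        """Record one event; return True if the negative rate reaches the threshold."""
        self.recent.append(feedback == "negative")
        if len(self.recent) < self.min_samples:
            return False  # Too few samples to judge
        negative_rate = sum(self.recent) / len(self.recent)
        return negative_rate >= self.threshold


# Usage: five positive events followed by five negative ones
detector = FeedbackSpikeDetector(window=10, threshold=0.5, min_samples=5)
alerts = [detector.record(f) for f in ["positive"] * 5 + ["negative"] * 5]
```

In this run the alert fires only on the last event, once half of the ten-event window is negative. The `min_samples` guard keeps a single early thumbs-down from paging anyone.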
Building Your Evaluation Framework
Offline Evaluation Dataset
Before deploying, build a labeled dataset:
    # examples/evaluation_dataset.jsonl
    # (pretty-printed here; in the file each record sits on a single line)
    {
      "id": "eval_001",
      "input": "What are the system requirements?",
      "context": "System Requirements: 8GB RAM, 4 CPU cores, Python 3.9+",
      "expected_output": "You need 8GB RAM, 4 CPU cores, and Python 3.9 or higher.",
      "evaluation_criteria": {
        "min_length": 20,
        "must_contain": ["8GB", "4 CPU", "Python 3.9"]
      }
    }
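A record like this needs a function that turns `evaluation_criteria` into a pass/fail verdict. The sketch below is one plausible reading of the two criteria keys shown, `min_length` and `must_contain`. The case-insensitive substring matching is an assumption; a real dataset may need stricter rules.

```python
from typing import Any, Dict


def evaluate_output(output: str, criteria: Dict[str, Any]) -> bool:
    """Return True if `output` satisfies the deterministic criteria."""
    if len(output) < criteria.get("min_length", 0):
        return False
    lowered = output.lower()
    # Every required term must appear (case-insensitive substring match)
    return all(term.lower() in lowered for term in criteria.get("must_contain", []))


criteria = {"min_length": 20, "must_contain": ["8GB", "4 CPU", "Python 3.9"]}
ok = evaluate_output("You need 8GB RAM, 4 CPU cores, and Python 3.9 or higher.", criteria)
missing = evaluate_output("You need 8GB RAM and a recent Python version.", criteria)
```

Here the first answer passes and the second fails because it omits the CPU and Python version requirements. Checks like these are deliberately strict and cheap; anything more nuanced belongs in the judge tier.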
CI/CD Integration
Block deployment on regression:
    # tests/test_llm_regression.py
    import pytest

    # load_eval_dataset, current_model and evaluate_output are assumed to be
    # provided by your project; the async test needs the pytest-asyncio plugin.


    @pytest.mark.evaluation
    @pytest.mark.asyncio
    async def test_no_regression():
        """Ensure new model doesn't regress on evaluation set."""
        # Load evaluation dataset
        eval_dataset = load_eval_dataset()

        # Run current model
        current_results = []
        for example in eval_dataset:
            result = await current_model.generate(example["input"])
            current_results.append({
                "example_id": example["id"],
                "output": result,
                "passed": evaluate_output(result, example["evaluation_criteria"])
            })

        current_pass_rate = sum(1 for r in current_results if r["passed"]) / len(current_results)

        # Compare to baseline, allowing a 2-point tolerance for noise
        baseline_pass_rate = 0.85  # From previous run
        assert current_pass_rate >= baseline_pass_rate - 0.02, \
            f"Regression detected: {current_pass_rate:.2%} vs baseline {baseline_pass_rate:.2%}"
Monitoring Dashboard
Track these metrics:
    # Prometheus metrics
    from prometheus_client import Gauge, Histogram

    EVALUATION_PASS_RATE = Gauge(
        'llm_eval_pass_rate',
        'Pass rate by evaluation tier',
        ['tier']
    )

    JUDGE_SCORE = Histogram(
        'llm_judge_score',
        'LLM-as-judge scores',
        ['dimension'],
        buckets=(1, 2, 3, 4, 5)  # Default buckets are tuned for latencies, not 1-5 scores
    )

    RAG_FAITHFULNESS = Gauge(
        'rag_faithfulness',
        'RAG faithfulness score'
    )

    USER_SATISFACTION = Gauge(
        'user_satisfaction_rate',
        'User thumbs up rate'
    )
What Good Looks Like
Real numbers from production systems:
| Metric | Target | Good | Excellent |
|---|---|---|---|
| Deterministic pass rate | > 95% | 97% | 99% |
| LLM-as-judge pass rate | > 80% | 85% | 90% |
| RAG faithfulness | > 70% | 80% | 90% |
| User satisfaction | > 75% | 85% | 90% |
| A/B test p-value | < 0.05 | 0.01 | 0.001 |
Cost per evaluation:
- Deterministic: $0.0001 (running code)
- LLM-as-judge: $0.005 per evaluation
- Human evaluation: $0.50-$2.00 per sample
The "Don't" List
- ❌ Don't rely solely on ROUGE/BERTScore for generative tasks
- ❌ Don't use single-judge evaluation (bias is real)
- ❌ Don't run LLM-as-judge on 100% of traffic (wasteful; sample instead)
- ❌ Don't skip deterministic checks (they're free!)
- ❌ Don't ignore user feedback (ground truth)
Production Checklist
Before trusting your evaluation:
- Deterministic checks for all structured outputs
- LLM-as-judge rubric defined and tested
- Sampling strategy implemented (10-20%)
- RAG metrics if using retrieval (faithfulness, relevance)
- A/B testing framework ready
- User feedback collection active
- CI/CD evaluation gates configured
- Dashboard with pass rates by tier
- Alerting on evaluation failure spikes
Next Steps
Evaluation tells you if your system works. But how do you decide which technique to use in the first place? In the final post, I'll share the decision framework I use to choose between prompt engineering, RAG, and fine-tuning—complete with ROI analysis.
Code examples: Evaluation framework GitHub