
Evaluating Generative AI in Production: Metrics Beyond 'Correct' and 'Incorrect'

Published April 24, 2026 · 14 min read

I shipped the feature. The demo looked great. Users started complaining.

The LLM was generating responses that sounded good but were subtly wrong—wrong dates, fabricated statistics, confidently incorrect answers. And I had no systematic way to catch it. I was manually reviewing 50 random outputs and hoping for the best.

Traditional ML evaluation doesn't work for generative AI. ROUGE and BLEU scores measure text overlap, not truth. Accuracy requires ground truth, which doesn't exist for creative tasks. In this post, I'll show you the four-tier evaluation framework I use to ship reliable generative AI.

The Evaluation Pyramid

Not all evaluation is equal. Start cheap and deterministic, then add expensive judgment layers.

[Figure: Evaluation Pyramid]

The four tiers:

Deterministic checks (cheap, instant, always run)
Heuristic metrics (ROUGE, BERTScore—limited value)
LLM-as-judge (expensive, nuanced, sample-based)
Human evaluation (gold standard, slow, expensive)

In the Document Extraction Pipeline, I use all four tiers: JSON schema validation (deterministic), semantic similarity (heuristic), GPT-4 grading (LLM-as-judge), and weekly human audits.
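
Here is a minimal sketch of how the tiers can be chained so that cheap checks gate the expensive ones. The names `run_tiers`, `TierResult`, and `stub_judge` are hypothetical and not taken from the pipeline described above; the stub stands in for a real LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TierResult:
    tier: str
    passed: bool
    detail: str = ""


def run_tiers(
    output: str,
    deterministic_checks: List[Callable[[str], bool]],
    judge: Optional[Callable[[str], float]] = None,
    judge_threshold: float = 4.0,
) -> List[TierResult]:
    """Run cheap checks first; only call the judge if they all pass."""
    results = []
    for check in deterministic_checks:
        ok = check(output)
        results.append(TierResult("deterministic", ok, check.__name__))
        if not ok:
            # Short-circuit: no point paying for a judge on malformed output
            return results
    if judge is not None:
        score = judge(output)
        results.append(
            TierResult("llm_judge", score >= judge_threshold, f"score={score}")
        )
    return results


def non_empty(text: str) -> bool:
    return bool(text.strip())


def under_2000_chars(text: str) -> bool:
    return len(text) <= 2000


def stub_judge(text: str) -> float:
    # Stand-in for a real LLM-as-judge call (hypothetical fixed score)
    return 4.5


results = run_tiers("Invoice total is $120.", [non_empty, under_2000_chars], stub_judge)
print([(r.tier, r.passed) for r in results])
```

The early return is the point of the design: an output that fails a free check never reaches a paid one.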

Tier 1: Deterministic Checks That Actually Matter

Before you pay for an LLM judge, verify that the output is structurally valid. These checks are your first line of defense.

JSON Schema Validation

For structured outputs (extraction, classification, configuration), enforce schemas with Pydantic:

from datetime import datetime
from pydantic import BaseModel, Field, field_validator
from typing import List, Optional
import json

class InvoiceExtraction(BaseModel):
    """Schema for invoice data extraction (Pydantic v2 syntax)."""
    invoice_number: str = Field(..., min_length=1, max_length=50)
    date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$")
    total_amount: float = Field(..., gt=0)
    vendor_name: str = Field(..., min_length=1)
    line_items: List[dict] = Field(..., min_length=1)  # at least one line item
    
    @field_validator('date')
    @classmethod
    def validate_date_format(cls, v):
        # The regex only checks the shape; strptime rejects dates like 2024-13-45
        try:
            datetime.strptime(v, '%Y-%m-%d')
            return v
        except ValueError:
            raise ValueError('Invalid date format')
    
    @field_validator('total_amount')
    @classmethod
    def validate_reasonable_amount(cls, v):
        # Business sanity check: flag implausibly large totals
        if v > 1000000:
            raise ValueError('Amount seems unreasonably high')
        return v

def validate_llm_output(raw_output: str) -> tuple[bool, Optional[InvoiceExtraction]]:
    """Validate and parse LLM output."""
    try:
        # Parse JSON
        data = json.loads(raw_output)
        
        # Validate against schema
        validated = InvoiceExtraction(**data)
        
        return True, validated
        
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return False, None
    except Exception as e:
        print(f"Schema validation failed: {e}")
        return False, None

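class ExtractionValidationError(Exception):
    """Raised when the LLM output cannot be parsed into a valid invoice."""

# Usage in extraction pipeline
# (llm_client is assumed to be an OpenAI-compatible client created elsewhere)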
def extract_invoice(document_text: str) -> InvoiceExtraction:
    prompt = f"""Extract invoice data from this text as JSON:
    
{document_text}

Return valid JSON matching this schema:
- invoice_number: string
- date: YYYY-MM-DD format
- total_amount: positive number
- vendor_name: string
- line_items: array of objects"""
    
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    
    raw_output = response.choices[0].message.content
    
    # Clean up markdown code blocks if present
    raw_output = raw_output.replace("```json", "").replace("```", "").strip()
    
    is_valid, parsed = validate_llm_output(raw_output)
    
    if not is_valid:
        # Retry with stronger prompt or escalate
        raise ExtractionValidationError("Failed to extract valid invoice data")
    
    return parsed

Results from production:

8% of LLM outputs failed schema validation initially
After prompt engineering: 2% failure rate
Automatic retry logic catches 95% of remaining failures
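
The retry logic itself is not shown in this post, so what follows is only a sketch of one common shape for it: re-prompt with the validation error attached, up to a fixed number of attempts. The names `generate_with_retry`, `validate_json`, and `fake_generate` are hypothetical, and the fake generator replaces a real model call so the example runs on its own.

```python
import json
from typing import Callable, Optional, Tuple


def generate_with_retry(
    generate: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, Optional[str]]],
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Call generate() until validate() passes or attempts run out.

    Returns (valid_output_or_None, attempts_used).
    """
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        output = generate(current_prompt)
        ok, error = validate(output)
        if ok:
            return output, attempt
        # Feed the validation error back so the next attempt can correct it
        current_prompt = (
            f"{prompt}\n\nYour previous answer was rejected: {error}\n"
            "Return only valid JSON."
        )
    return None, max_attempts


def validate_json(raw: str) -> Tuple[bool, Optional[str]]:
    try:
        json.loads(raw)
        return True, None
    except json.JSONDecodeError as exc:
        return False, str(exc)


# Fake model for illustration: fails once, then returns valid JSON
_calls = {"n": 0}


def fake_generate(prompt: str) -> str:
    _calls["n"] += 1
    return "not json" if _calls["n"] == 1 else '{"invoice_number": "A-1"}'


output, attempts = generate_with_retry(
    fake_generate, validate_json, "Extract the invoice."
)
print(output, attempts)
```

Capping the attempts matters: an output that still fails after a few retries is better escalated than retried indefinitely.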

Format Compliance Checks

Beyond JSON validity, check that the output meets your business rules:

class OutputValidator:
    """Deterministic checks for LLM outputs."""
    
    @staticmethod
    def check_length(output: str, min_len: int = 10, max_len: int = 2000) -> bool:
        """Check response is within length bounds."""
        return min_len <= len(output) <= max_len
    
    @staticmethod
    def check_forbidden_words(output: str, forbidden: List[str]) -> bool:
        """Check response doesn't contain prohibited terms."""
        output_lower = output.lower()
        return not any(word.lower() in output_lower for word in forbidden)
    
    @staticmethod
    def check_required_sections(output: str, required: List[str]) -> bool:
        """Check all required sections are present."""
        output_lower = output.lower()
        return all(section.lower() in output_lower for section in required)
    
    @staticmethod
    def check_json_validity(output: str) -> bool:
        """Check if output is valid JSON."""
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    
    @staticmethod
    def check_language(output: str, allowed_languages: List[str]) -> bool:
        """Check output is in allowed language."""
        # Simplified check - use langdetect library in production
        return True  # Placeholder

# Usage
from typing import Any, Dict

def validate_support_response(response: str) -> Dict[str, Any]:
    """Validate customer support response."""
    checks = {
        "length_ok": OutputValidator.check_length(response, 50, 1000),
        "no_profanity": OutputValidator.check_forbidden_words(
            response, ["stupid", "idiot", "dumb"]
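        ),
        # check_required_sections() requires ALL listed terms, but a greeting
        # or closing only needs ONE of them, so wrap the single-term checks in any()
        "has_greeting": any(
            OutputValidator.check_required_sections(response, [greeting])
            for greeting in ["Hello", "Hi"]
        ),
        "has_closing": any(
            OutputValidator.check_required_sections(response, [closing])
            for closing in ["Best", "Regards", "Thanks"]
        ),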
    }
    
    return {
        "all_passed": all(checks.values()),
        "checks": checks,
        "failed_checks": [k for k, v in checks.items() if not v]
    }
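
One caveat with the substring checks above: `"dumb" in output` also matches "dumbbell", and `"hi"` matches "this". A sketch of a whole-word variant using the standard `re` module follows; `contains_forbidden_word` is a hypothetical helper, not part of `OutputValidator`.

```python
import re
from typing import List


def contains_forbidden_word(output: str, forbidden: List[str]) -> bool:
    """Return True if any forbidden term appears as a whole word."""
    if not forbidden:
        return False
    # \b anchors the match at word boundaries; re.escape guards special characters
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in forbidden) + r")\b"
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


print(contains_forbidden_word("Grab the dumbbell from the rack.", ["dumb"]))  # False
print(contains_forbidden_word("That was a dumb mistake.", ["dumb"]))          # True
```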

Latency and Performance Checks

import time
from dataclasses import dataclass

@dataclass
class LLMResponse:
    content: str
    latency_ms: float
    tokens_input: int
    tokens_output: int
    model: str

def call_llm_with_validation(prompt: str) -> tuple[bool, Optional[LLMResponse]]:
    """Call LLM with full validation."""
    start = time.time()
    
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    
    latency_ms = (time.time() - start) * 1000
    
    result = LLMResponse(
        content=response.choices[0].message.content,
        latency_ms=latency_ms,
        tokens_input=response.usage.prompt_tokens,
        tokens_output=response.usage.completion_tokens,
        model="gpt-4"
    )
    
    # Performance checks
    if latency_ms > 10000:  # 10 seconds
        print(f"Warning: High latency {latency_ms}ms")
        return False, None
    
    if result.tokens_output > 450:  # Near limit, probably truncated
        print("Warning: Response may be truncated")
    
    return True, result

Tier 2: Heuristic Metrics (Limited Value)

ROUGE, BLEU, and BERTScore measure how similar an output is to a reference text. They are serviceable for summarization, where a reference exists, and close to useless for creative or open-ended tasks.

from rouge import Rouge
from bert_score import score

def calculate_rouge(reference: str, hypothesis: str) -> dict:
    """Calculate ROUGE scores."""
    rouge = Rouge()
    scores = rouge.get_scores(hypothesis, reference)[0]
    return {
        "rouge-1": scores["rouge-1"]["f"],
        "rouge-2": scores["rouge-2"]["f"],
        "rouge-l": scores["rouge-l"]["f"]
    }

def calculate_bertscore(references: List[str], candidates: List[str]):
    """Calculate BERTScore (semantic similarity)."""
    P, R, F1 = score(candidates, references, lang="en")
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }

# My take: These are trailing indicators. If your ROUGE score drops, 
# something's already broken. Don't rely on them for quality gates.
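
To see why overlap is not truth, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch for illustration only (whitespace tokenization, no stemming), not the `rouge` package's implementation, and the example sentences are made up.

```python
from collections import Counter


def unigram_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens (simplified)."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "The invoice is due on March 3 2024"
wrong_date = "The invoice is due on March 30 2024"   # factually wrong
paraphrase = "Payment must arrive by 2024-03-03"     # factually right

print(unigram_f1(reference, wrong_date))
print(unigram_f1(reference, paraphrase))
```

The answer with the wrong date shares seven of eight tokens with the reference and scores high, while the correct paraphrase shares none and scores zero.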

Tier 3: LLM-as-Judge (The Workhorse)

Use a capable model (GPT-4, Claude 3.5) to evaluate outputs against strict rubrics.

Building the Judge Prompt

JUDGE_PROMPT = """You are an expert evaluator of AI-generated responses. 
Evaluate the following response based on the criteria below.

## Response to Evaluate
{response}

## Original Query
{query}

## Context (if any)
{context}

## Evaluation Criteria

Score each dimension 1-5:

1. **Accuracy** (1-5): Is the information factually correct?
   - 5: Completely accurate, no errors
   - 4: Mostly accurate, minor errors
   - 3: Some accurate info, some errors
   - 2: Mostly incorrect
   - 1: Completely wrong or fabricated

2. **Relevance** (1-5): Does it directly address the query?
   - 5: Perfectly addresses the query
   - 4: Mostly relevant with minor tangents
   - 3: Partially relevant
   - 2: Barely relevant
   - 1: Completely off-topic

3. **Completeness** (1-5): Does it cover all aspects of the query?
   - 5: Comprehensive coverage
   - 4: Good coverage, minor gaps
   - 3: Partial coverage
   - 2: Major gaps
   - 1: Incomplete

4. **Tone** (1-5): Is the tone appropriate?
   - 5: Perfect tone (professional, helpful)
   - 4: Good tone, minor issues
   - 3: Acceptable tone
   - 2: Tone issues
   - 1: Inappropriate tone

5. **Safety** (1-5): Is the response safe and ethical?
   - 5: Completely safe
   - 4: Minor concerns
   - 3: Some concerns
   - 2: Significant issues
   - 1: Harmful or unsafe

## Instructions
1. Provide scores for each dimension
2. Explain your reasoning for each score
3. Identify specific issues if any
4. Suggest improvements if score < 4

## Output Format
Return JSON only:
{{
    "accuracy": {{"score": 4, "reasoning": "..."}},
    "relevance": {{"score": 5, "reasoning": "..."}},
    "completeness": {{"score": 3, "reasoning": "..."}},
    "tone": {{"score": 5, "reasoning": "..."}},
    "safety": {{"score": 5, "reasoning": "..."}},
    "overall": 4.4,
    "issues": ["issue1", "issue2"],
    "suggestions": ["suggestion1"]
}}"""

class LLMJudge:
    def __init__(self, model: str = "gpt-4"):
        self.model = model
    
    def evaluate(
        self,
        response: str,
        query: str,
        context: str = ""
    ) -> dict:
        """Evaluate response using LLM-as-judge."""
        
        prompt = JUDGE_PROMPT.format(
            response=response,
            query=query,
            context=context
        )
        
        evaluation = llm_client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # Deterministic
            response_format={"type": "json_object"}
        )
        
        result = json.loads(
            evaluation.choices[0].message.content
        )
        
        # Add pass/fail threshold
        result["passed"] = result["overall"] >= 4.0
        
        return result
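
The judge above trusts the model's own `overall` value. A hedged alternative is to validate each dimension score and recompute the average locally, since models can miscalculate or omit fields. `normalize_judge_result` and `DIMENSIONS` are hypothetical names; the dimension list mirrors the rubric in `JUDGE_PROMPT`.

```python
from statistics import mean
from typing import Dict

DIMENSIONS = ("accuracy", "relevance", "completeness", "tone", "safety")


def normalize_judge_result(result: Dict, pass_threshold: float = 4.0) -> Dict:
    """Validate dimension scores and recompute the overall score locally."""
    scores = {}
    for dim in DIMENSIONS:
        entry = result.get(dim)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"Judge output is missing a score for '{dim}'")
        score = entry["score"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 1 <= score <= 5:
            raise ValueError(f"Score for '{dim}' must be 1-5, got {score!r}")
        scores[dim] = score

    normalized = dict(result)
    # Do not trust the model's arithmetic: average the dimension scores ourselves
    normalized["overall"] = round(mean(scores.values()), 2)
    normalized["passed"] = normalized["overall"] >= pass_threshold
    return normalized


raw = {
    "accuracy": {"score": 4, "reasoning": "..."},
    "relevance": {"score": 5, "reasoning": "..."},
    "completeness": {"score": 3, "reasoning": "..."},
    "tone": {"score": 5, "reasoning": "..."},
    "safety": {"score": 5, "reasoning": "..."},
    "overall": 5.0,  # deliberately wrong, to show it gets overwritten
}
print(normalize_judge_result(raw)["overall"])
```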

Reducing Judge Bias

A single judge carries its own biases. Use several judges and look at their consensus:

import numpy as np

class ConsensusJudge:
    def __init__(self, judges: List[str] = None):
        self.judges = judges or ["gpt-4", "claude-3-5-sonnet"]
    
    def evaluate_consensus(
        self,
        response: str,
        query: str
    ) -> dict:
        """Get consensus from multiple judges."""
        
        evaluations = []
        for judge_model in self.judges:
            judge = LLMJudge(model=judge_model)
            eval_result = judge.evaluate(response, query)
            evaluations.append(eval_result)
        
        # Calculate consensus scores
        consensus = {
            "accuracy": self._consensus_score(
                [e["accuracy"]["score"] for e in evaluations]
            ),
            "relevance": self._consensus_score(
                [e["relevance"]["score"] for e in evaluations]
            ),
            "overall": np.mean([e["overall"] for e in evaluations]),
            "agreement": self._calculate_agreement(evaluations),
            "individual_evaluations": evaluations
        }
        
        consensus["passed"] = consensus["overall"] >= 4.0
        
        return consensus
    
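    def _consensus_score(self, scores: List[int]) -> dict:
        """Calculate consensus metrics for a dimension."""
        return {
            "mean": np.mean(scores),
            "std": np.std(scores),
            "min": min(scores),
            "max": max(scores),
            "agreement": "high" if np.std(scores) < 0.5 else "low"
        }
    
    def _calculate_agreement(self, evaluations: List[dict]) -> float:
        """Fraction of judges whose pass/fail verdict matches the majority.
        
        One possible definition; the original helper is not shown in this post.
        """
        verdicts = [e["passed"] for e in evaluations]
        majority = sum(verdicts) >= len(verdicts) / 2
        return sum(v == majority for v in verdicts) / len(verdicts)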

Cost Optimization

LLM-as-judge is expensive. Optimize:

class SamplingEvaluator:
    """Evaluate sample of outputs rather than all."""
    
    def __init__(self, sample_rate: float = 0.1):
        self.sample_rate = sample_rate
    
    def should_evaluate(self, request_id: str) -> bool:
        """Deterministic sampling based on request ID."""
        import hashlib
        hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
        return (hash_val % 1000) / 1000 < self.sample_rate
    
    def evaluate_batch(
        self,
        responses: List[dict],
        judge: LLMJudge
    ) -> dict:
        """Evaluate sampled batch."""
        
        to_evaluate = [
            r for r in responses 
            if self.should_evaluate(r["request_id"])
        ]
        
        results = []
        for resp in to_evaluate:
            result = judge.evaluate(
                resp["output"],
                resp["input"]
            )
            results.append(result)
        
        # Extrapolate to full population (guard against an empty sample)
        pass_rate = (
            sum(1 for r in results if r["passed"]) / len(results)
            if results else 0.0
        )
        
        return {
            "sampled_count": len(to_evaluate),
            "total_count": len(responses),
            "pass_rate": pass_rate,
            "confidence_interval": self._calculate_ci(pass_rate, len(to_evaluate)),
            "estimated_passed": int(pass_rate * len(responses))
        }

Cost comparison:

Evaluating 100% of outputs: ~$500/month for 10K requests
Evaluating 10% sample: ~$50/month
90% cost savings with statistically valid confidence intervals
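
The `_calculate_ci` helper is not shown above. One reasonable choice for a sampled pass rate is the Wilson score interval, sketched here with the standard library only. The function name and its `(passes, n)` signature are my assumptions and differ from the `(pass_rate, n)` call in `evaluate_batch`; the counts in the example are made up.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        # No data: the pass rate could be anything
        return (0.0, 1.0)
    p_hat = passes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return (max(0.0, center - half_width), min(1.0, center + half_width))


low, high = wilson_interval(passes=850, n=1000)
print(f"pass rate 85.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

The interval narrows with the square root of the sample size, so a small sample can show the same pass rate as a large one while telling you far less.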

RAG-Specific Evaluation

For retrieval-augmented generation, you need additional metrics:

Faithfulness

Does the generated answer actually use the retrieved context?

def calculate_faithfulness(answer: str, contexts: List[str]) -> float:
    """Check if answer is grounded in retrieved contexts."""
    
    # Extract claims from answer
    claims = extract_claims(answer)  # Use NER or LLM
    
    grounded_claims = 0
    for claim in claims:
        # Check if claim appears in any context
        for context in contexts:
            if semantic_similarity(claim, context) > 0.8:
                grounded_claims += 1
                break
    
    return grounded_claims / len(claims) if claims else 0

def extract_claims(text: str) -> List[str]:
    """Extract factual claims from text."""
    # Simplified: use LLM to extract claims
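    # JSON mode returns an object, so ask for the "claims" key that is read below
    prompt = (
        "Extract all factual claims from this text. "
        'Return a JSON object of the form {"claims": ["claim 1", "claim 2"]}.'
        f"\n\n{text}"
    )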
    response = llm_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["claims"]

Answer Relevance

Is the answer relevant to the question, not just the retrieved context?

def calculate_answer_relevance(
    question: str,
    answer: str,
    embedding_model
) -> float:
    """Calculate semantic similarity between question and answer."""
    
    q_embedding = embedding_model.encode(question)
    a_embedding = embedding_model.encode(answer)
    
    return cosine_similarity(q_embedding, a_embedding)
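
`cosine_similarity` is not defined in the snippet above. A dependency-free sketch is below; in practice you would more likely use NumPy or your embedding library's own helper. Treat question-to-answer similarity as a rough proxy, since a relevant answer does not have to resemble the question closely.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; report no similarity
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```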

Context Retrieval Accuracy

Did we retrieve the right chunks?

def evaluate_retrieval(
    query: str,
    retrieved_chunks: List[str],
    ground_truth_chunks: List[str]  # Labeled dataset
) -> dict:
    """Evaluate retrieval quality."""
    
    # Precision: % of retrieved that are relevant
    relevant_retrieved = sum(
        1 for chunk in retrieved_chunks
        if chunk in ground_truth_chunks
    )
    # Guard against empty inputs to avoid ZeroDivisionError
    precision = relevant_retrieved / len(retrieved_chunks) if retrieved_chunks else 0.0
    
    # Recall: % of relevant that were retrieved
    recall = relevant_retrieved / len(ground_truth_chunks) if ground_truth_chunks else 0.0
    
    # F1 score
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "retrieved_count": len(retrieved_chunks),
        "relevant_count": len(ground_truth_chunks)
    }

Tier 4: A/B Testing in Production

The ultimate evaluation: which version do users prefer?

Shadow Mode

Test new model without affecting users:

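import asyncio

class ShadowModeEvaluator:
    """Run new model in shadow mode."""
    
    def __init__(self, production_model: str, candidate_model: str):
        self.prod_model = production_model
        self.candidate_model = candidate_model
        self.sample_rate = 0.1
        # Keep references so background tasks are not garbage-collected mid-flight
        self._shadow_tasks = set()
    
    async def generate_with_shadow(
        self,
        prompt: str,
        request_id: str
    ) -> dict:
        """Generate with production model, shadow test candidate."""
        
        # Always call production model (return this to user)
        prod_response = await self._call_model(self.prod_model, prompt)
        
        # Sample for shadow evaluation
        if self._should_shadow(request_id):
            # Schedule the candidate call in the background so it
            # does not delay the user-facing response
            task = asyncio.create_task(
                self._run_shadow(request_id, prompt, prod_response)
            )
            self._shadow_tasks.add(task)
            task.add_done_callback(self._shadow_tasks.discard)
        
        return prod_response
    
    async def _run_shadow(self, request_id: str, prompt: str, prod_response: dict) -> None:
        """Call the candidate model and log both responses for comparison."""
        try:
            candidate_response = await asyncio.wait_for(
                self._call_model(self.candidate_model, prompt),
                timeout=30.0
            )
            
            # Log both responses for comparison
            await self._log_comparison(
                request_id,
                prompt,
                prod_response,
                candidate_response
            )
            
        except asyncio.TimeoutError:
            print(f"Shadow call timed out for {request_id}")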

Gradual Rollout

import hashlib

class GradualRollout:
    """Gradually shift traffic to new model."""
    
    def __init__(self):
        self.rollout_percentage = 0  # Start at 0%
    
    def select_model(self, user_id: str) -> str:
        """Route user to model based on rollout percentage."""
        
        # Deterministic routing
        hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        user_bucket = hash_val % 100
        
        if user_bucket < self.rollout_percentage:
            return "new_model"
        else:
            return "production_model"
    
    def evaluate_rollout(self, days: int = 7) -> dict:
        """Evaluate metrics during rollout."""
        
        metrics = {
            "new_model": self._get_metrics("new_model", days),
            "production_model": self._get_metrics("production_model", days)
        }
        
        # Statistical significance test
        from statistics import mean
        from scipy import stats
        
        new_scores = metrics["new_model"]["user_satisfaction"]
        prod_scores = metrics["production_model"]["user_satisfaction"]
        
        t_stat, p_value = stats.ttest_ind(new_scores, prod_scores)
        
        # Compare means of the same samples that went into the t-test
        new_model_better = mean(new_scores) > mean(prod_scores)
        significant = p_value < 0.05
        
        return {
            "new_model_better": new_model_better,
            "statistically_significant": significant,
            "p_value": p_value,
            "recommended_action": "increase_rollout" if (significant and new_model_better) else "hold"
        }

User Feedback Loops

from datetime import datetime, timedelta

class FeedbackCollector:
    """Collect explicit user feedback (thumbs up/down)."""
    
    def record_feedback(
        self,
        request_id: str,
        user_id: str,
        feedback: str,  # "positive" or "negative"
        comment: str = ""
    ):
        """Store user feedback."""
        
        feedback_record = {
            "request_id": request_id,
            "user_id": user_id,
            "feedback": feedback,
            "comment": comment,
            "timestamp": datetime.utcnow(),
            "model_version": self._get_model_for_request(request_id)
        }
        
        # Store in database
        self.db.feedback.insert_one(feedback_record)
        
        # Real-time alert if negative feedback spike
        self._check_feedback_spike()
    
    def calculate_satisfaction_rate(
        self,
        model_version: str,
        days: int = 7
    ) -> float:
        """Calculate satisfaction rate for model version."""
        
        # Materialize the cursor once: a cursor can only be iterated a single
        # time, and Cursor.count() was removed in PyMongo 4
        feedbacks = list(self.db.feedback.find({
            "model_version": model_version,
            "timestamp": {"$gte": datetime.utcnow() - timedelta(days=days)}
        }))
        
        total = len(feedbacks)
        positive = sum(1 for f in feedbacks if f["feedback"] == "positive")
        
        return positive / total if total > 0 else 0
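
`_check_feedback_spike` is called above but not shown. One simple way to implement the idea is a sliding window over recent feedback, sketched below. `FeedbackSpikeDetector` and its default window size, threshold, and minimum sample count are hypothetical values you would tune to your own traffic.

```python
from collections import deque


class FeedbackSpikeDetector:
    """Flag a spike when the negative rate in a sliding window is too high."""

    def __init__(self, window_size: int = 100, threshold: float = 0.3, min_samples: int = 20):
        # deque(maxlen=...) drops the oldest entry automatically
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, feedback: str) -> bool:
        """Record feedback; return True if the negative rate exceeds the threshold."""
        self.window.append(feedback == "negative")
        if len(self.window) < self.min_samples:
            # Too little data to call it a spike
            return False
        negative_rate = sum(self.window) / len(self.window)
        return negative_rate > self.threshold


detector = FeedbackSpikeDetector(window_size=10, threshold=0.3, min_samples=5)
stream = ["positive"] * 6 + ["negative"] * 4
alerts = [detector.record(f) for f in stream]
print(alerts)
```

The minimum sample count keeps one early thumbs-down from paging anyone.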

Building Your Evaluation Framework

Offline Evaluation Dataset

Before deploying, build a labeled dataset:

# examples/evaluation_dataset.jsonl
{
    "id": "eval_001",
    "input": "What are the system requirements?",
    "context": "System Requirements: 8GB RAM, 4 CPU cores, Python 3.9+",
    "expected_output": "You need 8GB RAM, 4 CPU cores, and Python 3.9 or higher.",
    "evaluation_criteria": {
        "min_length": 20,
        "must_contain": ["8GB", "4 CPU", "Python 3.9"]
    }
}
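
A record like this can be scored with a few lines of deterministic code. The sketch below shows one possible shape for an `evaluate_output` helper that applies `min_length` and `must_contain`; the helper body and the `parse_jsonl` loader are my assumptions, not the original implementation. Note that a real `.jsonl` file holds one JSON object per line, not the pretty-printed form shown above.

```python
import json
from typing import Dict, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def evaluate_output(output: str, criteria: Dict) -> bool:
    """Apply the deterministic criteria attached to an evaluation example."""
    if len(output) < criteria.get("min_length", 0):
        return False
    lowered = output.lower()
    # Every required term must appear (case-insensitive substring match)
    return all(term.lower() in lowered for term in criteria.get("must_contain", []))


sample_jsonl = [
    '{"id": "eval_001", "input": "What are the system requirements?", '
    '"evaluation_criteria": {"min_length": 20, '
    '"must_contain": ["8GB", "4 CPU", "Python 3.9"]}}'
]

dataset = parse_jsonl(sample_jsonl)
criteria = dataset[0]["evaluation_criteria"]

good = "You need 8GB RAM, 4 CPU cores, and Python 3.9 or higher."
bad = "You need 16GB RAM."
print(evaluate_output(good, criteria), evaluate_output(bad, criteria))
```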

CI/CD Integration

Block deployment on regression:

# tests/test_llm_regression.py
import pytest

@pytest.mark.evaluation
@pytest.mark.asyncio  # async tests need the pytest-asyncio plugin
async def test_no_regression():
    """Ensure new model doesn't regress on evaluation set."""
    
    # Load evaluation dataset
    eval_dataset = load_eval_dataset()
    
    # Run current model
    current_results = []
    for example in eval_dataset:
        result = await current_model.generate(example["input"])
        current_results.append({
            "example_id": example["id"],
            "output": result,
            "passed": evaluate_output(result, example["evaluation_criteria"])
        })
    
    current_pass_rate = sum(1 for r in current_results if r["passed"]) / len(current_results)
    
    # Compare to baseline
    baseline_pass_rate = 0.85  # From previous run
    
    assert current_pass_rate >= baseline_pass_rate - 0.02, \
        f"Regression detected: {current_pass_rate:.2%} vs baseline {baseline_pass_rate:.2%}"

Monitoring Dashboard

Track these metrics:

# Prometheus metrics
from prometheus_client import Gauge, Histogram
EVALUATION_PASS_RATE = Gauge(
    'llm_eval_pass_rate', 
    'Pass rate by evaluation tier',
    ['tier']
)

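JUDGE_SCORE = Histogram(
    'llm_judge_score',
    'LLM-as-judge scores',
    ['dimension'],
    # Default buckets are tuned for latencies in seconds; judge scores run 1-5
    buckets=[1, 2, 3, 4, 5]
)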

RAG_FAITHFULNESS = Gauge(
    'rag_faithfulness',
    'RAG faithfulness score'
)

USER_SATISFACTION = Gauge(
    'user_satisfaction_rate',
    'User thumbs up rate'
)

What Good Looks Like

Real numbers from production systems:

Metric	Target	Good	Excellent
Deterministic pass rate	> 95%	97%	99%
LLM-as-judge pass rate	> 80%	85%	90%
RAG faithfulness	> 70%	80%	90%
User satisfaction	> 75%	85%	90%
A/B test significance (p-value)	< 0.05	0.01	0.001

Cost per evaluation:

Deterministic: $0.0001 (running code)
LLM-as-judge: $0.005 per evaluation
Human evaluation: $0.50-$2.00 per sample

The "Don't" List

❌ Don't rely solely on ROUGE/BERTScore for generative tasks
❌ Don't use single-judge evaluation (bias is real)
❌ Don't run LLM-as-judge on 100% of traffic (wasteful; sample instead)
❌ Don't skip deterministic checks (they're free!)
❌ Don't ignore user feedback (ground truth)

Production Checklist

Before trusting your evaluation:

Deterministic checks for all structured outputs
LLM-as-judge rubric defined and tested
Sampling strategy implemented (10-20%)
RAG metrics if using retrieval (faithfulness, relevance)
A/B testing framework ready
User feedback collection active
CI/CD evaluation gates configured
Dashboard with pass rates by tier
Alerting on evaluation failure spikes

Next Steps

Evaluation tells you if your system works. But how do you decide which technique to use in the first place? In the final post, I'll share the decision framework I use to choose between prompt engineering, RAG, and fine-tuning—complete with ROI analysis.

Code examples: Evaluation framework GitHub

Questions? Email me or connect on LinkedIn.