Building a 4-Tier Verification Layer for LLM Outputs in Regulated Domains
A client in a regulated industry asked me to build an AI research tool. The catch: every citation in the output had to be real. Not "probably real." Not "looks right." Real — traceable to a specific ruling ID in the knowledge base, verified against the source text, with zero tolerance for fabrication.
In most LLM applications, a hallucinated fact is a bad user experience. In regulated domains — finance, law, healthcare, tax — it is a liability event. A fabricated ruling number in a client-facing document can trigger professional indemnity claims, regulatory action, and loss of practitioner credentials.
I needed a verification layer where the user never sees a fabricated citation. Here is how I built it.
Why Existing Approaches Fall Short
The obvious first attempt is embedding similarity: embed the LLM output and the source chunks, compute cosine similarity, flag anything below a threshold. I tried this. It catches maybe 72% of problems.
The failure mode is subtle. Cosine similarity measures whether two texts are about the same topic, not whether one supports the other. An LLM can generate a claim that is topically related to a source chunk but factually contradicts it — and similarity scoring will call it a match.
Example: The source says "Section 109D applies to loans made after 4 December 1997." The LLM outputs "Section 109D applies to all private company loans." Same topic, high similarity score, completely wrong scope.
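To make the failure mode concrete, here is a toy sketch that scores the two sentences above. It uses bag-of-words cosine as a stand-in for a real embedding model (an assumption for illustration; `bow_cosine` and the `unrelated` sentence are hypothetical, not part of the pipeline), but the point carries over: the score rewards shared topic and vocabulary and knows nothing about scope.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over raw token counts (toy stand-in for embeddings)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    # Counter returns 0 for tokens missing from vb, so this is a sparse dot product.
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


source = "Section 109D applies to loans made after 4 December 1997."
claim = "Section 109D applies to all private company loans."
unrelated = "The meeting was moved to Thursday afternoon."

score_claim = bow_cosine(source, claim)          # wrong scope, same topic
score_unrelated = bow_cosine(source, unrelated)  # different topic entirely
```

The over-broad claim scores far above the unrelated sentence even though it misstates the date restriction. A threshold on this kind of score separates "same topic" from "different topic", which is not the question a verifier needs answered.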
The second attempt — running an LLM-as-judge on every claim — works better but costs $0.15+ per query and adds 8-10 seconds of latency. For a tool handling hundreds of queries per day, this is unsustainable.
I wrote about evaluation metrics in Evaluating Generative AI in Production. That post covers measurement — how to know if your system is accurate. This post is about enforcement — how to structurally prevent bad output from reaching the user.
The 4-Tier Architecture
The insight that unlocked the design: different verification techniques have different cost-accuracy profiles. Cheap methods handle the easy cases; expensive methods handle only the hard cases. Stack them.
Tier 1: Claim Extraction (Claude)
Before you can verify claims, you need to identify them. Raw LLM output is prose — paragraphs mixing analysis, citations, qualifications, and conclusions. Verification requires atomic claims.
Claude decomposes the output into individually verifiable statements:
```python
import json

from anthropic import AsyncAnthropic

# The client reads ANTHROPIC_API_KEY from the environment.
claude = AsyncAnthropic()

EXTRACTION_PROMPT = """Extract every factual claim from the text below.
For each claim, return:
- claim_text: the atomic statement
- claim_type: quantitative | explicit | conditional
- cited_source: any ruling/section ID referenced
Return JSON array. Do not add claims not present in the text."""


async def extract_claims(output_text: str) -> list[dict]:
    """Decompose prose output into atomic, individually verifiable claims."""
    response = await claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{"role": "user", "content": f"{EXTRACTION_PROMPT}\n\n{output_text}"}],
    )
    # Assumes the reply is bare JSON with no surrounding prose.
    return json.loads(response.content[0].text)
```
This step costs ~$0.03 and takes ~2 seconds. A typical skill output produces 8-15 atomic claims.
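Calling `json.loads` directly on model output is fragile, because a reply wrapped in a markdown fence or missing a field will either raise or pass bad data downstream. The sketch below is one way to harden that step. `parse_claims` is a hypothetical helper, and its choices (the required keys, defaulting `cited_source` to `None`) are assumptions rather than the production behaviour.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker
REQUIRED_KEYS = {"claim_text", "claim_type"}
ALLOWED_TYPES = {"quantitative", "explicit", "conditional"}


def parse_claims(raw: str) -> list[dict]:
    """Parse and validate the extractor's JSON reply (hypothetical helper)."""
    text = raw.strip()
    # Models sometimes wrap JSON in a markdown fence; strip it if present.
    if text.startswith(FENCE):
        text = text.strip("`").removeprefix("json").strip()
    claims = json.loads(text)
    if not isinstance(claims, list):
        raise ValueError("expected a JSON array of claims")
    for claim in claims:
        if not isinstance(claim, dict):
            raise ValueError("each claim must be a JSON object")
        missing = REQUIRED_KEYS - claim.keys()
        if missing:
            raise ValueError(f"claim missing keys: {sorted(missing)}")
        if claim["claim_type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown claim_type: {claim['claim_type']!r}")
        # Assumption: a claim without a citation is allowed and recorded as None.
        claim.setdefault("cited_source", None)
    return claims
```

Raising on malformed output keeps the pipeline fail-closed: a response whose claims cannot be parsed never reaches the scoring tier as an empty, trivially "verified" list.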
Tier 2: Parallel NLI Cross-Encoder (Local GPU)
This is the core of the pipeline. Instead of embedding similarity, I use a Natural Language Inference (NLI) cross-encoder — a model explicitly trained to classify whether a hypothesis is entailed by, contradicted by, or neutral to a premise.
The model is cross-encoder/nli-deberta-v3-large. It runs locally on a GPU. Zero API cost per call.
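```python
from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-large", device="cuda")

# Label order per the model card: [contradiction, entailment, neutral].
# Confirm against nli_model.model.config.id2label before relying on it.
CONTRADICTION, ENTAILMENT = 0, 1


def score_claims(claims: list[dict], chunks: list[str]) -> list[dict]:
    """Score every (chunk, claim) pair and keep each claim's best-entailing chunk."""
    pairs = []
    claim_indices = []
    for i, claim in enumerate(claims):
        for chunk in chunks:
            pairs.append((chunk, claim["claim_text"]))  # (premise, hypothesis)
            claim_indices.append(i)

    # predict() returns raw logits by default; apply_softmax=True converts them
    # to probabilities so the 0-1 thresholds are meaningful.
    # scores shape: (n_pairs, 3)
    scores = nli_model.predict(pairs, batch_size=32, apply_softmax=True)

    results = [{"max_entailment": 0.0, "contradiction": 0.0} for _ in claims]
    for idx, score in zip(claim_indices, scores):
        entailment = float(score[ENTAILMENT])
        if entailment > results[idx]["max_entailment"]:
            results[idx]["max_entailment"] = entailment
            results[idx]["contradiction"] = float(score[CONTRADICTION])
    return results
```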
Why NLI cross-encoder over embedding similarity:
| Approach | What it measures | Accuracy on entailment |
|---|---|---|
| Cosine similarity | Topical relatedness | ~72% |
| NLI cross-encoder | Logical entailment/contradiction | ~92% |
The cross-encoder explicitly classifies contradictions. Embedding similarity cannot distinguish "supports" from "contradicts" when the topic is the same. This is the difference between a verification layer that catches 72% of problems and one that catches 92%.
Batch inference on a GPU processes 12 claims against 10 chunks (~120 pairs) in ~0.5 seconds. The entire tier costs $0.00.
Tier 3: Claude Judge (Borderline Cases Only)
Claims score in three bands:
- > 0.85 entailment: Supported. Move on.
- < 0.30 entailment: Unsupported. Flag it.
- 0.30 – 0.85: Borderline. These need human-level reasoning.
Only borderline claims go to the Claude judge. In practice, this is 15-25% of claims — the rest are resolved by the NLI model alone.
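The banding above can be sketched as a small routing function. The thresholds come from the list above; `route_claims` and the shape of its return value are hypothetical, and a score that lands exactly on a threshold is treated as borderline.

```python
SUPPORTED_MIN = 0.85    # strictly above this: supported
UNSUPPORTED_MAX = 0.30  # strictly below this: unsupported


def route_claims(claims: list[dict], nli_results: list[dict]) -> dict[str, list[dict]]:
    """Split claims into bands by their best entailment score."""
    routed: dict[str, list[dict]] = {
        "supported": [],
        "unsupported": [],
        "borderline": [],
    }
    for claim, result in zip(claims, nli_results):
        score = result["max_entailment"]
        if score > SUPPORTED_MIN:
            band = "supported"
        elif score < UNSUPPORTED_MAX:
            band = "unsupported"
        else:
            band = "borderline"  # includes scores exactly on a threshold
        # Merge the claim with its scores so the audit trail stays in one record.
        routed[band].append({**claim, **result, "status": band})
    return routed
```

Only `routed["borderline"]` is passed to the judge; the other two bands are final as far as the NLI tier is concerned.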
```python
JUDGE_PROMPT = """You are a verification judge. For each claim below, determine
if the source evidence supports it.
Respond with: SUPPORTED, PARTIAL, or UNSUPPORTED.
Include one sentence of reasoning.

Claims: {claims}

Evidence: {evidence}"""


async def judge_borderline(borderline_claims: list[dict], evidence: list[str]) -> list[dict]:
    """Send all borderline claims to the judge in a single batched call."""
    response = await claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            claims=json.dumps(borderline_claims),
            evidence="\n".join(evidence),
        )}],
    )
    # parse_judge_response turns the judge's text reply into per-claim verdicts.
    return parse_judge_response(response.content[0].text)
```
Batching all borderline claims into a single LLM call keeps this at ~$0.02 and ~3 seconds.
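The judge code calls `parse_judge_response` without showing it. The following is a minimal sketch of what such a parser could look like, not the production implementation. It assumes the judge answers with one numbered line per claim, such as `1. SUPPORTED: reasoning`, which the prompt as written does not strictly enforce.

```python
import re

# UNSUPPORTED is listed first so the intent is obvious; the word boundary
# already prevents SUPPORTED from matching inside it.
LINE_RE = re.compile(
    r"^\s*(\d+)[.)]\s*(UNSUPPORTED|SUPPORTED|PARTIAL)\b[\s:\-]*(.*)$"
)


def parse_judge_response(text: str) -> list[dict]:
    """Extract (claim index, verdict, reasoning) from numbered judge lines."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip preamble or anything that is not a verdict line
        index, verdict, reasoning = match.groups()
        results.append(
            {
                "claim_index": int(index),
                "verdict": verdict,
                "reasoning": reasoning.strip(),
            }
        )
    return results
```

A caller would want to treat any borderline claim with no parsed verdict as unsupported, so that a malformed judge reply fails closed rather than passing silently.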
Tier 4: Hard-Fail Gate
The final tier is binary. If any claim references a citation ID that does not exist in the knowledge base, or if the verification map contains any "Fabricated" classification, the response is blocked.
This is not a soft warning. It is an HTTP 422 with an incident log entry. The user sees "Verification failed" and the system logs the full state — which claims failed, which sources were checked, what the NLI scores were.
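```python
def apply_gate(verification_map: list[dict], kb_ruling_ids: set[str]) -> bool:
    """Return True only if every cited ID exists and nothing is marked fabricated."""
    for claim in verification_map:
        cited = claim.get("cited_source")
        if cited and cited not in kb_ruling_ids:
            return False  # fabricated citation ID
        # .get() avoids a KeyError on claims that never received a status.
        if claim.get("status") == "fabricated":
            return False
    return True
```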
The hard-fail gate is the structural guarantee. Every other tier can have edge cases and accuracy gaps. This tier is deterministic: if the ruling ID is not in the knowledge base, the response does not ship. Period.
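When the gate fails, the incident log needs to say why, not just that it failed. Here is a rough sketch of assembling that record together with the 422 response. The function name, field names, and reason codes are all hypothetical; only the status code and the user-facing message come from the description above.

```python
from datetime import datetime, timezone


def build_incident(verification_map: list[dict], kb_ruling_ids: set[str]) -> dict:
    """Collect every failing claim and the reason it failed (illustrative only)."""
    failures = []
    for claim in verification_map:
        cited = claim.get("cited_source")
        if cited and cited not in kb_ruling_ids:
            failures.append({**claim, "reason": "unknown_citation_id"})
        elif claim.get("status") == "fabricated":
            failures.append({**claim, "reason": "fabricated_status"})
    return {
        "http_status": 422,
        "user_message": "Verification failed",
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "failed_claims": failures,
    }
```

Unlike the boolean gate, this walks the whole map instead of returning on the first failure, so the log captures every offending claim in a single pass.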
Integrating with LangGraph
The verification layer is not a utility function called from business logic. It is a mandatory node in the LangGraph state machine. There is no code path that bypasses it.
```python
from langgraph.graph import StateGraph, END

graph = StateGraph(AgentState)

graph.add_node("parse", parse_query)
graph.add_node("execute_skills", execute_skills)
graph.add_node("verify", run_verification_pipeline)
graph.add_node("compose", compose_result)
graph.add_node("hard_fail", log_incident_and_block)

# The graph needs an explicit entry point before it can be compiled.
graph.set_entry_point("parse")
graph.add_edge("parse", "execute_skills")
graph.add_edge("execute_skills", "verify")

# check_gate returns "pass" or "fail"; the mapping picks the next node.
graph.add_conditional_edges("verify", check_gate, {
    "pass": "compose",
    "fail": "hard_fail",
})

graph.add_edge("compose", END)
graph.add_edge("hard_fail", END)
```
Three properties make this trustworthy:
- Verification is a node, not a middleware. It has its own state, its own error handling, and its own audit trail.
- No edge connects `execute_skills` to `compose` directly. The graph structure makes bypassing verification impossible, not just unlikely.
- Hard-fail is a first-class node. Incident logging happens in a dedicated state with its own persistence, not in a `try/catch` buried in business logic.
The state object carries the complete audit trail — retrieved chunks, LLM prompt, model version, NLI scores, judge reasoning, gate decision. Every query is fully reproducible.
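The graph references `AgentState` and `check_gate` without defining them. A minimal sketch of both follows, under stated assumptions: the field names are hypothetical, and the router fails closed when the verify node has not written a verification map at all.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    """Illustrative state shape; the real object carries more audit fields."""
    query: str
    retrieved_chunks: list[str]
    verification_map: list[dict]
    kb_ruling_ids: set[str]


def check_gate(state: AgentState) -> str:
    """Router for the conditional edge: returns "pass" or "fail"."""
    verification_map = state.get("verification_map")
    if verification_map is None:
        return "fail"  # verification never ran, so fail closed
    kb_ruling_ids = state.get("kb_ruling_ids", set())
    for claim in verification_map:
        cited = claim.get("cited_source")
        if cited and cited not in kb_ruling_ids:
            return "fail"
        if claim.get("status") == "fabricated":
            return "fail"
    return "pass"
```

Because the router only returns the two keys that appear in the edge mapping, there is no third outcome that could route a response around the gate.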
The Economics
The tiered approach is dramatically cheaper than running full LLM verification on every claim:
| Component | Accuracy | Cost / query | Latency |
|---|---|---|---|
| Claude claim extraction | ~95% | $0.03 | ~2s |
| Local NLI cross-encoder | ~92% NLI | $0.00 | ~0.5s |
| Claude judge (borderline) | ~96% | $0.02 | ~3s |
| Total | ~94% e2e | $0.05 | ~5.5s |
Running Claude as judge on every claim (no NLI pre-filter) would cost $0.15+ per query. The NLI tier eliminates 75-85% of claims from the expensive path, saving $0.10 per query at scale.
At 1,000 queries/day: $50/day with the pipeline vs $150/day without. Over a month, the NLI model pays for its GPU instance many times over.
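The arithmetic behind those figures, worked in integer cents to avoid floating-point drift. The per-query costs are the ones from the table; the 30-day month is an assumption added for the monthly figure, and since the judge-everything cost is quoted as "$0.15+", the saving is a lower bound.

```python
# Per-query costs in cents, taken from the table above.
CENTS_EXTRACTION = 3
CENTS_NLI = 0
CENTS_JUDGE_BORDERLINE = 2
CENTS_JUDGE_EVERY_CLAIM = 15  # quoted as "$0.15+", so a lower bound

QUERIES_PER_DAY = 1_000
DAYS_PER_MONTH = 30  # assumption for the monthly estimate

pipeline_cents = CENTS_EXTRACTION + CENTS_NLI + CENTS_JUDGE_BORDERLINE

daily_pipeline_usd = pipeline_cents * QUERIES_PER_DAY / 100
daily_baseline_usd = CENTS_JUDGE_EVERY_CLAIM * QUERIES_PER_DAY / 100
monthly_saving_usd = (daily_baseline_usd - daily_pipeline_usd) * DAYS_PER_MONTH

print(f"pipeline: ${daily_pipeline_usd:.2f}/day")
print(f"baseline: ${daily_baseline_usd:.2f}/day")
print(f"saving:   ${monthly_saving_usd:.2f}/month")
```

Under these assumptions the saving is at least $3,000 per month, which is the budget the GPU instance has to fit inside for the local NLI tier to pay for itself.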
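The local NLI model also removes an API dependency from the scoring step. Claim extraction and the borderline judge still call Claude, but if the Claude API has a latency spike, claims that have already been extracted can still be scored, because the NLI tier runs independently on local hardware.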
Production Checklist
If you are building verification for a regulated-domain LLM system:
- Start with the gate, not the model. The hard-fail on fabricated citation IDs is the cheapest, most reliable check. Implement it first.
- Use NLI, not similarity. If your system makes factual claims against source documents, a cross-encoder trained for entailment is the right tool. Embedding similarity is for search, not verification.
- Batch the expensive tier. Sending borderline claims to the LLM judge one at a time multiplies cost and latency. Batch them.
- Make verification structural. If it is possible to bypass verification via a code path, someone will. Use a state machine where the graph topology enforces the flow.
- Log everything. The verification map — claims, scores, sources, gate decision — is your audit trail. In regulated domains, reproducibility is not optional.
The verification layer I described here runs in production on a platform serving regulated professionals. The hard-fail gate fired exactly twice in the first month, both times on edge cases where the LLM referenced a superseded ruling that had been removed from the knowledge base. Both were caught before reaching any user.
Next in this series: Self-Hosted Mem0 for Persistent Agent Memory — how to store verified research findings so your agent builds institutional knowledge over time, instead of re-researching the same topics every session.