LLM Verification, NLI Cross-Encoder, Hallucination Detection, LangGraph, Production AI, Trust Layer

Building a 4-Tier Verification Layer for LLM Outputs in Regulated Domains

Published June 2, 2026 · 8 min read

A client in a regulated industry asked me to build an AI research tool. The catch: every citation in the output had to be real. Not "probably real." Not "looks right." Real — traceable to a specific ruling ID in the knowledge base, verified against the source text, with zero tolerance for fabrication.

In most LLM applications, a hallucinated fact is a bad user experience. In regulated domains — finance, law, healthcare, tax — it is a liability event. A fabricated ruling number in a client-facing document can trigger professional indemnity claims, regulatory action, and loss of practitioner credentials.

I needed a verification layer where the user never sees a fabricated citation. Here is how I built it.

Why Existing Approaches Fall Short

The obvious first attempt is embedding similarity: embed the LLM output and the source chunks, compute cosine similarity, flag anything below a threshold. I tried this. It catches maybe 72% of problems.

The failure mode is subtle. Cosine similarity measures whether two texts are about the same topic, not whether one supports the other. An LLM can generate a claim that is topically related to a source chunk but factually contradicts it — and similarity scoring will call it a match.

Example: The source says "Section 109D applies to loans made after 4 December 1997." The LLM outputs "Section 109D applies to all private company loans." Same topic, high similarity score, completely wrong scope.

The second attempt — running an LLM-as-judge on every claim — works better but costs $0.15+ per query and adds 8-10 seconds of latency. For a tool handling hundreds of queries per day, this is unsustainable.

I wrote about evaluation metrics in Evaluating Generative AI in Production. That post covers measurement — how to know if your system is accurate. This post is about enforcement — how to structurally prevent bad output from reaching the user.

The 4-Tier Architecture

The insight that unlocked the design: different verification techniques have different cost-accuracy profiles. Cheap methods handle the easy cases; expensive methods handle only the hard cases. Stack them.

[Figure: 4-Tier Verification Pipeline]

Tier 1: Claim Extraction (Claude)

Before you can verify claims, you need to identify them. Raw LLM output is prose — paragraphs mixing analysis, citations, qualifications, and conclusions. Verification requires atomic claims.

Claude decomposes the output into individually verifiable statements:

import json

from anthropic import AsyncAnthropic

# Async client; reads ANTHROPIC_API_KEY from the environment.
claude = AsyncAnthropic()

EXTRACTION_PROMPT = """Extract every factual claim from the text below.
For each claim, return:
- claim_text: the atomic statement
- claim_type: quantitative | explicit | conditional
- cited_source: any ruling/section ID referenced

Return JSON array. Do not add claims not present in the text."""

async def extract_claims(output_text: str) -> list[dict]:
    response = await claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{"role": "user", "content": f"{EXTRACTION_PROMPT}\n\n{output_text}"}],
    )
    # json.loads raises if the reply is not bare JSON; treat that as a failed extraction.
    return json.loads(response.content[0].text)

This step costs ~$0.03 and takes ~2 seconds. A typical skill output produces 8-15 atomic claims.
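Because everything downstream trusts the extracted claims, it is worth validating the reply before it enters the pipeline. The following is a minimal sketch, not the production parser: `parse_claims` and `ALLOWED_TYPES` are hypothetical names, and the assumption that the model may wrap its JSON in a Markdown fence is mine.

```python
import json

ALLOWED_TYPES = {"quantitative", "explicit", "conditional"}
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_claims(raw: str) -> list[dict]:
    """Parse the extraction reply into validated claim dicts, or raise ValueError."""
    text = raw.strip()
    # Peel off a surrounding Markdown fence if the model added one.
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of claims")

    claims = []
    for item in data:
        claim_text = item.get("claim_text") if isinstance(item, dict) else None
        if not isinstance(claim_text, str) or not claim_text.strip():
            raise ValueError(f"claim is missing claim_text: {item!r}")
        if item.get("claim_type") not in ALLOWED_TYPES:
            raise ValueError(f"unknown claim_type: {item.get('claim_type')!r}")
        claims.append({
            "claim_text": claim_text.strip(),
            "claim_type": item["claim_type"],
            # Normalise empty strings to None so later checks can test truthiness.
            "cited_source": item.get("cited_source") or None,
        })
    return claims
```

Raising on a malformed reply, rather than silently dropping the claim, keeps the failure visible: a claim that never gets extracted can never be verified.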

Tier 2: Parallel NLI Cross-Encoder (Local GPU)

This is the core of the pipeline. Instead of embedding similarity, I use a Natural Language Inference (NLI) cross-encoder — a model explicitly trained to classify whether a hypothesis is entailed by, contradicted by, or neutral to a premise.

The model is cross-encoder/nli-deberta-v3-large. It runs locally on a GPU. Zero API cost per call.

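from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-large", device="cuda")

# Label order per the model card: [contradiction, entailment, neutral].
# Confirm against the model config's id2label before relying on these indices.
CONTRADICTION, ENTAILMENT = 0, 1

def score_claims(claims: list[dict], chunks: list[str]) -> list[dict]:
    # Build every (premise, hypothesis) pair: each source chunk against each claim.
    pairs = []
    claim_indices = []
    for i, claim in enumerate(claims):
        for chunk in chunks:
            pairs.append((chunk, claim["claim_text"]))
            claim_indices.append(i)

    # apply_softmax=True turns raw logits into probabilities, which the
    # 0.30 / 0.85 thresholds in the next tier assume.
    scores = nli_model.predict(pairs, batch_size=32, apply_softmax=True)
    # scores shape: (n_pairs, 3)

    # Keep, per claim, the best-supporting chunk and its contradiction score.
    results = [{"max_entailment": 0.0, "contradiction": 0.0} for _ in claims]
    for idx, score in zip(claim_indices, scores):
        entailment = float(score[ENTAILMENT])
        if entailment > results[idx]["max_entailment"]:
            results[idx]["max_entailment"] = entailment
            results[idx]["contradiction"] = float(score[CONTRADICTION])

    return results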

Why NLI cross-encoder over embedding similarity:

Approach	What it measures	Accuracy on entailment
Cosine similarity	Topical relatedness	~72%
NLI cross-encoder	Logical entailment/contradiction	~92%

The cross-encoder explicitly classifies contradictions. Embedding similarity cannot distinguish "supports" from "contradicts" when the topic is the same. This is the difference between a verification layer that catches 72% of problems and one that catches 92%.

Batch inference on a GPU processes 12 claims against 10 chunks (~120 pairs) in ~0.5 seconds. The entire tier costs $0.00.
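One detail worth checking before thresholding these scores: a cross-encoder can return raw logits rather than probabilities, and the position of the entailment class differs between NLI models. The sketch below shows the conversion and a lookup by label name instead of by hard-coded index. It is illustrative only: `softmax` and `entailment_probability` are hypothetical helpers, and `id2label` stands in for the label mapping published in the model's Hugging Face config.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entailment_probability(logits: list[float], id2label: dict[int, str]) -> float:
    """Return P(entailment), locating the class by name rather than position."""
    probs = softmax(logits)
    for index, label in id2label.items():
        if label.lower() == "entailment":
            return probs[index]
    raise KeyError("label mapping has no 'entailment' class")


# Example mapping; read the real one from the model config rather than assuming it.
example_labels = {0: "contradiction", 1: "entailment", 2: "neutral"}
p = entailment_probability([0.0, 2.0, 0.0], example_labels)
```

Looking the class up by name means a model swap that reorders the labels fails loudly or keeps working, instead of quietly scoring the wrong class.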

Tier 3: Claude Judge (Borderline Cases Only)

Claims score in three bands:

> 0.85 entailment: Supported. Move on.
< 0.30 entailment: Unsupported. Flag it.
0.30 – 0.85: Borderline. These need human-level reasoning.

Only borderline claims go to the Claude judge. In practice, this is 15-25% of claims — the rest are resolved by the NLI model alone.
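The banding above can be sketched as a small routing step. This is a minimal illustration using the thresholds stated above; the function names and the shape of the returned dict are my assumptions, and scores that land exactly on a threshold are treated as borderline.

```python
SUPPORTED_MIN = 0.85    # strictly above this: supported
UNSUPPORTED_MAX = 0.30  # strictly below this: unsupported


def band(entailment: float) -> str:
    """Map an entailment probability to one of the three bands."""
    if entailment > SUPPORTED_MIN:
        return "supported"
    if entailment < UNSUPPORTED_MAX:
        return "unsupported"
    return "borderline"


def route(claims: list[dict], results: list[dict]) -> dict[str, list[dict]]:
    """Group claims by band; only the 'borderline' group goes to the judge."""
    if len(claims) != len(results):
        raise ValueError("claims and results must be the same length")
    routed = {"supported": [], "unsupported": [], "borderline": []}
    for claim, result in zip(claims, results):
        routed[band(result["max_entailment"])].append({**claim, **result})
    return routed
```

Sending exact-threshold scores to the judge errs on the side of more scrutiny, which seems the safer default in a regulated setting.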

JUDGE_PROMPT = """You are a verification judge. For each claim below,
determine if the source evidence supports it.

Respond with: SUPPORTED, PARTIAL, or UNSUPPORTED.
Include one sentence of reasoning.

Claims: {claims}
Evidence: {evidence}"""

async def judge_borderline(borderline_claims: list[dict], evidence: list[str]) -> list[dict]:
    response = await claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            claims=json.dumps(borderline_claims),
            evidence="\n".join(evidence),
        )}],
    )
    return parse_judge_response(response.content[0].text)

Batching all borderline claims into a single LLM call keeps this at ~$0.02 and ~3 seconds.
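The response parser is not shown in this post. As a rough sketch of what one could look like, the version below assumes the judge is asked to answer one numbered line per claim, such as "1. SUPPORTED: reasoning". That line format, the name `parse_judge_lines`, and the extra `n_claims` argument are all my assumptions, not the production implementation.

```python
import re

# Matches lines like "2) UNSUPPORTED - scope is broader than the source".
VERDICT_LINE = re.compile(
    r"^\s*(\d+)[.):]\s*(SUPPORTED|PARTIAL|UNSUPPORTED)\b[\s:.-]*(.*)$"
)


def parse_judge_lines(text: str, n_claims: int) -> list[dict]:
    """Return one verdict per claim, defaulting to UNSUPPORTED when none is found."""
    # Fail closed: a claim the judge skipped is treated as unsupported.
    verdicts = [
        {"verdict": "UNSUPPORTED", "reasoning": "no verdict returned"}
        for _ in range(n_claims)
    ]
    for line in text.splitlines():
        match = VERDICT_LINE.match(line)
        if not match:
            continue
        index = int(match.group(1)) - 1  # the judge numbers claims from 1
        if 0 <= index < n_claims:
            verdicts[index] = {
                "verdict": match.group(2),
                "reasoning": match.group(3).strip(),
            }
    return verdicts
```

The default matters more than the regex: with batched claims, a truncated or reordered reply should never turn a missing verdict into an implicit pass.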

Tier 4: Hard-Fail Gate

The final tier is binary. If any claim references a citation ID that does not exist in the knowledge base, or if the verification map contains any "Fabricated" classification, the response is blocked.

This is not a soft warning. It is an HTTP 422 with an incident log entry. The user sees "Verification failed" and the system logs the full state — which claims failed, which sources were checked, what the NLI scores were.

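def apply_gate(verification_map: list[dict], kb_ruling_ids: set[str]) -> bool:
    """Return True only if every cited ID exists and no claim is marked fabricated."""
    for claim in verification_map:
        cited = claim.get("cited_source")
        if cited and cited not in kb_ruling_ids:
            return False  # fabricated citation ID
        # Case-insensitive, so "Fabricated" and "fabricated" both trip the gate.
        if str(claim.get("status", "")).lower() == "fabricated":
            return False
    return True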

The hard-fail gate is the structural guarantee. Every other tier can have edge cases and accuracy gaps. This tier is deterministic: if the ruling ID is not in the knowledge base, the response does not ship. Period.
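When the gate fires, the incident log needs to say which claims failed and why. The sketch below shows one possible shape for that record. The field names, the `build_incident` helper, and the placeholder ruling IDs are hypothetical; the production log schema is not described in this post.

```python
import json
from datetime import datetime, timezone


def build_incident(verification_map: list[dict], kb_ruling_ids: set[str]) -> dict:
    """Collect every gate failure into a single JSON-serialisable record."""
    failures = []
    for claim in verification_map:
        cited = claim.get("cited_source")
        if cited and cited not in kb_ruling_ids:
            reason = "unknown_citation_id"
        elif str(claim.get("status", "")).lower() == "fabricated":
            reason = "fabricated_status"
        else:
            continue  # this claim passed the gate
        failures.append({
            "claim_text": claim.get("claim_text"),
            "cited_source": cited,
            "reason": reason,
            "max_entailment": claim.get("max_entailment"),
        })
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "http_status": 422,
        "claims_checked": len(verification_map),
        "failed_claims": failures,
    }


# Placeholder IDs for illustration only.
incident = build_incident(
    [
        {"claim_text": "Claim A", "cited_source": "RULING-001", "status": "supported"},
        {"claim_text": "Claim B", "cited_source": "RULING-999", "max_entailment": 0.12},
    ],
    kb_ruling_ids={"RULING-001"},
)
print(json.dumps(incident["failed_claims"], indent=2))
```

Unlike the gate itself, which can return on the first failure, the log walks every claim so that one incident captures all failures at once.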

Integrating with LangGraph

The verification layer is not a utility function called from business logic. It is a mandatory node in the LangGraph state machine. There is no code path that bypasses it.

from langgraph.graph import StateGraph, START, END

# AgentState and the node functions are defined elsewhere in the application.
graph = StateGraph(AgentState)

graph.add_node("parse", parse_query)
graph.add_node("execute_skills", execute_skills)
graph.add_node("verify", run_verification_pipeline)
graph.add_node("compose", compose_result)
graph.add_node("hard_fail", log_incident_and_block)

# The graph needs an explicit entry point before it can compile.
graph.add_edge(START, "parse")
graph.add_edge("parse", "execute_skills")
graph.add_edge("execute_skills", "verify")

# check_gate returns "pass" or "fail"; there is no third route.
graph.add_conditional_edges("verify", check_gate, {
    "pass": "compose",
    "fail": "hard_fail",
})

graph.add_edge("compose", END)
graph.add_edge("hard_fail", END)

Three properties make this trustworthy:

Verification is a node, not a middleware. It has its own state, its own error handling, and its own audit trail.
No edge connects execute_skills to compose directly. The graph structure makes bypassing verification impossible, not just unlikely.
Hard-fail is a first-class node. Incident logging happens in a dedicated state with its own persistence, not in a try/catch buried in business logic.

The state object carries the complete audit trail — retrieved chunks, LLM prompt, model version, NLI scores, judge reasoning, gate decision. Every query is fully reproducible.
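The state type and the routing function are not shown in the graph definition above. A minimal sketch of both follows, assuming the verification node writes a boolean `gate_passed` into the state. That field, and the other field names in this `AgentState`, are my guesses at a plausible shape rather than the production schema.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    """Audit-trail state carried through the graph (illustrative fields only)."""
    query: str
    retrieved_chunks: list[str]
    model_version: str
    verification_map: list[dict]  # claims with NLI scores and judge reasoning
    gate_passed: bool             # written by the verification node


def check_gate(state: AgentState) -> str:
    """Route to 'pass' only on an explicit True; anything else fails closed."""
    return "pass" if state.get("gate_passed") is True else "fail"
```

The router's return value must be one of the keys in the conditional-edge mapping. Treating a missing or non-boolean `gate_passed` as a failure means a verification node that crashes partway cannot route to `compose` by accident.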

The Economics

The tiered approach is dramatically cheaper than running full LLM verification on every claim:

Component	Accuracy	Cost / query	Latency
Claude claim extraction	~95%	$0.03	~2s
Local NLI cross-encoder	~92% NLI	$0.00	~0.5s
Claude judge (borderline)	~96%	$0.02	~3s
Total	~94% e2e	$0.05	~5.5s

Running Claude as judge on every claim (no NLI pre-filter) would cost $0.15+ per query. The NLI tier eliminates 75-85% of claims from the expensive path, saving $0.10 per query at scale.

At 1,000 queries/day: $50/day with the pipeline vs $150/day without. Over a month, the NLI model pays for its GPU instance many times over.
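The arithmetic behind those figures, worked through as a short script. Per-query costs come from the table above, expressed in whole cents to avoid floating-point rounding; the 30-day month is my assumption, and GPU instance cost is left out because the post does not state it.

```python
# Per-query costs in cents, taken from the table above.
EXTRACTION_CENTS = 3
NLI_CENTS = 0
JUDGE_BORDERLINE_CENTS = 2
JUDGE_EVERYTHING_CENTS = 15  # lower bound of the "$0.15+" figure

QUERIES_PER_DAY = 1000
DAYS_PER_MONTH = 30  # assumption, for a rough monthly figure


def daily_dollars(cents_per_query: int, queries_per_day: int) -> float:
    """Daily spend in dollars for a given per-query cost in cents."""
    return cents_per_query * queries_per_day / 100


tiered = daily_dollars(
    EXTRACTION_CENTS + NLI_CENTS + JUDGE_BORDERLINE_CENTS, QUERIES_PER_DAY
)
judge_all = daily_dollars(JUDGE_EVERYTHING_CENTS, QUERIES_PER_DAY)
monthly_saving = (judge_all - tiered) * DAYS_PER_MONTH

print(tiered, judge_all, monthly_saving)  # 50.0 150.0 3000.0
```

The monthly saving is the number to compare against whatever the GPU instance costs in your environment.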

The local NLI model also reduces API dependency. If the Claude API has a latency spike, the NLI tier still runs independently on local hardware; only claim extraction and borderline judging wait on the API.

Production Checklist

If you are building verification for a regulated-domain LLM system:

Start with the gate, not the model. The hard-fail on fabricated citation IDs is the cheapest, most reliable check. Implement it first.
Use NLI, not similarity. If your system makes factual claims against source documents, a cross-encoder trained for entailment is the right tool. Embedding similarity is for search, not verification.
Batch the expensive tier. Sending borderline claims to the LLM judge one at a time multiplies cost and latency. Batch them.
Make verification structural. If it is possible to bypass verification via a code path, someone will. Use a state machine where the graph topology enforces the flow.
Log everything. The verification map — claims, scores, sources, gate decision — is your audit trail. In regulated domains, reproducibility is not optional.

The verification layer described here runs in production on a platform serving regulated professionals. The hard-fail gate fired exactly twice in its first month, both times on edge cases where the LLM referenced a superseded ruling that had been removed from the knowledge base. Both were caught before reaching any user.

Next in this series: Self-Hosted Mem0 for Persistent Agent Memory — how to store verified research findings so your agent builds institutional knowledge over time, instead of re-researching the same topics every session.