
# Hallucination Detection Benchmark

Measures the rate at which AI models fabricate facts: claims asserted with confidence but actively contradicted by retrieved evidence. Uses the production HalluHard NLI pipeline.

## What It Measures

The Hallucination Detection benchmark focuses on the worst category of AI error: confident fabrication. It does not target missing citations (which Claim Verification handles) but claims that are provably wrong, where retrieved evidence actively contradicts what the model asserted.

Hallucinations are particularly dangerous in research contexts because they read like facts. A model that says "the FDA approved drug X in 2021" when it was never approved, or "company Y reported $4.2B in revenue" when the actual figure was $1.8B, creates errors that propagate through reports and decisions. This benchmark measures how often this happens.

**Contradiction Detection.** Claims are classified as hallucinations only when retrieved evidence actively contradicts them, not merely when evidence is absent.

**Real Evidence Retrieval.** Uses the same production NLI pipeline as Claim Verification: real URL fetching and web search, not pattern matching.

**Confidence-Weighted.** High-confidence language in contradicted claims is treated as a separate signal; False Confidence is the most severe form of hallucination.

## Difference from Claim Verification

Hallucination Detection and Claim Verification use the same evaluation pipeline but ask different questions:

| Dimension | Claim Verification | Hallucination Detection |
| --- | --- | --- |
| Primary Question | Are claims backed by sources? | Are claims actively wrong? |
| Primary Metric (lower = better) | Ungrounded claim rate | Hallucination Rate + False Confidence Rate |
| Primary Metric (higher = better) | Grounding Rate + Citation Coverage | Grounding Rate (same formula) |
| What Counts as Failure | NO_CITATION and CONTRADICTION | CONTRADICTION specifically |
| Focus | Source completeness: did the model cite its claims? | Factual accuracy: did the model assert false things? |
| Use Case | Select models for citation-heavy research outputs. | Detect models that fabricate facts regardless of confidence. |
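The "What Counts as Failure" distinction can be sketched in a few lines. This is a hypothetical helper, not the production API; only the verdict labels and failure sets come from this page:

```python
# Which NLI verdicts count as a failure for each benchmark
# (per the comparison table above).
CLAIM_VERIFICATION_FAILURES = {"NO_CITATION", "CONTRADICTION"}
HALLUCINATION_DETECTION_FAILURES = {"CONTRADICTION"}

def failure_rate(verdicts: list[str], failure_set: set[str]) -> float:
    """Fraction of claims whose verdict counts as a failure."""
    if not verdicts:
        return 0.0
    return sum(v in failure_set for v in verdicts) / len(verdicts)

verdicts = ["ENTAILMENT", "NO_CITATION", "CONTRADICTION", "NEUTRAL"]
print(failure_rate(verdicts, CLAIM_VERIFICATION_FAILURES))       # 0.5
print(failure_rate(verdicts, HALLUCINATION_DETECTION_FAILURES))  # 0.25
```

The same response thus scores worse under Claim Verification than under Hallucination Detection whenever it has uncited but accurate claims.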

## The NLI Evaluation Pipeline

Each model response goes through the same steps as Claim Verification:

1. **Claim Extraction.** Sentences containing numerical data, attributions, or factual assertions are extracted as candidate claims. Up to 20 claims per response are evaluated.
2. **Citation URL Extraction.** Inline URLs are extracted and associated with nearby claims (within 300 characters). Claims with a nearby URL are given a direct-fetch retrieval attempt.
3. **Evidence Retrieval.** For claims with a citation URL: direct fetch (Attempt 1). For claims without a URL: corroborating web search (Attempt 2). Timeout: 8 seconds per attempt. Claims are processed in parallel batches of 5.
4. **NLI Entailment Analysis.** Retrieved content is compared against the claim using the CGA (Claim Grounding Analyzer). Scores ≥ the SOAR-V threshold yield ENTAILMENT; scores < 0.3 are candidates for CONTRADICTION.
5. **Verdict Assignment.** ENTAILMENT: NLI confirms the claim. CONTRADICTION: NLI contradicts the claim (score < 0.3). NO_CITATION: no evidence found. NEUTRAL: inconclusive evidence.
6. **Score Computation.** Five metrics are computed from verdict counts: Hallucination Rate, Grounding Rate, Unsupported Claim Rate, Contradiction Rate, and False Confidence Rate. Results are stored in HallucinationDetectionRubric JSON.
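For scoring purposes, the pipeline reduces to counting verdicts. A minimal sketch of the score-computation step, assuming the verdict labels above (the function name and rate-field names mirror this page, but the helper itself is illustrative):

```python
from collections import Counter

def compute_metrics(verdicts: list[str], false_confidence_count: int) -> dict:
    """Derive the five rubric rates from per-claim verdicts.

    `false_confidence_count` is the number of CONTRADICTION verdicts
    whose claim used high-confidence language.
    """
    total = len(verdicts)
    if total == 0:
        return {}
    counts = Counter(verdicts)
    return {
        "hallucinationRate": counts["CONTRADICTION"] / total,
        "groundingRate": counts["ENTAILMENT"] / total,
        "unsupportedClaimRate": counts["NO_CITATION"] / total,
        "contradictionRate": counts["CONTRADICTION"] / total,
        "falseConfidenceRate": false_confidence_count / total,
    }

verdicts = ["ENTAILMENT"] * 14 + ["NO_CITATION"] * 4 + ["CONTRADICTION"] * 2
print(compute_metrics(verdicts, false_confidence_count=2))
```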

## Verdict Classification

| Verdict | NLI Condition | Hallucination Risk |
| --- | --- | --- |
| ENTAILMENT | NLI score ≥ SOAR-V sensitivity threshold (default 0.75). Evidence confirms the claim. | None: claim is verified. |
| NEUTRAL | NLI score between 0.3 and the threshold. Evidence exists but is inconclusive. | Low: claim may be accurate but is unverified. |
| CONTRADICTION | NLI score < 0.3. Retrieved evidence actively contradicts the claim text. | High: a confirmed hallucination. |
| NO_CITATION | No URL present and web search returned insufficient evidence, or the URL was inaccessible. | Medium: claim is unverified and may be accurate or fabricated. |

## The Five Metrics

### Hallucination Rate

The primary metric. Lower is better. Measures what fraction of all extracted claims were actively contradicted by retrieved evidence.

### Grounding Rate

Higher is better. Shares the same formula as Claim Verification. Measures what fraction of claims are positively entailed by retrieved evidence.

### Unsupported Claim Rate

Measures what fraction of claims received NO_CITATION — no evidence was found or the URL was inaccessible. These claims are unverified but not necessarily false.

### Contradiction Rate

Identical to Hallucination Rate (CONTRADICTION count ÷ total claims). Kept as an explicit separate field in the rubric for downstream analytics and data pipelines.

### False Confidence Rate

The most severe form of hallucination. False Confidence measures what fraction of claims were stated with high-confidence language but actively contradicted by retrieved evidence. These are the claims most likely to mislead readers — the model didn't just get something wrong, it asserted it with certainty.

High-confidence markers that flag a contradicted claim as False Confidence include: "exactly", "precisely", "definitively", "certainly", "definitely", "is a fact", "proven", "undeniably", "without question", "clearly", "obviously", "studies show that", "research confirms", "data shows", "statistics show".
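A minimal sketch of the False Confidence check, assuming simple case-insensitive substring matching (the production matcher may differ, e.g. it may be word-boundary aware):

```python
HIGH_CONFIDENCE_MARKERS = [
    "exactly", "precisely", "definitively", "certainly", "definitely",
    "is a fact", "proven", "undeniably", "without question", "clearly",
    "obviously", "studies show that", "research confirms", "data shows",
    "statistics show",
]

def is_false_confidence(claim: str, verdict: str) -> bool:
    """True when a contradicted claim uses high-confidence language."""
    if verdict != "CONTRADICTION":
        return False
    text = claim.lower()
    return any(marker in text for marker in HIGH_CONFIDENCE_MARKERS)

print(is_false_confidence(
    "Studies show that revenue was exactly $4.2B.", "CONTRADICTION"))  # True
print(is_false_confidence(
    "Revenue may have been around $4.2B.", "CONTRADICTION"))           # False
```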

## Scoring Rubric

All scores are stored in a HallucinationDetectionRubric JSON object in BenchmarkJudgment.rubricUsed:

| Field | Type | Description |
| --- | --- | --- |
| `benchmarkType` | string | "HALLUCINATION_DETECTION" |
| `scores.hallucinationRate` | float (0–1) | CONTRADICTION ÷ total claims. Primary metric. |
| `scores.groundingRate` | float (0–1) | ENTAILMENT ÷ total claims. |
| `scores.unsupportedClaimRate` | float (0–1) | NO_CITATION ÷ total claims. |
| `scores.contradictionRate` | float (0–1) | Same as hallucinationRate. Kept explicit for pipelines. |
| `scores.falseConfidenceRate` | float (0–1) | CONTRADICTION (high-confidence) ÷ total claims. Equals contradictionRate by construction. |
| `scores.claimCount` | integer | Total claims extracted from the response. |
| `scores.groundedCount` | integer | Claims that received ENTAILMENT. |
| `scores.unsupportedCount` | integer | Claims that received NO_CITATION. |
| `scores.contradictedCount` | integer | Claims that received CONTRADICTION. |
| `scores.domain` | string | Domain label (e.g. "general"). |
| `evaluatedAt` | ISO string | When the evaluation ran. |
| `responseLength` | integer | Character count of the model response. |
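An illustrative rubric payload with the fields above, populated with made-up example counts (the values are not real benchmark results):

```python
import json
from datetime import datetime, timezone

# Example counts for a 20-claim response (fabricated for illustration).
claim_count, grounded, unsupported, contradicted = 20, 14, 4, 2

rubric = {
    "benchmarkType": "HALLUCINATION_DETECTION",
    "scores": {
        "hallucinationRate": contradicted / claim_count,    # 0.10
        "groundingRate": grounded / claim_count,            # 0.70
        "unsupportedClaimRate": unsupported / claim_count,  # 0.20
        "contradictionRate": contradicted / claim_count,    # same as above
        "falseConfidenceRate": contradicted / claim_count,  # by construction
        "claimCount": claim_count,
        "groundedCount": grounded,
        "unsupportedCount": unsupported,
        "contradictedCount": contradicted,
        "domain": "general",
    },
    "evaluatedAt": datetime.now(timezone.utc).isoformat(),
    "responseLength": 5120,
}
print(json.dumps(rubric, indent=2))
```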

## Learning Signals

Every CONTRADICTION and NO_CITATION verdict generates a BenchmarkLearningSignal with benchmarkType = HALLUCINATION_DETECTION. The nightly learning loop uses these signals to:

1. Decrement routing weights for models with 3+ HALLUCINATION_DETECTION failures in a domain.

2. Raise the SOAR-V sensitivity threshold when HALLUCINATION_DETECTION + CLAIM_VERIFICATION combined failures ≥ 3, making production verification stricter until models improve.

3. Reward models with 7-day groundingRateAvg ≥ 80% with a routing weight increment.
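The three rules can be sketched as one nightly update function. Only the trigger conditions come from this page; the 0.05 step size and the 0.95 threshold cap are assumptions for illustration:

```python
def nightly_update(weight: float,
                   halluc_failures: int,
                   claim_verif_failures: int,
                   grounding_rate_avg_7d: float,
                   soar_v_threshold: float,
                   step: float = 0.05) -> tuple[float, float]:
    """Apply the three learning-loop rules to one model in one domain."""
    # Rule 1: demote models with 3+ HALLUCINATION_DETECTION failures.
    if halluc_failures >= 3:
        weight -= step
    # Rule 2: tighten SOAR-V when combined failures reach 3.
    if halluc_failures + claim_verif_failures >= 3:
        soar_v_threshold = min(0.95, soar_v_threshold + step)
    # Rule 3: reward a 7-day groundingRateAvg of 80% or more.
    if grounding_rate_avg_7d >= 0.80:
        weight += step
    return weight, soar_v_threshold

print(nightly_update(1.0, 3, 1, 0.85, 0.75))
```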

## Reading the Results

View the live Hallucination Detection leaderboard →