
# Hallucination Detection Benchmark

Measures the rate at which AI models fabricate facts: claims asserted with confidence but actively contradicted by retrieved evidence. Uses the production HalluHard NLI pipeline.

## What It Measures

The Hallucination Detection benchmark focuses on the worst category of AI error: confident fabrication. It does not target missing citations (which Claim Verification handles) but claims that are provably wrong, where retrieved evidence actively contradicts what the model asserted.

Hallucinations are particularly dangerous in research contexts because they read like facts. A model that says "the FDA approved drug X in 2021" when it was never approved, or "company Y reported $4.2B in revenue" when the actual figure was $1.8B, creates errors that propagate through reports and decisions. This benchmark measures how often this happens.

**Contradiction Detection.** Claims are classified as hallucinations only when retrieved evidence actively contradicts them, not merely when evidence is absent.

**Real Evidence Retrieval.** Uses the same production NLI pipeline as Claim Verification: real URL fetching and web search, not pattern matching.

**Confidence-Weighted.** High-confidence language in contradicted claims is treated as a separate signal; False Confidence is the most severe form of hallucination.

## Difference from Claim Verification

Hallucination Detection and Claim Verification use the same evaluation pipeline but ask different questions:

| Dimension | Claim Verification | Hallucination Detection |
| --- | --- | --- |
| Primary Question | Are claims backed by sources? | Are claims actively wrong? |
| Primary Metric (lower = better) | Ungrounded claim rate | Hallucination Rate + False Confidence Rate |
| Primary Metric (higher = better) | Grounding Rate + Citation Coverage | Grounding Rate (same formula) |
| What Counts as Failure | NO_CITATION and CONTRADICTION | CONTRADICTION specifically |
| Focus | Source completeness: did the model cite its claims? | Factual accuracy: did the model assert false things? |
| Use Case | Select models for citation-heavy research outputs. | Detect models that fabricate facts regardless of confidence. |
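The "What Counts as Failure" distinction can be sketched in a few lines. This is a hypothetical helper, not the production API; only the verdict labels and failure sets come from this page:

```python
# Which NLI verdicts count as a failure for each benchmark
# (per the comparison table above).
CLAIM_VERIFICATION_FAILURES = {"NO_CITATION", "CONTRADICTION"}
HALLUCINATION_DETECTION_FAILURES = {"CONTRADICTION"}

def failure_rate(verdicts: list[str], failure_set: set[str]) -> float:
    """Fraction of claims whose verdict counts as a failure."""
    if not verdicts:
        return 0.0
    return sum(v in failure_set for v in verdicts) / len(verdicts)

verdicts = ["ENTAILMENT", "NO_CITATION", "CONTRADICTION", "NEUTRAL"]
print(failure_rate(verdicts, CLAIM_VERIFICATION_FAILURES))       # 0.5
print(failure_rate(verdicts, HALLUCINATION_DETECTION_FAILURES))  # 0.25
```

The same response thus scores worse under Claim Verification than under Hallucination Detection whenever it has uncited but accurate claims.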

## The NLI Evaluation Pipeline

Each model response goes through the same steps as Claim Verification:

1. **Claim Extraction.** Sentences containing numerical data, attributions, or factual assertions are extracted as candidate claims. Up to 20 claims per response are evaluated.
2. **Citation URL Extraction.** Inline URLs are extracted and associated with nearby claims (within 300 characters). Claims with a nearby URL are given a direct-fetch retrieval attempt.
3. **Evidence Retrieval.** For claims with a citation URL: direct fetch (Attempt 1). For claims without a URL: corroborating web search (Attempt 2). Timeout: 8 seconds per attempt. Claims are processed in parallel batches of 5.
4. **NLI Entailment Analysis.** Retrieved content is compared against the claim using the CGA (Claim Grounding Analyzer). Scores ≥ the SOAR-V threshold yield ENTAILMENT; scores < 0.3 are candidates for CONTRADICTION.
5. **Verdict Assignment.** ENTAILMENT: NLI confirms the claim. CONTRADICTION: NLI contradicts the claim (score < 0.3). NO_CITATION: no evidence found. NEUTRAL: inconclusive evidence.
6. **Score Computation.** Five metrics are computed from verdict counts: Hallucination Rate, Grounding Rate, Unsupported Claim Rate, Contradiction Rate, and False Confidence Rate. Results are stored in HallucinationDetectionRubric JSON.
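For scoring purposes, the pipeline reduces to counting verdicts. A minimal sketch of the score-computation step, assuming the verdict labels above (the function name and rate-field names mirror this page, but the helper itself is illustrative):

```python
from collections import Counter

def compute_metrics(verdicts: list[str], false_confidence_count: int) -> dict:
    """Derive the five rubric rates from per-claim verdicts.

    `false_confidence_count` is the number of CONTRADICTION verdicts
    whose claim used high-confidence language.
    """
    total = len(verdicts)
    if total == 0:
        return {}
    counts = Counter(verdicts)
    return {
        "hallucinationRate": counts["CONTRADICTION"] / total,
        "groundingRate": counts["ENTAILMENT"] / total,
        "unsupportedClaimRate": counts["NO_CITATION"] / total,
        "contradictionRate": counts["CONTRADICTION"] / total,
        "falseConfidenceRate": false_confidence_count / total,
    }

verdicts = ["ENTAILMENT"] * 14 + ["NO_CITATION"] * 4 + ["CONTRADICTION"] * 2
print(compute_metrics(verdicts, false_confidence_count=2))
```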

## Verdict Classification

| Verdict | NLI Condition | Hallucination Risk |
| --- | --- | --- |
| ENTAILMENT | NLI score ≥ SOAR-V sensitivity threshold (default 0.75). Evidence confirms the claim. | None: claim is verified. |
| NEUTRAL | NLI score between 0.3 and the threshold. Evidence exists but is inconclusive. | Low: claim may be accurate but is unverified. |
| CONTRADICTION | NLI score < 0.3. Retrieved evidence actively contradicts the claim text. | High: a confirmed hallucination. |
| NO_CITATION | No URL present and web search returned insufficient evidence, or the URL was inaccessible. | Medium: claim is unverified and may be accurate or fabricated. |

## The Five Metrics

### Hallucination Rate

The primary metric. Lower is better. Measures what fraction of all extracted claims were actively contradicted by retrieved evidence.

### Grounding Rate

Higher is better. Shares the same formula as Claim Verification. Measures what fraction of claims are positively entailed by retrieved evidence.

### Unsupported Claim Rate

Measures what fraction of claims received NO_CITATION — no evidence was found or the URL was inaccessible. These claims are unverified but not necessarily false.

### Contradiction Rate

Identical to Hallucination Rate (CONTRADICTION count ÷ total claims). Kept as an explicit separate field in the rubric for downstream analytics and data pipelines.

### False Confidence Rate

The most severe form of hallucination. False Confidence measures what fraction of claims were stated with high-confidence language but actively contradicted by retrieved evidence. These are the claims most likely to mislead readers — the model didn't just get something wrong, it asserted it with certainty.

High-confidence markers that flag a contradicted claim as False Confidence include: "exactly", "precisely", "definitively", "certainly", "definitely", "is a fact", "proven", "undeniably", "without question", "clearly", "obviously", "studies show that", "research confirms", "data shows", "statistics show".
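A minimal sketch of the False Confidence check, assuming simple case-insensitive substring matching (the production matcher may differ, e.g. it may be word-boundary aware):

```python
HIGH_CONFIDENCE_MARKERS = [
    "exactly", "precisely", "definitively", "certainly", "definitely",
    "is a fact", "proven", "undeniably", "without question", "clearly",
    "obviously", "studies show that", "research confirms", "data shows",
    "statistics show",
]

def is_false_confidence(claim: str, verdict: str) -> bool:
    """True when a contradicted claim uses high-confidence language."""
    if verdict != "CONTRADICTION":
        return False
    text = claim.lower()
    return any(marker in text for marker in HIGH_CONFIDENCE_MARKERS)

print(is_false_confidence(
    "Studies show that revenue was exactly $4.2B.", "CONTRADICTION"))  # True
print(is_false_confidence(
    "Revenue may have been around $4.2B.", "CONTRADICTION"))           # False
```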

## Scoring Rubric

All scores are stored in a HallucinationDetectionRubric JSON object in BenchmarkJudgment.rubricUsed:

| Field | Type | Description |
| --- | --- | --- |
| `benchmarkType` | string | "HALLUCINATION_DETECTION" |
| `scores.hallucinationRate` | float (0–1) | CONTRADICTION ÷ total claims. Primary metric. |
| `scores.groundingRate` | float (0–1) | ENTAILMENT ÷ total claims. |
| `scores.unsupportedClaimRate` | float (0–1) | NO_CITATION ÷ total claims. |
| `scores.contradictionRate` | float (0–1) | Same as hallucinationRate. Kept explicit for pipelines. |
| `scores.falseConfidenceRate` | float (0–1) | CONTRADICTION (high-confidence) ÷ total claims. Equals contradictionRate by construction. |
| `scores.claimCount` | integer | Total claims extracted from the response. |
| `scores.groundedCount` | integer | Claims that received ENTAILMENT. |
| `scores.unsupportedCount` | integer | Claims that received NO_CITATION. |
| `scores.contradictedCount` | integer | Claims that received CONTRADICTION. |
| `scores.domain` | string | Domain label (e.g. "general"). |
| `evaluatedAt` | ISO string | When the evaluation ran. |
| `responseLength` | integer | Character count of the model response. |
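An illustrative rubric payload with the fields above, populated with made-up example counts (the values are not real benchmark results):

```python
import json
from datetime import datetime, timezone

# Example counts for a 20-claim response (fabricated for illustration).
claim_count, grounded, unsupported, contradicted = 20, 14, 4, 2

rubric = {
    "benchmarkType": "HALLUCINATION_DETECTION",
    "scores": {
        "hallucinationRate": contradicted / claim_count,    # 0.10
        "groundingRate": grounded / claim_count,            # 0.70
        "unsupportedClaimRate": unsupported / claim_count,  # 0.20
        "contradictionRate": contradicted / claim_count,    # same as above
        "falseConfidenceRate": contradicted / claim_count,  # by construction
        "claimCount": claim_count,
        "groundedCount": grounded,
        "unsupportedCount": unsupported,
        "contradictedCount": contradicted,
        "domain": "general",
    },
    "evaluatedAt": datetime.now(timezone.utc).isoformat(),
    "responseLength": 5120,
}
print(json.dumps(rubric, indent=2))
```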

## Learning Signals

Every CONTRADICTION and NO_CITATION verdict generates a BenchmarkLearningSignal with benchmarkType = HALLUCINATION_DETECTION. The nightly learning loop uses these signals to:

1. Decrement routing weights for models with 3+ HALLUCINATION_DETECTION failures in a domain.

2. Raise the SOAR-V sensitivity threshold when HALLUCINATION_DETECTION + CLAIM_VERIFICATION combined failures ≥ 3, making production verification stricter until models improve.

3. Reward models with 7-day groundingRateAvg ≥ 80% with a routing weight increment.
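The three rules can be sketched as one nightly update function. Only the trigger conditions come from this page; the 0.05 step size and the 0.95 threshold cap are assumptions for illustration:

```python
def nightly_update(weight: float,
                   halluc_failures: int,
                   claim_verif_failures: int,
                   grounding_rate_avg_7d: float,
                   soar_v_threshold: float,
                   step: float = 0.05) -> tuple[float, float]:
    """Apply the three learning-loop rules to one model in one domain."""
    # Rule 1: demote models with 3+ HALLUCINATION_DETECTION failures.
    if halluc_failures >= 3:
        weight -= step
    # Rule 2: tighten SOAR-V when combined failures reach 3.
    if halluc_failures + claim_verif_failures >= 3:
        soar_v_threshold = min(0.95, soar_v_threshold + step)
    # Rule 3: reward a 7-day groundingRateAvg of 80% or more.
    if grounding_rate_avg_7d >= 0.80:
        weight += step
    return weight, soar_v_threshold

print(nightly_update(1.0, 3, 1, 0.85, 0.75))
```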

## Reading the Results

View the live Hallucination Detection leaderboard →