Hallucination Detection Benchmark
Measures the rate at which AI models fabricate facts — specifically, claims that are asserted with confidence but actively contradicted by retrieved evidence. Uses the production HalluHard NLI pipeline.
What It Measures#
The Hallucination Detection benchmark focuses on the worst category of AI error: confident fabrication. Not missing citations (which Claim Verification handles), but asserting things that are provably wrong — where retrieved evidence actively contradicts the claim the model made.
Hallucinations are particularly dangerous in research contexts because they read like facts. A model that says "the FDA approved drug X in 2021" when it was never approved, or "company Y reported $4.2B in revenue" when the actual figure was $1.8B, creates errors that propagate through reports and decisions. This benchmark measures how often this happens.
Contradiction Detection
Claims are classified as hallucinations only when retrieved evidence actively contradicts them — not just when evidence is absent.
Real Evidence Retrieval
Uses the same production NLI pipeline as Claim Verification — real URL fetching and web search, not pattern matching.
Confidence-Weighted
High-confidence language in contradicted claims is a separate signal — False Confidence is the most severe form of hallucination.
Difference from Claim Verification#
Hallucination Detection and Claim Verification use the same evaluation pipeline but ask different questions:
| Dimension | Claim Verification | Hallucination Detection |
|---|---|---|
| Primary Question | Are claims backed by sources? | Are claims actively wrong? |
| Primary Metric (lower = better) | Ungrounded claim rate | Hallucination Rate + False Confidence Rate |
| Primary Metric (higher = better) | Grounding Rate + Citation Coverage | Grounding Rate (same formula) |
| What Counts as Failure | NO_CITATION and CONTRADICTION | CONTRADICTION specifically |
| Focus | Source completeness — did the model cite its claims? | Factual accuracy — did the model assert false things? |
| Use Case | Select models for citation-heavy research outputs. | Detect models that fabricate facts regardless of confidence. |
Shared Infrastructure
Both benchmarks call the evaluateTextForHallucination() function from the HalluHard benchmark evaluator and produce the same four-way verdicts (ENTAILMENT, NEUTRAL, CONTRADICTION, NO_CITATION). They differ only in which metrics they surface as primary.
The NLI Evaluation Pipeline#
Each model response goes through the same steps as Claim Verification:
Claim Extraction
Citation URL Extraction
Evidence Retrieval
NLI Entailment Analysis
Verdict Assignment
Score Computation
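A minimal sketch of the flow through these steps, with the extraction, retrieval, and NLI stages injected as plain functions. The names `extractClaims`, `retrieveEvidence`, and `nliScore` are illustrative, not the production API; only the thresholds (0.75 entailment, 0.3 contradiction) come from the verdict table below.

```typescript
type Verdict = "ENTAILMENT" | "NEUTRAL" | "CONTRADICTION" | "NO_CITATION";

interface ClaimResult {
  claim: string;
  verdict: Verdict;
  nliScore?: number; // absent when no evidence could be retrieved
}

// Injected step functions so the flow is visible without the production
// pipeline, which does real URL fetching and web search.
interface PipelineSteps {
  extractClaims(response: string): string[];
  retrieveEvidence(claim: string): string | undefined;
  nliScore(claim: string, evidence: string): number; // 0..1
}

const ENTAILMENT_THRESHOLD = 0.75;   // SOAR-V sensitivity default
const CONTRADICTION_THRESHOLD = 0.3;

function evaluateResponse(response: string, steps: PipelineSteps): ClaimResult[] {
  return steps.extractClaims(response).map((claim) => {
    const evidence = steps.retrieveEvidence(claim);   // citation URL or web search
    if (evidence === undefined) {
      return { claim, verdict: "NO_CITATION" };       // unverified, not necessarily false
    }
    const score = steps.nliScore(claim, evidence);    // NLI entailment analysis
    const verdict: Verdict =                          // verdict assignment
      score >= ENTAILMENT_THRESHOLD ? "ENTAILMENT"
      : score < CONTRADICTION_THRESHOLD ? "CONTRADICTION"
      : "NEUTRAL";
    return { claim, verdict, nliScore: score };
  });
}
```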
Verdict Classification#
| Verdict | NLI Condition | Hallucination Risk |
|---|---|---|
| ENTAILMENT | NLI score ≥ SOAR-V sensitivity threshold (default 0.75). Evidence confirms the claim. | None — claim is verified. |
| NEUTRAL | NLI score between 0.3 and threshold. Evidence exists but is inconclusive. | Low — claim may be accurate but is unverified. |
| CONTRADICTION | NLI score < 0.3. Retrieved evidence actively contradicts the claim text. | HIGH — this is a confirmed hallucination. |
| NO_CITATION | No URL present and web search returned insufficient evidence, or the cited URL was inaccessible. | Medium — claim is unverified. May be accurate or fabricated. |
CONTRADICTION is the Key Signal
The Five Metrics#
Hallucination Rate#
The primary metric. Lower is better. Measures what fraction of all extracted claims were actively contradicted by retrieved evidence.
Formula: hallucinationRate = contradictedCount ÷ claimCount
A claim is CONTRADICTION when NLI score < 0.3 after fetching and analyzing the cited or retrieved source. This means real evidence was found and it actively contradicts the claim.
Range: 0.0 (no hallucinations) → 1.0 (all claims contradicted). Lower = better. Leaderboard ranks models by ascending hallucination rate (best = lowest).
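The rate and the ascending leaderboard ordering can be sketched from the rubric's count fields (the helper names are illustrative; only `contradictedCount` and `claimCount` come from the rubric table below):

```typescript
interface ModelCounts { model: string; contradictedCount: number; claimCount: number }

// hallucinationRate = contradictedCount / claimCount; 0 claims counts as 0.
function hallucinationRate(c: ModelCounts): number {
  return c.claimCount === 0 ? 0 : c.contradictedCount / c.claimCount;
}

// The leaderboard ranks ascending: the best model has the lowest rate.
function rankByHallucinationRate(models: ModelCounts[]): string[] {
  return [...models]
    .sort((a, b) => hallucinationRate(a) - hallucinationRate(b))
    .map((m) => m.model);
}
```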
Grounding Rate#
Higher is better. Shares the same formula as Claim Verification. Measures what fraction of claims are positively entailed by retrieved evidence.
Formula: groundingRate = groundedCount ÷ claimCount
Range: 0.0 → 1.0. Higher = better.
Unsupported Claim Rate#
Measures what fraction of claims received NO_CITATION — no evidence was found or the URL was inaccessible. These claims are unverified but not necessarily false.
Formula: unsupportedClaimRate = unsupportedCount ÷ claimCount
Range: 0.0 → 1.0. A high unsupported rate combined with a low hallucination rate indicates a model that makes many uncited but plausible claims. A high rate of both signals higher risk.
Contradiction Rate#
Identical to Hallucination Rate (CONTRADICTION count ÷ total claims). Kept as an explicit separate field in the rubric for downstream analytics and data pipelines.
False Confidence Rate#
The most severe form of hallucination. False Confidence measures what fraction of claims were stated with high-confidence language but actively contradicted by retrieved evidence. These are the claims most likely to mislead readers — the model didn't just get something wrong, it asserted it with certainty.
Formula: falseConfidenceRate = high-confidence CONTRADICTION count ÷ claimCount
CONTRADICTION requires two conditions: (1) the NLI score is below 0.3 (evidence actively contradicts the claim), AND (2) the claim contains a high-confidence assertion marker. Because CONTRADICTION already requires high-confidence language in the evaluator, False Confidence Rate equals Contradiction Rate by construction.
Range: 0.0 (no false confidence) → 1.0 (all claims contradicted with high confidence). Lower is better.
High-confidence markers that trigger CONTRADICTION classification include: "exactly", "precisely", "definitively", "certainly", "definitely", "is a fact", "proven", "undeniably", "without question", "clearly", "obviously", "studies show that", "research confirms", "data shows", "statistics show".
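A minimal check against the marker list above. Case-insensitive substring matching is an assumption here; the production evaluator's matching rules are not specified in this document.

```typescript
// Marker list taken verbatim from the documentation above.
const HIGH_CONFIDENCE_MARKERS = [
  "exactly", "precisely", "definitively", "certainly", "definitely",
  "is a fact", "proven", "undeniably", "without question", "clearly",
  "obviously", "studies show that", "research confirms", "data shows",
  "statistics show",
];

// Assumed matching rule: case-insensitive substring search.
function hasHighConfidenceMarker(claim: string): boolean {
  const lower = claim.toLowerCase();
  return HIGH_CONFIDENCE_MARKERS.some((marker) => lower.includes(marker));
}
```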
Why This Lives in Hallucination Detection
Scoring Rubric#
All scores are stored in a HallucinationDetectionRubric JSON object in BenchmarkJudgment.rubricUsed:
| Field | Type | Description |
|---|---|---|
| benchmarkType | string | "HALLUCINATION_DETECTION" |
| scores.hallucinationRate | float (0–1) | CONTRADICTION ÷ total claims. Primary metric. |
| scores.groundingRate | float (0–1) | ENTAILMENT ÷ total claims. |
| scores.unsupportedClaimRate | float (0–1) | NO_CITATION ÷ total claims. |
| scores.contradictionRate | float (0–1) | Same as hallucinationRate. Kept explicit for pipelines. |
| scores.falseConfidenceRate | float (0–1) | CONTRADICTION (high-confidence) ÷ total claims. Equals contradictionRate by construction. |
| scores.claimCount | integer | Total claims extracted from response. |
| scores.groundedCount | integer | Claims that received ENTAILMENT. |
| scores.unsupportedCount | integer | Claims that received NO_CITATION. |
| scores.contradictedCount | integer | Claims that received CONTRADICTION. |
| scores.domain | string | Domain label (e.g. "general"). |
| evaluatedAt | ISO string | When the evaluation ran. |
| responseLength | integer | Character count of the model response. |
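Assuming the four count fields above, the rubric's rate fields can be assembled as follows. The object shape mirrors the `scores.*` rows of the table; the helper name `buildRubricScores` is illustrative, not the production API.

```typescript
interface VerdictCounts {
  claimCount: number;
  groundedCount: number;      // claims that received ENTAILMENT
  unsupportedCount: number;   // claims that received NO_CITATION
  contradictedCount: number;  // claims that received CONTRADICTION
}

function buildRubricScores(counts: VerdictCounts, domain = "general") {
  const { claimCount, groundedCount, unsupportedCount, contradictedCount } = counts;
  const rate = (n: number) => (claimCount === 0 ? 0 : n / claimCount);
  const contradictionRate = rate(contradictedCount);
  return {
    hallucinationRate: contradictionRate,
    groundingRate: rate(groundedCount),
    unsupportedClaimRate: rate(unsupportedCount),
    contradictionRate,                      // kept explicit for pipelines
    falseConfidenceRate: contradictionRate, // equals contradictionRate by construction
    claimCount,
    groundedCount,
    unsupportedCount,
    contradictedCount,
    domain,
  };
}
```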
Overall Score
BenchmarkJudgment.overallScore for Hallucination Detection is stored as 1 − hallucinationRate. This converts the "lower is better" hallucination rate into a "higher is better" quality score for consistent leaderboard ordering.
Learning Signals#
Every CONTRADICTION and NO_CITATION verdict generates a BenchmarkLearningSignal with benchmarkType = HALLUCINATION_DETECTION. The nightly learning loop uses these signals to:
1. Decrement routing weights for models with 3+ HALLUCINATION_DETECTION failures in a domain.
2. Raise the SOAR-V sensitivity threshold when HALLUCINATION_DETECTION + CLAIM_VERIFICATION combined failures ≥ 3, making production verification stricter until models improve.
3. Reward models with 7-day groundingRateAvg ≥ 80% with a routing weight increment.
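The three rules above can be sketched as a pure decision function. The document specifies only the trigger conditions, so the field names, action labels, and the function itself are illustrative; the actual weight and threshold deltas are left to the learning loop.

```typescript
interface ModelSignals {
  hallucinationFailures: number;      // HALLUCINATION_DETECTION failures in a domain
  claimVerificationFailures: number;  // CLAIM_VERIFICATION failures in the same domain
  groundingRateAvg7d: number;         // 7-day groundingRate average, 0..1
}

type LearningAction =
  | "DECREMENT_ROUTING_WEIGHT"
  | "RAISE_SOAR_V_THRESHOLD"
  | "INCREMENT_ROUTING_WEIGHT";

function learningActions(s: ModelSignals): LearningAction[] {
  const actions: LearningAction[] = [];
  // Rule 1: 3+ hallucination-detection failures in a domain
  if (s.hallucinationFailures >= 3) actions.push("DECREMENT_ROUTING_WEIGHT");
  // Rule 2: combined failures across both benchmarks >= 3
  if (s.hallucinationFailures + s.claimVerificationFailures >= 3) {
    actions.push("RAISE_SOAR_V_THRESHOLD");
  }
  // Rule 3: 7-day grounding rate average >= 80%
  if (s.groundingRateAvg7d >= 0.8) actions.push("INCREMENT_ROUTING_WEIGHT");
  return actions;
}
```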
Reading the Results#
Interpreting Hallucination Rates
- Hallucination Rate < 5%: Excellent. Very few confirmed fabrications. Safe for fact-critical research.
- Hallucination Rate 5–15%: Moderate. Roughly 1-in-20 to 1-in-7 claims may be contradicted. Review before publishing.
- Hallucination Rate > 15%: High risk. Avoid for research outputs without heavy editorial review.
- False Confidence > 10%: Model makes confident, contradicted assertions. Most dangerous pattern for research reports.
- Unsupported Claim Rate > 60%: Model rarely cites claims. Grounding cannot be established. Treat all outputs as unverified.
- Grounding Rate > 70% + Hallucination Rate < 5%: The gold standard — model verifies most claims and rarely contradicts evidence.
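The hallucination-rate bands above can be encoded as a simple triage helper. The labels and band edges come from the list; the treatment of exact boundary values (e.g. a rate of exactly 5%) is an assumption.

```typescript
// Maps a hallucination rate (0..1) to the interpretation bands documented above.
function interpretHallucinationRate(rate: number): string {
  if (rate < 0.05) return "Excellent: safe for fact-critical research";
  if (rate <= 0.15) return "Moderate: review before publishing";
  return "High risk: avoid without heavy editorial review";
}
```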