Claim Verification Benchmark
Measures how well AI models ground factual claims in real, retrievable sources using the same production NLI (Natural Language Inference) pipeline that PromptReports.ai uses for live report verification.
What It Measures#
The Claim Verification benchmark tests whether AI models can produce factual claims that are genuinely supported by real, retrievable sources — not just plausible-sounding text. For each model response, the benchmark extracts individual factual claims, retrieves evidence for each one using real web searches and URL fetching, runs NLI (Natural Language Inference) to determine whether the evidence actually supports the claim, and produces two scores: Citation Coverage and Grounding Rate.
Same Pipeline as Live Reports
Why This Matters#
Standard AI benchmarks measure response quality or coherence — they do not check whether claims are actually true. A model that writes fluent, confident prose about fabricated statistics scores well on traditional benchmarks but fails catastrophically in research contexts.
Claim Verification tests the property that matters most for research reports: is this model producing claims that can be verified against real evidence? Models that score high on Citation Coverage and Grounding Rate produce reports with fewer corrections and stronger source backing.
Real Evidence Retrieval
The benchmark fetches actual URLs and runs web searches — no heuristics, no simulated data.
NLI Entailment
Each claim is evaluated using Natural Language Inference: does this source actually support this claim?
Two Dimensions
Citation Coverage and Grounding Rate give a complete picture of source fidelity. False Confidence is measured by the Hallucination Detection benchmark.
The NLI Evaluation Pipeline#
For each benchmark result, the following steps are executed by the HalluHard benchmark evaluator:
1. Claim Extraction
2. Citation URL Extraction
3. Evidence Retrieval — Attempt 1 (Direct Fetch)
4. Evidence Retrieval — Attempt 2 (Corroborating Search)
5. NLI Entailment Analysis
6. Verdict Assignment
7. Score Computation
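The steps above can be sketched as a single loop. This is a minimal illustration, not the actual HalluHard evaluator: the claim shape and the `fetch`, `search`, `nli`, and `assign_verdict` callables are assumed interfaces invented for this sketch.

```python
def run_pipeline(claims, fetch, search, nli, assign_verdict):
    """claims: list of {'text': ..., 'url': ... or None}.
    fetch/search return evidence text or None; nli returns a 0-1 score.
    All callables are injected, illustrative interfaces."""
    verdicts = []
    for claim in claims:
        # Attempt 1: direct fetch of the inline citation URL, if any.
        evidence = fetch(claim["url"]) if claim["url"] else None
        # Attempt 2: corroborating web search when the fetch yields nothing.
        if evidence is None:
            evidence = search(claim["text"])
        # NLI entailment analysis only runs when evidence was retrieved.
        score = nli(evidence, claim["text"]) if evidence is not None else None
        verdicts.append(assign_verdict(score, claim["text"], evidence is not None))
    return verdicts
```

The retrieval and scoring stages are injected as callables so the control flow (fetch, then search, then NLI, then verdict) stays visible.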
The Four-Way Verdict System#
| Verdict | Meaning | Evidence Condition | Counts Toward |
|---|---|---|---|
| ENTAILMENT | The claim is supported by real, retrieved evidence. | NLI score ≥ SOAR-V sensitivity threshold (default 0.75). Source was fetched and text entails the claim. | Grounding Rate ↑ · Citation Coverage ↑ |
| NEUTRAL | Evidence was found but does not clearly support or contradict the claim. | NLI score between 0.3 and the threshold. Inconclusive evidence. | Neither grounding nor failure |
| CONTRADICTION | The model asserted this confidently but retrieved evidence contradicts it. | NLI score < 0.3 AND the claim contains high-confidence language (e.g. "definitively", "proven", "certainly"). | False Confidence ↑ · Grounding Rate does not count |
| NO_CITATION | No supporting evidence was found or the claim had no citation. | No URL present and web search returned insufficient evidence. Or: URL was inaccessible (CITATION_INACCESSIBLE). | Ungrounded count ↑ |
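The evidence conditions in the table can be read as a small decision function. A minimal sketch, assuming the default thresholds from the table; the marker list is illustrative, and the fallthrough to NEUTRAL for low scores without high-confidence language is an assumption, since the table leaves that case unstated:

```python
ENTAILMENT_THRESHOLD = 0.75    # SOAR-V sensitivity default, per the table
CONTRADICTION_CEILING = 0.3
# Illustrative markers; the production list is not published here.
CONFIDENT_MARKERS = ("definitively", "proven", "certainly")

def assign_verdict(nli_score, claim_text, evidence_found):
    if not evidence_found:
        return "NO_CITATION"
    if nli_score >= ENTAILMENT_THRESHOLD:
        return "ENTAILMENT"
    if nli_score < CONTRADICTION_CEILING and any(
        m in claim_text.lower() for m in CONFIDENT_MARKERS
    ):
        return "CONTRADICTION"
    return "NEUTRAL"
```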
SOAR-V Sensitivity Threshold
The Two Benchmark Metrics#
Citation Coverage#
Citation Coverage measures what fraction of extracted claims have any associated supporting source — whether the URL was inline in the response or found through a corroborating web search.
Formula
A claim "has evidence" if its verdict is ENTAILMENT or if it has a non-null evidence URL, even if the NLI analysis was inconclusive.
Range: 0.0 (no claims have any source) → 1.0 (every claim has at least one source). Displayed as a percentage on the leaderboard (e.g. 0.82 = 82%).
Citation Coverage penalizes models that make many assertions without providing any sources. A model that includes five citations in a 20-claim response has 25% coverage — very low. Cite more, cover more.
Grounding Rate#
Grounding Rate is the stricter metric. It measures what fraction of all claims are positively entailed by retrieved evidence — meaning the source was fetched, the text was analyzed, and the NLI model confirmed the claim is supported.
Formula
A claim is ENTAILMENT if and only if the NLI entailment score ≥ SOAR-V sensitivity threshold (typically 0.75) after fetching and analyzing the cited or retrieved source.
Range: 0.0 (nothing grounded) → 1.0 (every claim verified against a real source). Displayed as a percentage on the leaderboard.
Grounding Rate can never exceed Citation Coverage: any claim with an ENTAILMENT verdict also counts as covered, so Grounding Rate ≤ Citation Coverage always holds. The gap between the two reveals how many cited sources couldn't be verified (inaccessible URLs, paywalled content, etc.).
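Both formulas can be sketched together. This is an illustrative reading of the definitions above, not the evaluator's code; the verdict list and citation flags are assumed inputs:

```python
def compute_scores(verdicts, cited_flags):
    """verdicts: one verdict string per claim.
    cited_flags: True where the claim had a non-null evidence URL."""
    n = len(verdicts)
    # Grounding Rate: fraction positively entailed by retrieved evidence.
    grounded = sum(v == "ENTAILMENT" for v in verdicts)
    # Citation Coverage: ENTAILMENT, or any evidence URL even if inconclusive.
    cited = sum(v == "ENTAILMENT" or c for v, c in zip(verdicts, cited_flags))
    return {
        "groundingRate": grounded / n if n else 0.0,
        "citationCoverage": cited / n if n else 0.0,
    }
```

Because every ENTAILMENT also counts toward `cited`, the sketch makes the ordering Grounding Rate ≤ Citation Coverage visible directly in the code.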
False Confidence is in Hallucination Detection
Scoring Rubric#
The benchmark scores and supporting counts are stored in a ClaimVerificationRubric JSON object written to the BenchmarkJudgment.rubricUsed field after each evaluation. The rubric includes:
| Field | Type | Description |
|---|---|---|
| benchmarkType | string | "CLAIM_VERIFICATION" |
| scores.citationCoverage | float (0–1) | Fraction of claims with any supporting source |
| scores.groundingRate | float (0–1) | Fraction of claims positively entailed by evidence (NLI ≥ threshold) |
| scores.claimCount | integer | Total claims extracted from response |
| scores.groundedCount | integer | Claims that received ENTAILMENT verdict |
| scores.citedCount | integer | Claims with any evidence (ENTAILMENT or non-null evidence) |
| scores.ungroundedCount | integer | Claims that received CONTRADICTION or NO_CITATION verdict |
| scores.domain | string | Domain label for the prompt (e.g. "general", "medical") |
| evaluatedAt | ISO string | Timestamp when evaluation ran |
| responseLength | integer | Character count of the model response |
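A rubric object with the fields above might look like the following. All values are made up for demonstration; only the field names and types come from the table:

```python
import json

# Illustrative ClaimVerificationRubric as written to
# BenchmarkJudgment.rubricUsed; every value here is an example.
rubric = {
    "benchmarkType": "CLAIM_VERIFICATION",
    "scores": {
        "citationCoverage": 0.8,   # 16 of 20 claims had some source
        "groundingRate": 0.6,      # 12 of 20 claims reached ENTAILMENT
        "claimCount": 20,
        "groundedCount": 12,
        "citedCount": 16,
        "ungroundedCount": 4,      # CONTRADICTION + NO_CITATION verdicts
        "domain": "general",
    },
    "evaluatedAt": "2025-01-15T03:00:00Z",
    "responseLength": 4812,
}
print(json.dumps(rubric, indent=2))
```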
Learning Signals#
Every CONTRADICTION and NO_CITATION verdict generates a BenchmarkLearningSignal record. These signals flow into the nightly learning loop which uses them to:
1. Decrement the routing weight of models with 3+ failures in a domain, reducing how often PromptReports.ai routes production research tasks to that model.
2. Raise the SOAR-V sensitivity threshold when total HALLUCINATION_DETECTION and CLAIM_VERIFICATION failures are high — requiring stronger evidence before accepting a claim as verified.
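The two adjustments can be sketched as a single update step. The function name, step sizes, and the overall-failure ceiling are all assumptions for illustration; only the 3-failure rule and the direction of each adjustment come from the text above:

```python
def apply_learning_signals(routing_weights, failure_counts, threshold,
                           total_failures, failure_ceiling=10,
                           weight_step=0.05, threshold_step=0.01):
    """routing_weights: {model: weight}; failure_counts: {(model, domain): n}.
    Step sizes and ceiling are illustrative defaults, not production values."""
    # 1. Decrement routing weight for models with 3+ failures in a domain.
    for (model, domain), failures in failure_counts.items():
        if failures >= 3:
            routing_weights[model] = max(0.0, routing_weights[model] - weight_step)
    # 2. Raise the SOAR-V sensitivity threshold when total failures are high.
    if total_failures > failure_ceiling:
        threshold = min(1.0, threshold + threshold_step)
    return routing_weights, threshold
```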
How to Read the Results#
Interpreting Scores
- Grounding Rate > 70%: Model reliably backs claims with real sources. Safe for research use.
- Grounding Rate 40–70%: Mixed. Use for drafting but verify important claims manually.
- Grounding Rate < 40%: Most claims unverified. High hallucination risk for research tasks.
- Citation Coverage > Grounding Rate by >20%: Model cites sources that don't actually support its claims.
- For False Confidence scores: See the Hallucination Detection benchmark.
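The reading guide above can be mirrored as a small helper, e.g. for flagging models on a dashboard. A hypothetical sketch; the function and its messages are not part of the benchmark itself:

```python
def interpret(grounding_rate, citation_coverage):
    """Map leaderboard scores (0-1) to the reading guide's advice."""
    notes = []
    if grounding_rate > 0.7:
        notes.append("Reliably backs claims with real sources; safe for research use.")
    elif grounding_rate >= 0.4:
        notes.append("Mixed; use for drafting but verify important claims manually.")
    else:
        notes.append("Most claims unverified; high hallucination risk.")
    # Large coverage-over-grounding gap: sources cited but not supportive.
    if citation_coverage - grounding_rate > 0.2:
        notes.append("Cites sources that don't actually support its claims.")
    return notes
```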