Claim Verification Benchmark
Measures how well AI models ground factual claims in real, retrievable sources using the same production NLI (Natural Language Inference) pipeline that PromptReports.ai uses for live report verification.
What It Measures#
The Claim Verification benchmark tests whether AI models can produce factual claims that are genuinely supported by real, retrievable sources — not just plausible-sounding text. For each model response, the benchmark extracts individual factual claims, retrieves evidence for each one using real web searches and URL fetching, runs NLI (Natural Language Inference) to determine whether the evidence actually supports the claim, and produces two scores: Citation Coverage and Grounding Rate.
Same Pipeline as Live Reports
Why This Matters#
Standard AI benchmarks measure response quality or coherence — they do not check whether claims are actually true. A model that writes fluent, confident prose about fabricated statistics scores well on traditional benchmarks but fails catastrophically in research contexts.
Claim Verification tests the property that matters most for research reports: is this model producing claims that can be verified against real evidence? Models that score high on Citation Coverage and Grounding Rate produce reports with fewer corrections and stronger source backing.
Real Evidence Retrieval
The benchmark fetches actual URLs and runs web searches — no heuristics, no simulated data.
NLI Entailment
Each claim is evaluated using Natural Language Inference: does this source actually support this claim?
Two Dimensions
Citation Coverage and Grounding Rate give a complete picture of source fidelity. False Confidence is measured by the Hallucination Detection benchmark.
The NLI Evaluation Pipeline#
For each benchmark result, the following steps are executed by the HalluHard benchmark evaluator:
1. Claim Extraction
2. Citation URL Extraction
3. Evidence Retrieval — Attempt 1 (Direct Fetch)
4. Evidence Retrieval — Attempt 2 (Corroborating Search)
5. NLI Entailment Analysis
6. Verdict Assignment
7. Score Computation
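The steps above can be sketched as a single loop. This is a minimal illustration, not the actual HalluHard evaluator: the claim shape and the `fetch`, `search`, `nli`, and `assign_verdict` callables are assumed interfaces invented for this sketch.

```python
def run_pipeline(claims, fetch, search, nli, assign_verdict):
    """claims: list of {'text': ..., 'url': ... or None}.
    fetch/search return evidence text or None; nli returns a 0-1 score.
    All callables are injected, illustrative interfaces."""
    verdicts = []
    for claim in claims:
        # Attempt 1: direct fetch of the inline citation URL, if any.
        evidence = fetch(claim["url"]) if claim["url"] else None
        # Attempt 2: corroborating web search when the fetch yields nothing.
        if evidence is None:
            evidence = search(claim["text"])
        # NLI entailment analysis only runs when evidence was retrieved.
        score = nli(evidence, claim["text"]) if evidence is not None else None
        verdicts.append(assign_verdict(score, claim["text"], evidence is not None))
    return verdicts
```

The retrieval and scoring stages are injected as callables so the control flow (fetch, then search, then NLI, then verdict) stays visible.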
The Four-Way Verdict System#
| Verdict | Meaning | Evidence Condition | Counts Toward |
|---|---|---|---|
| ENTAILMENT | The claim is supported by real, retrieved evidence. | NLI score ≥ SOAR-V sensitivity threshold (default 0.75). Source was fetched and text entails the claim. | Grounding Rate ↑ · Citation Coverage ↑ |
| NEUTRAL | Evidence was found but does not clearly support or contradict the claim. | NLI score between 0.3 and the threshold. Inconclusive evidence. | Neither grounding nor failure |
| CONTRADICTION | The model asserted this confidently but retrieved evidence contradicts it. | NLI score < 0.3 AND the claim contains high-confidence language (e.g. "definitively", "proven", "certainly"). | False Confidence ↑ · Grounding Rate does not count |
| NO_CITATION | No supporting evidence was found or the claim had no citation. | No URL present and web search returned insufficient evidence. Or: URL was inaccessible (CITATION_INACCESSIBLE). | Ungrounded count ↑ |
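The evidence conditions in the table can be read as a small decision function. A minimal sketch, assuming the default thresholds from the table; the marker list is illustrative, and the fallthrough to NEUTRAL for low scores without high-confidence language is an assumption, since the table leaves that case unstated:

```python
ENTAILMENT_THRESHOLD = 0.75    # SOAR-V sensitivity default, per the table
CONTRADICTION_CEILING = 0.3
# Illustrative markers; the production list is not published here.
CONFIDENT_MARKERS = ("definitively", "proven", "certainly")

def assign_verdict(nli_score, claim_text, evidence_found):
    if not evidence_found:
        return "NO_CITATION"
    if nli_score >= ENTAILMENT_THRESHOLD:
        return "ENTAILMENT"
    if nli_score < CONTRADICTION_CEILING and any(
        m in claim_text.lower() for m in CONFIDENT_MARKERS
    ):
        return "CONTRADICTION"
    return "NEUTRAL"
```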
SOAR-V Sensitivity Threshold
The Two Benchmark Metrics#
Citation Coverage#
Citation Coverage measures what fraction of extracted claims have any associated supporting source — whether the URL was inline in the response or found through a corroborating web search.
Formula
A claim "has evidence" if its verdict is ENTAILMENT or if it has a non-null evidence URL, even if the NLI analysis was inconclusive.
Range: 0.0 (no claims have any source) → 1.0 (every claim has at least one source). Displayed as a percentage on the leaderboard (e.g. 0.82 = 82%).
Citation Coverage penalizes models that make many assertions without providing any sources. A model that includes five citations in a 20-claim response has 25% coverage — very low. Cite more, cover more.
Grounding Rate#
Grounding Rate is the stricter metric. It measures what fraction of all claims are positively entailed by retrieved evidence — meaning the source was fetched, the text was analyzed, and the NLI model confirmed the claim is supported.
Formula
A claim is ENTAILMENT if and only if the NLI entailment score ≥ SOAR-V sensitivity threshold (typically 0.75) after fetching and analyzing the cited or retrieved source.
Range: 0.0 (nothing grounded) → 1.0 (every claim verified against a real source). Displayed as a percentage on the leaderboard.
Grounding Rate can never exceed Citation Coverage: any claim with an ENTAILMENT verdict also counts as covered, so Grounding Rate ≤ Citation Coverage always holds. The gap between the two reveals how many cited sources couldn't be verified (inaccessible URLs, paywalled content, etc.).
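Both formulas can be sketched together. This is an illustrative reading of the definitions above, not the evaluator's code; the verdict list and citation flags are assumed inputs:

```python
def compute_scores(verdicts, cited_flags):
    """verdicts: one verdict string per claim.
    cited_flags: True where the claim had a non-null evidence URL."""
    n = len(verdicts)
    # Grounding Rate: fraction positively entailed by retrieved evidence.
    grounded = sum(v == "ENTAILMENT" for v in verdicts)
    # Citation Coverage: ENTAILMENT, or any evidence URL even if inconclusive.
    cited = sum(v == "ENTAILMENT" or c for v, c in zip(verdicts, cited_flags))
    return {
        "groundingRate": grounded / n if n else 0.0,
        "citationCoverage": cited / n if n else 0.0,
    }
```

Because every ENTAILMENT also counts toward `cited`, the sketch makes the ordering Grounding Rate ≤ Citation Coverage visible directly in the code.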
False Confidence is in Hallucination Detection
Scoring Rubric#
The benchmark scores and supporting counts are stored in a ClaimVerificationRubric JSON object written to the BenchmarkJudgment.rubricUsed field after each evaluation. The rubric includes:
| Field | Type | Description |
|---|---|---|
| benchmarkType | string | "CLAIM_VERIFICATION" |
| scores.citationCoverage | float (0–1) | Fraction of claims with any supporting source |
| scores.groundingRate | float (0–1) | Fraction of claims positively entailed by evidence (NLI ≥ threshold) |
| scores.claimCount | integer | Total claims extracted from response |
| scores.groundedCount | integer | Claims that received ENTAILMENT verdict |
| scores.citedCount | integer | Claims with any evidence (ENTAILMENT or non-null evidence) |
| scores.ungroundedCount | integer | Claims that received CONTRADICTION or NO_CITATION verdict |
| scores.domain | string | Domain label for the prompt (e.g. "general", "medical") |
| evaluatedAt | ISO string | Timestamp when evaluation ran |
| responseLength | integer | Character count of the model response |
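A rubric object with the fields above might look like the following. All values are made up for demonstration; only the field names and types come from the table:

```python
import json

# Illustrative ClaimVerificationRubric as written to
# BenchmarkJudgment.rubricUsed; every value here is an example.
rubric = {
    "benchmarkType": "CLAIM_VERIFICATION",
    "scores": {
        "citationCoverage": 0.8,   # 16 of 20 claims had some source
        "groundingRate": 0.6,      # 12 of 20 claims reached ENTAILMENT
        "claimCount": 20,
        "groundedCount": 12,
        "citedCount": 16,
        "ungroundedCount": 4,      # CONTRADICTION + NO_CITATION verdicts
        "domain": "general",
    },
    "evaluatedAt": "2025-01-15T03:00:00Z",
    "responseLength": 4812,
}
print(json.dumps(rubric, indent=2))
```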
Learning Signals#
Every CONTRADICTION and NO_CITATION verdict generates a BenchmarkLearningSignal record. These signals flow into the nightly learning loop which uses them to:
1. Decrement the routing weight of models with 3+ failures in a domain, reducing how often PromptReports.ai routes production research tasks to that model.
2. Raise the SOAR-V sensitivity threshold when total HALLUCINATION_DETECTION and CLAIM_VERIFICATION failures are high — requiring stronger evidence before accepting a claim as verified.
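The two adjustments can be sketched as a single update step. The function name, step sizes, and the overall-failure ceiling are all assumptions for illustration; only the 3-failure rule and the direction of each adjustment come from the text above:

```python
def apply_learning_signals(routing_weights, failure_counts, threshold,
                           total_failures, failure_ceiling=10,
                           weight_step=0.05, threshold_step=0.01):
    """routing_weights: {model: weight}; failure_counts: {(model, domain): n}.
    Step sizes and ceiling are illustrative defaults, not production values."""
    # 1. Decrement routing weight for models with 3+ failures in a domain.
    for (model, domain), failures in failure_counts.items():
        if failures >= 3:
            routing_weights[model] = max(0.0, routing_weights[model] - weight_step)
    # 2. Raise the SOAR-V sensitivity threshold when total failures are high.
    if total_failures > failure_ceiling:
        threshold = min(1.0, threshold + threshold_step)
    return routing_weights, threshold
```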
How to Read the Results#
Interpreting Scores
- Grounding Rate > 70%: Model reliably backs claims with real sources. Safe for research use.
- Grounding Rate 40–70%: Mixed. Use for drafting but verify important claims manually.
- Grounding Rate < 40%: Most claims unverified. High hallucination risk for research tasks.
- Citation Coverage > Grounding Rate by >20%: Model cites sources that don't actually support its claims.
- For False Confidence scores: See the Hallucination Detection benchmark.
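The reading guide above can be mirrored as a small helper, e.g. for flagging models on a dashboard. A hypothetical sketch; the function and its messages are not part of the benchmark itself:

```python
def interpret(grounding_rate, citation_coverage):
    """Map leaderboard scores (0-1) to the reading guide's advice."""
    notes = []
    if grounding_rate > 0.7:
        notes.append("Reliably backs claims with real sources; safe for research use.")
    elif grounding_rate >= 0.4:
        notes.append("Mixed; use for drafting but verify important claims manually.")
    else:
        notes.append("Most claims unverified; high hallucination risk.")
    # Large coverage-over-grounding gap: sources cited but not supportive.
    if citation_coverage - grounding_rate > 0.2:
        notes.append("Cites sources that don't actually support its claims.")
    return notes
```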