Claim Verification Benchmark

Measures how well AI models ground factual claims in real, retrievable sources using the same production NLI (Natural Language Inference) pipeline that PromptReports.ai uses for live report verification.

What It Measures#

The Claim Verification benchmark tests whether AI models can produce factual claims that are genuinely supported by real, retrievable sources — not just plausible-sounding text. For each model response, the benchmark extracts individual factual claims, retrieves evidence for each one using real web searches and URL fetching, runs NLI (Natural Language Inference) to determine whether the evidence actually supports the claim, and produces two scores: Citation Coverage and Grounding Rate.

Why This Matters#

Standard AI benchmarks measure response quality or coherence — they do not check whether claims are actually true. A model that writes fluent, confident prose about fabricated statistics scores well on traditional benchmarks but fails catastrophically in research contexts.

Claim Verification tests the property that matters most for research reports: is this model producing claims that can be verified against real evidence? Models that score high on Citation Coverage and Grounding Rate produce reports with fewer corrections and stronger source backing.

Real Evidence Retrieval

The benchmark fetches actual URLs and runs web searches — no heuristics, no simulated data.

NLI Entailment

Each claim is evaluated using Natural Language Inference: does this source actually support this claim?

Two Dimensions

Citation Coverage and Grounding Rate give a complete picture of source fidelity. False Confidence is measured by the Hallucination Detection benchmark.

The NLI Evaluation Pipeline#

For each benchmark result, the following steps are executed by the HalluHard benchmark evaluator:

1. Claim Extraction

The model's response is split into sentences. Sentences containing numerical data, attributions ("according to"), or factual assertions (is/are/was/were/shows/indicates) are selected as candidate claims. Up to 20 claims are extracted per response.
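The extraction step amounts to a heuristic filter over sentences. The exact sentence splitter and cue patterns the evaluator uses are not published, so the regexes below are illustrative assumptions that mirror the description above (numerical data, attributions, factual verbs, 20-claim cap):

```python
import re

# Illustrative cue patterns; the benchmark's actual regexes are not published.
FACTUAL_CUES = re.compile(
    r"\d|according to|\b(is|are|was|were|shows|indicates)\b", re.IGNORECASE
)

def extract_claims(response: str, max_claims: int = 20) -> list[str]:
    """Split a response into sentences and keep likely factual claims."""
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    claims = [s for s in sentences if FACTUAL_CUES.search(s)]
    return claims[:max_claims]  # cap at 20 claims per response
```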

2. Citation URL Extraction

All inline URLs (https://...) are extracted from the response. The benchmark looks for URLs that appear within 300 characters of each claim to associate them as the claim's intended citation.
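The proximity rule can be sketched as a windowed search around each claim. The 300-character window comes from the description above; the URL pattern and trailing-punctuation cleanup are assumptions:

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")

def find_citation(response: str, claim: str, window: int = 300) -> Optional[str]:
    """Return the first URL within `window` characters of the claim, if any."""
    start = response.find(claim)
    if start == -1:
        return None
    # Search a window of text around the claim for an inline URL.
    lo = max(0, start - window)
    hi = min(len(response), start + len(claim) + window)
    match = URL_RE.search(response, lo, hi)
    # Strip punctuation that often trails a URL in prose.
    return match.group(0).rstrip(".,)") if match else None
```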

3. Evidence Retrieval — Attempt 1 (Direct Fetch)

For claims with an associated citation URL, the URL is fetched directly. The retrieved page content (up to 2,000 characters) is passed to the NLI model for entailment analysis. Timeout: 8 seconds.

4. Evidence Retrieval — Attempt 2 (Corroborating Search)

For claims without a citation URL, or when the direct fetch fails, a web search is performed using key terms extracted from the claim. The best search result is retrieved and analyzed.
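The two-attempt flow can be sketched with injected `fetch` and `search` callables (the real HTTP client and search backend are not specified, so they are left as assumptions here); only the 2,000-character evidence cap and the fetch-then-search fallback come from the steps above:

```python
from typing import Callable, Optional

MAX_EVIDENCE_CHARS = 2_000  # only the first 2,000 characters are analyzed

def retrieve_evidence(
    claim: str,
    citation_url: Optional[str],
    fetch: Callable[[str], Optional[str]],   # e.g. an HTTP GET with an 8 s timeout
    search: Callable[[str], Optional[str]],  # e.g. a web search returning page text
) -> Optional[str]:
    """Attempt 1: direct fetch of the cited URL; Attempt 2: corroborating search."""
    if citation_url:
        page = fetch(citation_url)
        if page:
            return page[:MAX_EVIDENCE_CHARS]
    # No citation, or the direct fetch failed: fall back to a web search
    # seeded with the claim's key terms.
    result = search(claim)
    return result[:MAX_EVIDENCE_CHARS] if result else None
```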

5. NLI Entailment Analysis

The retrieved text is compared against the claim using the CGA (Claim Grounding Analyzer), which computes a similarity score between 0 and 1. A score at or above the SOAR-V sensitivity threshold (default 0.75) yields an ENTAILMENT verdict.

6. Verdict Assignment

Each claim receives a four-way verdict: ENTAILMENT (grounded), CONTRADICTION (evidence contradicts the claim), NO_CITATION (no evidence found), or NEUTRAL (evidence is inconclusive).

7. Score Computation

Citation Coverage and Grounding Rate are computed from verdict counts and written to BenchmarkJudgment.rubricUsed as a ClaimVerificationRubric JSON object.
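The score computation reduces to ratios over verdict counts. This is a minimal sketch; treating every verdict except NO_CITATION as "covered" is an assumption consistent with the citedCount definition in the rubric table below, not a published formula:

```python
from collections import Counter

def compute_scores(verdicts: list[str]) -> dict[str, float]:
    """Compute Citation Coverage and Grounding Rate from per-claim verdicts."""
    counts = Counter(verdicts)
    total = len(verdicts) or 1  # avoid division by zero on empty responses
    grounded = counts["ENTAILMENT"]
    # Covered = some evidence was found (entailed, contradicted, or inconclusive).
    covered = len(verdicts) - counts["NO_CITATION"]
    return {
        "citationCoverage": covered / total,
        "groundingRate": grounded / total,
    }
```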

The Four-Way Verdict System#

| Verdict | Meaning | Evidence Condition | Counts Toward |
| --- | --- | --- | --- |
| ENTAILMENT | The claim is supported by real, retrieved evidence. | NLI score ≥ SOAR-V sensitivity threshold (default 0.75). Source was fetched and text entails the claim. | Grounding Rate ↑ · Citation Coverage ↑ |
| NEUTRAL | Evidence was found but does not clearly support or contradict the claim. | NLI score between 0.3 and the threshold. Inconclusive evidence. | Neither grounding nor failure |
| CONTRADICTION | The model asserted this confidently but retrieved evidence contradicts it. | NLI score < 0.3 AND the claim contains high-confidence language (e.g. "definitively", "proven", "certainly"). | False Confidence ↑ · Grounding Rate does not count |
| NO_CITATION | No supporting evidence was found or the claim had no citation. | No URL present and web search returned insufficient evidence. Or: URL was inaccessible (CITATION_INACCESSIBLE). | Ungrounded count ↑ |
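The verdict logic can be sketched as a small decision function. The 0.75 and 0.3 thresholds and the confidence-language cue words come from the conditions above; routing a low score without confident language to NEUTRAL is an assumption, since the evaluator's exact tie-breaking is not documented:

```python
import re
from typing import Optional

CONFIDENCE_RE = re.compile(r"\b(definitively|proven|certainly)\b", re.IGNORECASE)
NEUTRAL_FLOOR = 0.3  # below this, evidence is treated as contradicting

def assign_verdict(nli_score: Optional[float], claim: str,
                   threshold: float = 0.75) -> str:
    """Map an NLI similarity score to one of the four verdicts."""
    if nli_score is None:
        return "NO_CITATION"      # no evidence was retrieved at all
    if nli_score >= threshold:
        return "ENTAILMENT"       # evidence supports the claim
    if nli_score < NEUTRAL_FLOOR and CONFIDENCE_RE.search(claim):
        return "CONTRADICTION"    # confident claim, contradicting evidence
    return "NEUTRAL"              # inconclusive evidence
```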

The Two Benchmark Metrics#

Citation Coverage#

Citation Coverage measures what fraction of extracted claims have any associated supporting source — whether the URL was inline in the response or found through a corroborating web search.

Citation Coverage penalizes models that make many assertions without providing any sources. A model whose sources cover only five of 20 claims has 25% coverage, which is very low. Cite more, cover more.

Grounding Rate#

Grounding Rate is the stricter metric. It measures what fraction of all claims are positively entailed by retrieved evidence — meaning the source was fetched, the text was analyzed, and the NLI model confirmed the claim is supported.

Grounding Rate can never exceed Citation Coverage: every claim that reaches ENTAILMENT also counts as covered, so Grounding Rate ≤ Citation Coverage always holds. The gap between the two reveals how many cited sources could not be verified (inaccessible URLs, paywalled content, etc.).

Scoring Rubric#

These scores are stored in a ClaimVerificationRubric JSON object written to the BenchmarkJudgment.rubricUsed field after each evaluation. The rubric includes:

| Field | Type | Description |
| --- | --- | --- |
| benchmarkType | string | "CLAIM_VERIFICATION" |
| scores.citationCoverage | float (0–1) | Fraction of claims with any supporting source |
| scores.groundingRate | float (0–1) | Fraction of claims positively entailed by evidence (NLI ≥ threshold) |
| scores.claimCount | integer | Total claims extracted from the response |
| scores.groundedCount | integer | Claims that received an ENTAILMENT verdict |
| scores.citedCount | integer | Claims with any evidence (ENTAILMENT or non-null evidence) |
| scores.ungroundedCount | integer | Claims that received a CONTRADICTION or NO_CITATION verdict |
| scores.domain | string | Domain label for the prompt (e.g. "general", "medical") |
| evaluatedAt | ISO string | Timestamp when the evaluation ran |
| responseLength | integer | Character count of the model response |
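Assembled from the fields above, a rubric object might look like the sketch below. All numeric values are invented for illustration; only the field names and types come from the table:

```python
import json
from datetime import datetime, timezone

# Hypothetical values for a single evaluation, for illustration only.
rubric = {
    "benchmarkType": "CLAIM_VERIFICATION",
    "scores": {
        "citationCoverage": 0.85,
        "groundingRate": 0.70,
        "claimCount": 20,
        "groundedCount": 14,
        "citedCount": 17,
        "ungroundedCount": 3,
        "domain": "general",
    },
    "evaluatedAt": datetime.now(timezone.utc).isoformat(),
    "responseLength": 4210,
}
print(json.dumps(rubric, indent=2))
```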

Learning Signals#

Every CONTRADICTION and NO_CITATION verdict generates a BenchmarkLearningSignal record. These signals flow into the nightly learning loop, which uses them to:

1. Decrement the routing weight of models with 3+ failures in a domain, reducing how often PromptReports.ai routes production research tasks to that model.

2. Raise the SOAR-V sensitivity threshold when total HALLUCINATION_DETECTION and CLAIM_VERIFICATION failures are high — requiring stronger evidence before accepting a claim as verified.
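The routing-weight adjustment in step 1 can be sketched as a simple threshold rule. The 3-failure trigger comes from the text; the penalty size and the floor are assumptions, since the actual update rule is not published:

```python
def update_routing_weight(weight: float, domain_failures: int,
                          penalty: float = 0.1, floor: float = 0.0) -> float:
    """Decrement a model's routing weight once it has 3+ failures in a domain."""
    if domain_failures >= 3:
        return max(floor, weight - penalty)
    return weight
```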

How to Read the Results#

View the live Claim Verification leaderboard →