Updated Daily at 4:00 AM GMT

Hallucination Detection Benchmark Leaderboard

Daily testing of AI model factual accuracy — measuring Hallucination Rate, Grounding Rate, and Unsupported Claim Rate across 50 research prompts using NLI verification.

Which AI model hallucinates the least? Find out with verified daily benchmarks.

50 Research Prompts
Daily Test Frequency
NLI Verification Method
8 Time Periods

Model Leaderboard

Rankings by Hallucination Rate (lowest = best). Rolling averages across all daily benchmark tests.

# · Model · Hallucination Rate · Tokens · Cost

Hallucination Rate (lower = better): ≤5% excellent, ≤10% good, ≤20% fair, >20% poor. Tokens and Cost are per-run averages.
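The quality bands above can be expressed as a small helper; the function name and band labels here are illustrative, not part of the benchmark's published code.

```python
def rate_band(hallucination_rate: float) -> str:
    """Map a hallucination rate (fraction, 0.0-1.0) to the leaderboard's quality band."""
    if hallucination_rate <= 0.05:
        return "excellent"
    if hallucination_rate <= 0.10:
        return "good"
    if hallucination_rate <= 0.20:
        return "fair"
    return "poor"
```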

Performance Insights

Discover hallucination trends, compare models across time periods, and find which AI is most factually reliable. Data updates as new benchmarks complete.

Recent Daily Results

Click any result to see per-model hallucination details

How It Works

Our automated HalluHard system ensures fair, consistent testing across all models

Domain-Calibrated Prompts

50 structured prompts across historical, scientific, economic, medical, and legal domains — each designed to elicit 4–8 verifiable factual claims per response.
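A prompt in such a benchmark might be modeled as a small record like the sketch below; the field names and example text are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ResearchPrompt:
    domain: str             # e.g. "historical", "scientific", "medical" (hypothetical labels)
    text: str               # the prompt sent to each model
    expected_claims: range  # target number of verifiable claims per response (4-8)

# Illustrative example prompt, not drawn from the real prompt set.
prompt = ResearchPrompt(
    domain="scientific",
    text="Summarize the key findings of the 1919 Eddington eclipse expedition.",
    expected_claims=range(4, 9),  # 4 through 8 inclusive
)
```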

Claim Extraction + NLI

Each model response is analyzed for factual claims. Every claim receives a verdict: ENTAILMENT (grounded), CONTRADICTION (hallucination), or NO_CITATION (unsupported).
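Tallying per-claim verdicts into the three headline metrics can be sketched as follows; the verdict list is hypothetical sample data, and the metric names mirror those used on this page.

```python
from collections import Counter

# Hypothetical per-claim verdicts for one model response, using the
# three verdict labels described above.
verdicts = ["ENTAILMENT", "ENTAILMENT", "CONTRADICTION", "NO_CITATION", "ENTAILMENT"]

counts = Counter(verdicts)
total = len(verdicts)

grounding_rate = counts["ENTAILMENT"] / total          # grounded claims
hallucination_rate = counts["CONTRADICTION"] / total   # hallucinated claims
unsupported_rate = counts["NO_CITATION"] / total       # unsupported claims
```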

Rolling Averages

Results are aggregated into rolling averages over 8 time periods (1d → all-time). Hallucination Rate = CONTRADICTION / total claims. Lower is better.
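One way to compute the rolling Hallucination Rate over a window of daily results is to pool claim counts before dividing, as in this sketch; the per-day tuples are illustrative data, not real benchmark output.

```python
def rolling_hallucination_rate(results):
    """Aggregate (contradiction_count, total_claim_count) pairs across days,
    then compute CONTRADICTION / total claims for the whole window."""
    contradictions = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return contradictions / total if total else 0.0

# Three hypothetical days of results: (contradicted claims, total claims).
daily_results = [(2, 40), (1, 38), (3, 45)]
window_rate = rolling_hallucination_rate(daily_results)
```

Pooling counts before dividing weights each day by its claim volume, which differs from averaging the per-day rates; either choice is defensible, and the page does not say which the benchmark uses.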

Use the model with the lowest hallucination rate

PromptReports.ai routes your research through the model with the lowest hallucination rate for your domain — verified daily by this benchmark.