Hallucination Detection Benchmark Leaderboard
Daily testing of AI model factual accuracy — measuring Hallucination Rate, Grounding Rate, and Unsupported Claim Rate across 50 research prompts using NLI verification.
Which AI model hallucinates the least? Find out with verified daily benchmarks.
Model Leaderboard
Rankings by Hallucination Rate (lowest = best). Rolling averages across all daily benchmark tests.
| # | Model | Hallucination Rate | Grounding Rate | Unsupported Claim Rate |
|---|-------|--------------------|----------------|------------------------|
Performance Insights
Discover hallucination trends, compare models across time periods, and find which AI is most factually reliable. Data updates as new benchmarks complete.
Recent Daily Results
Click any result to see per-model hallucination details
How It Works
Our automated HalluHard system ensures fair, consistent testing across all models
Domain-Calibrated Prompts
50 structured prompts across historical, scientific, economic, medical, and legal domains — each designed to elicit 4–8 verifiable factual claims per response.
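A prompt set like the one described above can be modeled as a small structured record per prompt. This is a minimal sketch only; the field names (`domain`, `text`, `min_claims`, `max_claims`) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkPrompt:
    """One domain-calibrated research prompt (hypothetical schema)."""
    domain: str          # e.g. "historical", "scientific", "medical"
    text: str            # the research question posed to each model
    min_claims: int = 4  # target lower bound of verifiable claims per response
    max_claims: int = 8  # target upper bound

# Two illustrative entries; the real benchmark uses 50 such prompts.
prompts = [
    BenchmarkPrompt("historical", "Summarize the causes of the 1929 stock market crash."),
    BenchmarkPrompt("scientific", "Explain how mRNA vaccines trigger an immune response."),
]
```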
Claim Extraction + NLI
Each model response is analyzed for factual claims. Every claim receives a verdict: ENTAILMENT (grounded), CONTRADICTION (hallucination), or NO_CITATION (unsupported).
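The three verdicts above can be tallied per response like this. A minimal sketch, assuming verdicts arrive as a flat list of label strings; the function name and input shape are illustrative, not the HalluHard pipeline's actual API.

```python
from collections import Counter

# The three NLI outcomes described above.
VERDICTS = {"ENTAILMENT", "CONTRADICTION", "NO_CITATION"}

def tally_verdicts(claim_verdicts):
    """Count each NLI verdict across the claims of one model response."""
    counts = Counter(v for v in claim_verdicts if v in VERDICTS)
    # Ensure every verdict appears in the result, even with a zero count.
    return {v: counts.get(v, 0) for v in VERDICTS}

# Example: a response from which 5 factual claims were extracted.
tally = tally_verdicts(
    ["ENTAILMENT", "ENTAILMENT", "CONTRADICTION", "NO_CITATION", "ENTAILMENT"]
)
```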
Rolling Averages
Results are aggregated into rolling averages over 8 time periods (1d → all-time). Hallucination Rate = CONTRADICTION / total claims. Lower is better.
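The aggregation above reduces to a ratio of summed counts over the window. A minimal sketch of the stated formula (Hallucination Rate = CONTRADICTION / total claims), assuming each daily result is a `(contradictions, total_claims)` pair; the function names are illustrative.

```python
def hallucination_rate(contradictions, total_claims):
    """Hallucination Rate = CONTRADICTION verdicts / total claims. Lower is better."""
    return contradictions / total_claims if total_claims else 0.0

def rolling_rate(daily_results):
    """Aggregate rate over a window: sum counts first, then divide once,
    so days with more claims carry proportionally more weight."""
    contradictions = sum(d[0] for d in daily_results)
    total = sum(d[1] for d in daily_results)
    return hallucination_rate(contradictions, total)

# Three days of (contradiction count, total claim count) pairs.
daily = [(3, 250), (5, 240), (2, 260)]
window_rate = rolling_rate(daily)  # 10 contradictions over 750 claims
```

Summing counts before dividing (rather than averaging per-day rates) keeps the rolling average claim-weighted, which matters when daily claim totals vary.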
Use the model with the lowest hallucination rate
PromptReports.ai routes your research through the model with the lowest hallucination rate for your domain — verified daily by this benchmark.