# AI Model Benchmark Suite
Five automated daily benchmarks that measure model quality, citation grounding, hallucination rate, source authority, and writing style — all using the same production verification pipeline that powers PromptReports.ai reports.
## What are Daily Benchmarks?
The PromptReports.ai benchmark suite runs five automated tests every day at midnight GMT. Each suite measures a different dimension of AI model quality using the same infrastructure that PromptReports.ai uses in production — real NLI-backed claim verification, RSI source authority scoring, and the stop-slop prose detector. All data is free and public.
- **5 Benchmark Suites:** General Intelligence, Claim Verification, Source Tracing, Hallucination Detection, and Writing Quality.
- **Daily Testing:** Fresh benchmark results every day at midnight GMT, automatically.
- **Production Methodology:** Benchmark scoring uses the same NLI pipeline, RSI scoring, and stop-slop detector as live reports.
- **Self-Improving:** Benchmark failure signals feed a nightly learning loop that tightens model routing and verification thresholds.
- **Open Data:** All benchmark results are free and public.
## The Five Benchmark Suites
Each suite is independent — models are scored on different dimensions, and a model can rank highly in one suite while performing poorly in another. Use multiple suites together for a complete picture.
| Suite | Primary Question | Key Metric | Scale |
|---|---|---|---|
| General Intelligence | How good is the overall response quality? | Quality Score (LLM judge) | 0–100 |
| Claim Verification | Are the model's claims backed by real, verified sources? | Grounding Rate | 0–100% |
| Source Tracing | Are cited sources authoritative, valid, and aligned with claims? | Source Authority Score | 0–100 |
| Hallucination Detection | Does the model fabricate facts? | Hallucination Rate (lower = better) | 0–100% |
| Writing Quality | Is the prose clean, direct, and free of AI slop? | Pass Rate | 0–100% |
### General Intelligence
Tests models against a pool of 50 curated prompts spanning knowledge, reasoning, mathematics, coding, analysis, and creativity. Each day, one prompt from the rotation is sent to all active models simultaneously. An LLM judge (or multi-judge panel) scores each response against a detailed rubric covering relevance, accuracy, coherence, completeness, and depth. Latency, tokens per second, and cost are also recorded.
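The exact aggregation isn't spelled out here, but a minimal sketch of how per-dimension judge scores could roll up into the 0–100 Quality Score, assuming equal weights for every dimension and judge:

```python
# A minimal sketch of rubric aggregation, assuming equal weights across
# dimensions and judges (the production weighting is not published here).
from statistics import mean

RUBRIC_DIMENSIONS = ("relevance", "accuracy", "coherence", "completeness", "depth")

def quality_score(judge_scores: list[dict[str, float]]) -> float:
    """judge_scores: one dict per judge, each dimension scored 0-100."""
    per_judge = [mean(judge[d] for d in RUBRIC_DIMENSIONS) for judge in judge_scores]
    return round(mean(per_judge), 1)

# Example: a two-judge panel scoring a single response.
panel = [
    {"relevance": 92, "accuracy": 88, "coherence": 95, "completeness": 80, "depth": 75},
    {"relevance": 90, "accuracy": 85, "coherence": 93, "completeness": 78, "depth": 70},
]
print(quality_score(panel))  # 84.6
```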
Read the full General Intelligence scoring guide →
### Claim Verification
Measures how well models ground factual claims in real, retrievable sources. Each model is prompted with a research question that requires citing specific URLs. The benchmark extracts claims from the response and runs them through the production HalluHard evidence retrieval pipeline — fetching cited URLs, running NLI entailment analysis, and classifying each claim as ENTAILMENT, CONTRADICTION, or NO_CITATION.
Three scores are produced: Citation Coverage, Grounding Rate, and False Confidence. These match exactly what PromptReports.ai measures on live Intelligence Briefings.
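As an illustration, here is a sketch of the three scores computed from per-claim NLI labels. Citation Coverage and Grounding Rate follow fairly directly from the classification above; reading False Confidence as "cited yet contradicted by the fetched evidence" is an assumption:

```python
# Sketch of the three Claim Verification scores over per-claim NLI labels.
# False Confidence as "cited but contradicted" is an assumed reading.
from collections import Counter

def verification_scores(labels: list[str]) -> dict[str, float]:
    """labels: 'ENTAILMENT', 'CONTRADICTION', or 'NO_CITATION' per claim."""
    n = len(labels)
    counts = Counter(labels)
    cited = n - counts["NO_CITATION"]
    return {
        "citation_coverage": 100 * cited / n,                    # claims citing any source
        "grounding_rate": 100 * counts["ENTAILMENT"] / n,        # claims backed by evidence
        "false_confidence": 100 * counts["CONTRADICTION"] / n,   # cited yet contradicted
    }

print(verification_scores(["ENTAILMENT"] * 7 + ["CONTRADICTION"] + ["NO_CITATION"] * 2))
# {'citation_coverage': 80.0, 'grounding_rate': 70.0, 'false_confidence': 10.0}
```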
Read the full Claim Verification scoring guide →
### Source Tracing
Evaluates whether models can find and cite authoritative, live sources for domain-specific research questions (legal, scientific, financial, medical, and policy domains). URLs are extracted from responses, validated against live servers, scored using the RSI (Reference Source Index) for domain authority, and checked for citation alignment against the original claims via a lightweight LLM judge.
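A hypothetical sketch of that flow, where `rsi_score` and `alignment_ok` stand in for the internal RSI scorer and the lightweight LLM alignment judge (neither is public), and dead or misaligned citations score zero:

```python
# Hypothetical Source Tracing flow: extract URLs, check they are live,
# then score authority. rsi_score/alignment_ok are placeholder callables.
import re
import requests

URL_RE = re.compile(r"https?://\S+")

def is_live(url: str) -> bool:
    """A cited URL only counts if a live server answers for it."""
    try:
        return requests.head(url, timeout=5, allow_redirects=True).status_code < 400
    except requests.RequestException:
        return False

def source_authority(response_text: str, rsi_score, alignment_ok) -> float:
    urls = URL_RE.findall(response_text)
    if not urls:
        return 0.0
    per_url = [
        rsi_score(u) if is_live(u) and alignment_ok(u, response_text) else 0.0
        for u in urls
    ]
    return sum(per_url) / len(per_url)  # 0-100 Source Authority Score
```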
Read the full Source Tracing scoring guide →
### Hallucination Detection
Uses the same production NLI pipeline as Claim Verification but focuses specifically on detecting fabricated facts. Claims that are asserted confidently but contradicted by retrieved evidence are classified as CONTRADICTION — the hallucination signal. The hallucination rate is the fraction of total claims that fall into this category (lower is better).
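Because the definition is stated directly (contradicted claims over total claims), the metric reduces to a few lines:

```python
# Hallucination rate: share of all extracted claims labeled CONTRADICTION.
def hallucination_rate(labels: list[str]) -> float:
    if not labels:
        return 0.0
    return 100 * labels.count("CONTRADICTION") / len(labels)

print(hallucination_rate(["ENTAILMENT"] * 18 + ["CONTRADICTION"] * 2))  # 10.0
```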
Read the full Hallucination Detection scoring guide →
### Writing Quality
Runs model output from 50 fixed test prompts through the stop-slop prose detector, PromptReports.ai's internal filter that catches AI writing patterns that reduce trust and readability. Prompts are split between intentionally bad samples (hedging, filler phrases, generic adjectives, passive voice, em-dash overuse, WH-openers, binary contrasts) and clean analyst-style controls. Models pass when their output scores above 70/100 on a five-dimension rubric: Directness, Rhythm, Trust, Authenticity, and Density.
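A minimal sketch of the pass gate and Pass Rate, assuming the five dimensions are each scored 0–100 and combined with equal weight (the detector's real weighting is internal):

```python
# Sketch of the Writing Quality gate: equal-weight rubric mean vs. 70/100.
from statistics import mean

DIMENSIONS = ("directness", "rhythm", "trust", "authenticity", "density")
PASS_THRESHOLD = 70

def passes(scores: dict[str, float]) -> bool:
    return mean(scores[d] for d in DIMENSIONS) > PASS_THRESHOLD

def pass_rate(all_scores: list[dict[str, float]]) -> float:
    """Share of the 50 fixed prompts whose output clears the gate."""
    return 100 * sum(passes(s) for s in all_scores) / len(all_scores)
```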
Read the full Writing Quality scoring guide →
## How the Learning Loop Works
Every night after benchmarks complete, a learning loop processes all unresolved failure signals and updates the verification pipeline automatically (a code sketch of one nightly pass follows the list):
1. **Failure Signals Collected**
2. **Model Routing Updated**
3. **Verification Thresholds Adjusted**
4. **Stop-Slop Patterns Updated**
5. **Audit Log Written**
6. **Safety Bounds** enforced on every automatic change
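The loop's internals aren't published, but a hypothetical sketch shows the shape of one nightly pass: collect signals, propose an adjustment, clamp it inside the safety bounds, and write the audit entry. The signal schema, step size, and bound values below are illustrative assumptions:

```python
# Hypothetical nightly learning-loop pass. Only the overall shape
# (collect -> adjust -> clamp -> log) comes from the docs above;
# schema, step size, and bounds are invented for illustration.
import json
import time

SAFETY_BOUNDS = (0.50, 0.95)  # assumed hard limits on the NLI threshold
STEP = 0.01                   # assumed per-night adjustment step

def nightly_threshold_update(threshold: float, signals: list[dict]) -> float:
    # Ungrounded claims slipped through: tighten verification; otherwise relax.
    misses = sum(1 for s in signals if s.get("kind") == "ungrounded_claim")
    proposed = threshold + STEP if misses else threshold - STEP
    # Safety bounds: automation can never push the threshold outside limits.
    clamped = min(max(proposed, SAFETY_BOUNDS[0]), SAFETY_BOUNDS[1])
    # Audit log: every automatic change is recorded, never silent.
    print(json.dumps({"ts": time.time(), "old": threshold,
                      "new": clamped, "signals": len(signals)}))
    return clamped
```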
## Time Periods
All leaderboards support eight rolling average windows:
| Period | Best For |
|---|---|
| 1 Day | Most recent single benchmark |
| 1 Week (7d) | Catching recent regressions or improvements |
| 30 Days | Stable comparison — recommended starting point |
| 60 Days | Medium-term reliability view |
| 90 Days | Quarterly patterns |
| 180 Days | Six-month consistency |
| 1 Year (365d) | Year-over-year comparison |
| All Time | Full historical track record |
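Mechanically, each window value is just the mean of the daily scores whose run date falls inside it. A small sketch, assuming one score per run date:

```python
# Rolling-window leaderboard value: mean of daily scores inside the window.
from datetime import date, timedelta
from statistics import mean

def window_average(daily: dict[date, float], days: int | None) -> float | None:
    """daily maps run date -> score; days=None means the All Time window."""
    if days is None:
        return mean(daily.values())
    cutoff = date.today() - timedelta(days=days)
    in_window = [score for d, score in daily.items() if d >= cutoff]
    return mean(in_window) if in_window else None  # None: no runs in window
```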
## Data Freshness & Reliability
Benchmark data is generated fresh every day at midnight GMT. Results are available within 30 minutes of the run completing. Historical data is retained indefinitely — all past benchmark runs remain accessible via the leaderboard time period selector.
| Guarantee | Detail |
|---|---|
| Update frequency | Daily at midnight GMT, 7 days a week |
| Availability | Results published within 30 minutes of run completion |
| Data retention | All historical results retained indefinitely |
| Methodology consistency | Scoring rubrics and pipeline versions are pinned per run and recorded in metadata |
| Reproducibility | Each run records model version, prompt hash, and pipeline config for full reproducibility |
| Staleness indicator | Leaderboard shows "Last updated X hours ago" badge; stale data (>36h) is flagged |
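As an illustration of the reproducibility row, the pinned per-run record only needs a few fields. The field names here are assumptions; the fields themselves (model version, prompt hash, pipeline config) come from the table:

```python
# Illustrative per-run metadata record; names are assumed, the pinned
# fields (model version, prompt hash, pipeline config) are from the table.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class RunMetadata:
    model_version: str    # exact model build that was benchmarked
    prompt_hash: str      # digest of the prompt set used for the run
    pipeline_config: str  # pinned scoring-pipeline version / config id

def hash_prompts(prompts: list[str]) -> str:
    return hashlib.sha256("\n".join(prompts).encode("utf-8")).hexdigest()
```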
## Using Benchmark Data
- **Choose the Right Model:** Use Claim Verification and Hallucination Detection to shortlist models for research tasks where accuracy matters.
- **Track Reliability Over Time:** Use 30-day or 90-day views to identify models with consistent quality, not just good days.
- **Optimize Cost:** Cross-reference the General Intelligence quality score against cost-per-1k to find the best value model for your volume.
- **Detect Regressions:** Watch the 7-day leaderboard for sudden drops; model providers update weights without announcements.