AI Model Benchmark Suite

Five automated daily benchmarks that measure model quality, citation grounding, hallucination rate, source authority, and writing style — all using the same production verification pipeline that powers PromptReports.ai reports.

What are Daily Benchmarks?#

The PromptReports.ai benchmark suite runs five automated tests every day at midnight GMT. Each suite measures a different dimension of AI model quality using the same infrastructure that PromptReports.ai uses in production — real NLI-backed claim verification, RSI source authority scoring, and the stop-slop prose detector. All data is free and public.

5 Benchmark Suites

General Intelligence, Claim Verification, Source Tracing, Hallucination Detection, and Writing Quality.

Daily Testing

Fresh benchmark results every day at midnight GMT, automatically.

Production Methodology

Benchmark scoring uses the same NLI pipeline, RSI scoring, and stop-slop detector as live reports.

Self-Improving

Benchmark failure signals feed a nightly learning loop that tightens model routing and verification thresholds.

The Five Benchmark Suites#

Each suite is independent — models are scored on different dimensions, and a model can rank highly in one suite while performing poorly in another. Use multiple suites together for a complete picture.

| Suite | Primary Question | Key Metric | Scale |
| --- | --- | --- | --- |
| General Intelligence | How good is the overall response quality? | Quality Score (LLM judge) | 0–100 |
| Claim Verification | Are the model's claims backed by real, verified sources? | Grounding Rate | 0–100% |
| Source Tracing | Are cited sources authoritative, valid, and aligned with claims? | Source Authority Score | 0–100 |
| Hallucination Detection | Does the model fabricate facts? | Hallucination Rate (lower = better) | 0–100% |
| Writing Quality | Is the prose clean, direct, and free of AI slop? | Pass Rate | 0–100% |

General Intelligence#

Draws on a pool of 50 curated prompts spanning knowledge, reasoning, mathematics, coding, analysis, and creativity. Each day, a rotating prompt from the pool is sent to all active models simultaneously. An LLM judge (or multi-judge panel) scores each response on a detailed rubric covering relevance, accuracy, coherence, completeness, and depth. Latency, tokens per second, and cost are also recorded.
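
A rough sketch of how rubric scores from a judge panel could roll up into the 0–100 Quality Score. The equal weighting of dimensions and the simple averaging across judges are assumptions for illustration, not the production formula.

```python
from statistics import mean

# Rubric dimensions named in the General Intelligence suite.
RUBRIC = ("relevance", "accuracy", "coherence", "completeness", "depth")

def quality_score(judge_scores: list[dict[str, float]]) -> float:
    """Average the rubric dimensions per judge, then average across the panel.
    Assumes equal weights; the production rubric may weight dimensions differently."""
    per_judge = [mean(scores[dim] for dim in RUBRIC) for scores in judge_scores]
    return round(mean(per_judge), 1)

# Example: a two-judge panel scoring a single response.
panel = [
    {"relevance": 92, "accuracy": 88, "coherence": 90, "completeness": 84, "depth": 80},
    {"relevance": 90, "accuracy": 85, "coherence": 93, "completeness": 82, "depth": 78},
]
print(quality_score(panel))  # 86.2
```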

Read the full General Intelligence scoring guide →

Claim Verification#

Measures how well models ground factual claims in real, retrievable sources. Each model is prompted with a research question that requires citing specific URLs. The benchmark extracts claims from the response and runs them through the production HalluHard evidence retrieval pipeline — fetching cited URLs, running NLI entailment analysis, and classifying each claim as ENTAILMENT, CONTRADICTION, or NO_CITATION.

Three scores are produced: Citation Coverage, Grounding Rate, and False Confidence. These match exactly what PromptReports.ai measures on live Intelligence Briefings.
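
A minimal sketch of how the three scores could be derived from per-claim NLI labels. The label names match the pipeline above; the exact formulas, especially for False Confidence, are illustrative assumptions rather than the production definitions.

```python
from collections import Counter

def claim_verification_scores(labels: list[str]) -> dict[str, float]:
    """Illustrative roll-up of per-claim labels into the three suite scores.

    Labels come from the NLI pipeline: ENTAILMENT, CONTRADICTION, NO_CITATION.
    The formulas here are assumptions, not the production math.
    """
    counts = Counter(labels)
    total = len(labels) or 1
    cited = total - counts["NO_CITATION"]                     # claims that point at a source
    unsupported = counts["CONTRADICTION"] + counts["NO_CITATION"]
    return {
        "citation_coverage": 100 * cited / total,             # claims with any citation
        "grounding_rate": 100 * counts["ENTAILMENT"] / total, # claims entailed by their evidence
        "false_confidence": 100 * unsupported / total,        # confident claims lacking support
    }

print(claim_verification_scores(
    ["ENTAILMENT", "ENTAILMENT", "CONTRADICTION", "NO_CITATION"]
))  # {'citation_coverage': 75.0, 'grounding_rate': 50.0, 'false_confidence': 50.0}
```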

Read the full Claim Verification scoring guide →

Source Tracing#

Evaluates whether models can find and cite authoritative, live sources for domain-specific research questions (legal, scientific, financial, medical, and policy domains). URLs are extracted from responses, validated against live servers, scored using the RSI (Reference Source Index) for domain authority, and checked for citation alignment against the original claims via a lightweight LLM judge.
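
A minimal sketch of one way the per-source checks could combine into a 0–100 Source Authority Score. Only the three checks themselves (URL liveness, RSI authority, claim alignment) come from the description above; the field names, the all-or-nothing credit rule, and the simple averaging are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class CitedSource:
    url: str
    is_live: bool   # URL resolved against the live server
    rsi: float      # RSI domain-authority score, 0-100
    aligned: bool   # LLM judge agrees the source supports the claim

def source_authority_score(sources: list[CitedSource]) -> float:
    """Illustrative 0-100 roll-up: dead or misaligned citations earn nothing,
    live and aligned citations contribute their RSI authority."""
    if not sources:
        return 0.0
    credit = [s.rsi if (s.is_live and s.aligned) else 0.0 for s in sources]
    return round(sum(credit) / len(sources), 1)

sources = [
    CitedSource("https://example.gov/report", True, 92.0, True),
    CitedSource("https://example.com/blog", True, 40.0, False),   # off-topic citation
    CitedSource("https://dead.example.org/x", False, 75.0, True), # 404s, no credit
]
print(source_authority_score(sources))  # 30.7
```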

Read the full Source Tracing scoring guide →

Hallucination Detection#

Uses the same production NLI pipeline as Claim Verification but focuses specifically on detecting fabricated facts. Claims that are asserted confidently but contradicted by retrieved evidence are classified as CONTRADICTION — the hallucination signal. The hallucination rate is the fraction of total claims that fall into this category (lower is better).
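
The rate itself is a straightforward ratio; a minimal sketch, assuming per-claim labels from the NLI pipeline:

```python
def hallucination_rate(labels: list[str]) -> float:
    """Share of all extracted claims classified CONTRADICTION (lower is better)."""
    if not labels:
        return 0.0
    return 100 * labels.count("CONTRADICTION") / len(labels)

print(hallucination_rate(["ENTAILMENT", "CONTRADICTION", "ENTAILMENT", "NO_CITATION"]))  # 25.0
```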

Read the full Hallucination Detection scoring guide →

Writing Quality#

Runs 50 fixed test prompts through the stop-slop prose detector — PromptReports.ai's internal filter that catches AI writing patterns that reduce trust and readability. Prompts are split between intentionally bad samples (hedging, filler phrases, generic adjectives, passive voice, em-dash overuse, WH-openers, binary contrasts) and clean analyst-style controls. Models pass when their output scores above 70/100 on a five-dimension rubric: Directness, Rhythm, Trust, Authenticity, and Density.
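
A minimal sketch of the pass logic, assuming the five rubric dimensions are averaged equally; the real detector may weight them differently.

```python
from statistics import mean

SLOP_RUBRIC = ("directness", "rhythm", "trust", "authenticity", "density")
PASS_THRESHOLD = 70  # outputs must score above 70/100 to pass

def passes_stop_slop(scores: dict[str, float]) -> bool:
    """One prompt passes when its averaged five-dimension score clears 70/100."""
    return mean(scores[d] for d in SLOP_RUBRIC) > PASS_THRESHOLD

def pass_rate(per_prompt_scores: list[dict[str, float]]) -> float:
    """Suite-level Pass Rate: the share of the 50 fixed prompts that pass."""
    passed = sum(passes_stop_slop(s) for s in per_prompt_scores)
    return 100 * passed / len(per_prompt_scores)
```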

Read the full Writing Quality scoring guide →

How the Learning Loop Works#

Every night after benchmarks complete, a learning loop processes all unresolved failure signals and updates the verification pipeline automatically:

1. Failure Signals Collected: Each benchmark runner emits BenchmarkLearningSignal records for every CONTRADICTION, NO_CITATION, LOW_URL_VALIDITY, and LOW_AUTHORITY_SCORE failure.
2. Model Routing Updated: Models with 3+ failures in a domain have their routing priority weight decremented (-0.1, minimum 0.3). Models whose 7-day grounding rate exceeds 80% earn a weight increment (+0.1, maximum 2.0). A minimal sketch of this rule follows the list.
3. Verification Thresholds Adjusted: The SOAR-V sensitivity threshold rises when hallucination signals are high (stricter verification) and relaxes when they are absent. The grounding threshold rises when low-authority sources proliferate.
4. Stop-Slop Patterns Updated: New violation patterns from Writing Quality failures are added to the StopSlopPattern library for detection in future reports.
5. Audit Log Written: Every loop run produces a LearningLoopAuditLog entry recording signals processed, routing changes, config updates, and safety skips.
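
A minimal sketch of the routing-weight rule from step 2, using the thresholds and step sizes stated above. The function name, the input shapes, and demotion taking precedence over promotion are assumptions, not the production implementation.

```python
def update_routing_weight(weight: float, domain_failures: int, grounding_rate_7d: float) -> float:
    """Nightly routing adjustment: demote models with repeated failures in a
    domain, promote models with a strong 7-day grounding rate.

    Thresholds and step sizes come from the docs above; everything else is an
    assumption about the production rule.
    """
    if domain_failures >= 3:
        weight = max(0.3, weight - 0.1)   # decrement, floor at 0.3
    elif grounding_rate_7d > 80.0:
        weight = min(2.0, weight + 0.1)   # increment, cap at 2.0
    return round(weight, 2)

print(update_routing_weight(1.0, domain_failures=4, grounding_rate_7d=85.0))  # 0.9
print(update_routing_weight(1.0, domain_failures=0, grounding_rate_7d=85.0))  # 1.1
```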

Time Periods#

All leaderboards support eight rolling average windows:

| Period | Best For |
| --- | --- |
| 1 Day | Most recent single benchmark |
| 1 Week (7d) | Catching recent regressions or improvements |
| 30 Days | Stable comparison (recommended starting point) |
| 60 Days | Medium-term reliability view |
| 90 Days | Quarterly patterns |
| 180 Days | Six-month consistency |
| 1 Year (365d) | Year-over-year comparison |
| All Time | Full historical track record |

Data Freshness & Reliability#

Benchmark data is generated fresh every day at midnight GMT. Results are available within 30 minutes of the run completing. Historical data is retained indefinitely — all past benchmark runs remain accessible via the leaderboard time period selector.

| Guarantee | Detail |
| --- | --- |
| Update frequency | Daily at midnight GMT, 7 days a week |
| Availability | Results published within 30 minutes of run completion |
| Data retention | All historical results retained indefinitely |
| Methodology consistency | Scoring rubrics and pipeline versions are pinned per run and recorded in metadata |
| Reproducibility | Each run records model version, prompt hash, and pipeline config for full reproducibility |
| Staleness indicator | Leaderboard shows a "Last updated X hours ago" badge; stale data (>36h) is flagged |

Using Benchmark Data#

Choose the Right Model

Use Claim Verification and Hallucination Detection to shortlist models for research tasks where accuracy matters.

Track Reliability Over Time

Use 30-day or 90-day views to identify models with consistent quality, not just good days.

Optimize Cost

Cross-reference the General Intelligence quality score against cost-per-1k to find the best value model for your volume.

Detect Regressions

Watch the 7-day leaderboard for sudden drops — model providers update weights without announcements.