How We Score AI Models
Our benchmark leaderboard uses a multi-dimensional LLM-as-Judge evaluation system with weighted rubrics to produce objective, transparent scores you can trust.
Why Transparent Scoring Matters
Most AI leaderboards use opaque evaluation methods or single-metric comparisons. Our system provides dimension-level transparency so you understand exactly why each model scored the way it did.
Full Transparency
See reasoning for every dimension score, not just a final number
Category-Specific Weights
Math tasks weight accuracy differently than creative writing tasks do
Rolling Averages
30-day averages show consistent performance, not lucky one-offs
5 Dimensions of Quality
Every AI response is evaluated on exactly 5 dimensions, each scored from 0 to 100. The first four are shared across every category; the fifth is category-specific.
Accuracy
Factual correctness, no false claims
Completeness
All aspects covered, nothing missing
Clarity
Well-organized, easy to follow
Relevance
Directly addresses the question
Category-Specific
Varies by task type (see below)
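As a rough illustration of how a judged response might be represented, here is a minimal sketch; the class and field names are assumptions for this page, not the leaderboard's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DimensionScore:
    """One of the five rubric dimensions, scored 0-100 with written reasoning."""
    name: str       # "accuracy", "completeness", "clarity", "relevance",
                    # or the category-specific fifth dimension
    score: int      # 0-100
    reasoning: str  # the judge's written justification for this dimension

@dataclass
class JudgedResponse:
    model: str                        # which AI model produced the response
    category: str                     # one of the 9 benchmark categories
    dimensions: list[DimensionScore]  # always exactly five entries
```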
How Scores Are Calculated
Example: Knowledge Category
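The weights below are illustrative placeholders (this page does not publish the real Knowledge weights), but the arithmetic is a simple weighted sum, as in this minimal sketch:

```python
# Hypothetical weights for the Knowledge & Facts category -- illustrative only.
KNOWLEDGE_WEIGHTS = {
    "accuracy": 0.40,
    "completeness": 0.20,
    "clarity": 0.15,
    "relevance": 0.15,
    "category_specific": 0.10,  # e.g. sourcing quality, assumed for Knowledge
}

def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine five 0-100 dimension scores into one 0-100 overall score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(dimension_scores[d] * w for d, w in weights.items())

# A response scoring 92 / 85 / 88 / 95 / 80 on the five dimensions:
overall = weighted_score(
    {"accuracy": 92, "completeness": 85, "clarity": 88,
     "relevance": 95, "category_specific": 80},
    KNOWLEDGE_WEIGHTS,
)
print(overall)  # 89.25
```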
LLM-as-Judge
Claude Sonnet 4 evaluates each response against category-specific rubrics. If the primary judge call fails, Gemini 2.0 Flash is used as a fallback.
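In outline, the fallback is just try-the-primary-then-the-secondary; `judge_with` below is a hypothetical stand-in for whatever API client the pipeline actually uses:

```python
PRIMARY_JUDGE = "claude-sonnet-4"
FALLBACK_JUDGE = "gemini-2.0-flash"

def judge_with(model: str, response: str, rubric: str) -> dict:
    """Hypothetical stand-in: call the given judge model and return
    dimension scores plus reasoning. Wire up a real LLM client here."""
    raise NotImplementedError

def judge_response(response: str, rubric: str) -> dict:
    """Try the primary judge; on any failure (timeout, rate limit,
    malformed output), retry once with the fallback judge."""
    try:
        return judge_with(PRIMARY_JUDGE, response, rubric)
    except Exception:
        return judge_with(FALLBACK_JUDGE, response, rubric)
```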
Rolling Averages
The leaderboard shows 30-day rolling averages by default; you can also switch to 1-day, 7-day, 60-day, 90-day, 180-day, or all-time views.
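A trailing-window average over daily scores is enough to implement this; the sketch below assumes one overall score per model per day:

```python
from datetime import date, timedelta

def rolling_average(daily_scores: dict[date, float],
                    window_days: int = 30,
                    as_of: date | None = None) -> float | None:
    """Average the daily scores inside the trailing window.

    daily_scores maps test date -> that day's overall score for one model.
    Returns None if the model has no scores in the window.
    """
    as_of = as_of or date.today()
    cutoff = as_of - timedelta(days=window_days)
    in_window = [s for d, s in daily_scores.items() if cutoff < d <= as_of]
    return sum(in_window) / len(in_window) if in_window else None
```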
9 Benchmark Categories
Each category weights the five dimensions differently: math tasks weight accuracy heavily, while creative writing prioritizes creativity (see the sketch after this list).
Knowledge & Facts
Mathematical Reasoning
Code Generation
Reasoning & Logic
Creative Writing
Truthfulness
Summarization
Instruction Following
Language Understanding
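To make the contrast concrete, per-category weights can be stored as plain data. The two entries below are illustrative guesses only; the leaderboard's actual weight tables are not published on this page:

```python
# Illustrative only: two of the nine categories, showing how the same five
# dimensions can carry very different weights. "category_specific" is the
# fifth dimension (e.g. proof rigor for math, creativity for creative writing).
CATEGORY_WEIGHTS = {
    "Mathematical Reasoning": {
        "accuracy": 0.50, "completeness": 0.15, "clarity": 0.10,
        "relevance": 0.10, "category_specific": 0.15,
    },
    "Creative Writing": {
        "accuracy": 0.10, "completeness": 0.15, "clarity": 0.20,
        "relevance": 0.15, "category_specific": 0.40,  # creativity
    },
}
```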
What Scores Mean
Why Choose Our Benchmarks
Dimension-Level Reasoning
Don't just see a number: understand exactly why each model scored the way it did, with written reasoning for every dimension.
Task-Appropriate Weights
A math benchmark shouldn't weight creativity. Our category-specific weights ensure fair evaluation for each task type.
Critical Error Detection
Wrong math answers and factual errors automatically cap scores, so no model can score high while making fundamental mistakes (a capping sketch follows).
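Score capping can be a one-line guard; the cap value below is a hypothetical placeholder, since the actual ceiling isn't stated on this page:

```python
CRITICAL_ERROR_CAP = 40  # hypothetical ceiling for responses with critical errors

def apply_critical_error_cap(overall_score: float,
                             has_critical_error: bool) -> float:
    """Clamp the overall score when the judge flags a fundamental mistake,
    e.g. a wrong final answer in math or a clear factual error."""
    return min(overall_score, CRITICAL_ERROR_CAP) if has_critical_error else overall_score
```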
Consistent Daily Testing
A fixed pool of 50 benchmark prompts cycles on a 50-day rotation, with tests run every day. Rolling averages reflect consistent performance, not cherry-picked results.
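Read literally (one prompt per day, cycling through the pool), the rotation could be as simple as indexing by days since a fixed start date; this is an assumed scheme for illustration, not a description of the actual scheduler:

```python
from datetime import date

PROMPTS = [f"prompt_{i:02d}" for i in range(50)]  # placeholder prompt IDs
ROTATION_START = date(2025, 1, 1)  # hypothetical start of the rotation

def todays_prompt(today: date | None = None) -> str:
    """Deterministically pick one of the 50 prompts, cycling every 50 days."""
    today = today or date.today()
    return PROMPTS[(today - ROTATION_START).days % len(PROMPTS)]
```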
See How AI Models Compare
View the live leaderboard with transparent scores, dimension breakdowns, and rolling averages across all benchmark categories.
Updated daily at midnight GMT