Transparent AI Evaluation

How We Score AI Models

Our benchmark leaderboard uses a multi-dimensional LLM-as-Judge evaluation system with weighted rubrics to produce objective, transparent scores you can trust.

See Methodology
5 Evaluation Dimensions · 9 Benchmark Categories · 50 Daily Test Prompts

Why Transparent Scoring Matters

Most AI leaderboards use opaque evaluation methods or single-metric comparisons. Our system provides dimension-level transparency so you understand exactly why each model scored the way it did.

Full Transparency

See reasoning for every dimension score, not just a final number

Category-Specific Weights

Math tasks weight accuracy differently than creative writing does

Rolling Averages

30-day averages show consistent performance, not lucky one-offs

Evaluation Framework

5 Dimensions of Quality

Every AI response is evaluated on exactly 5 dimensions, each scored on a 0-100 scale. The fifth dimension is category-specific.

Accuracy: factual correctness, no false claims
Completeness: all aspects covered, nothing missing
Clarity: well-organized, easy to follow
Relevance: directly addresses the question
Category-Specific: varies by task type (see below)
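For illustration only, here is one way a single dimension's result could be represented, mirroring the description above (the record type and field names are assumptions, not the leaderboard's internals):

```python
from dataclasses import dataclass

# Hypothetical record for one dimension's evaluation: a 0-100 score
# plus the judge's written reasoning, as described above.
@dataclass
class DimensionResult:
    name: str        # e.g. "accuracy", or the category-specific fifth dimension
    score: int       # 0-100
    reasoning: str   # the judge's explanation for this dimension's score
```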

The Formula

How Scores Are Calculated

Overall Score = weighted sum of all five dimensions
Score = (Accuracy × W_a) + (Completeness × W_c) + (Clarity × W_cl) + (Relevance × W_r) + (Category × W_cat)

Example: Knowledge Category

Accuracy: 80 × 35% = 28.00
Completeness: 75 × 25% = 18.75
Clarity: 85 × 15% = 12.75
Relevance: 70 × 10% = 7.00
Depth: 82 × 15% = 12.30

Overall: 28 + 18.75 + 12.75 + 7 + 12.3 = 78.80/100
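As a minimal sketch of that arithmetic (variable names are illustrative, not the leaderboard's actual code), the same calculation in Python:

```python
# Recompute the Knowledge-category example above.
# Weights and scores are taken directly from the worked example.
weights = {
    "accuracy": 0.35,
    "completeness": 0.25,
    "clarity": 0.15,
    "relevance": 0.10,
    "depth": 0.15,  # category-specific fifth dimension for Knowledge
}
scores = {
    "accuracy": 80,
    "completeness": 75,
    "clarity": 85,
    "relevance": 70,
    "depth": 82,
}

overall = sum(scores[dim] * w for dim, w in weights.items())
print(f"{overall:.2f}/100")  # -> 78.80/100
```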

LLM-as-Judge

Claude Sonnet 4 evaluates each response against category-specific rubrics. If that evaluation fails, Gemini 2.0 Flash is used as a fallback judge.
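A plausible shape for that primary/fallback flow is sketched below; judge_with() is a placeholder for the real model API call, and none of these names come from the actual pipeline:

```python
# Hypothetical sketch of the judge-with-fallback flow.
def judge_with(model: str, response: str, rubric: str) -> dict:
    # A real implementation would call the model's API and parse
    # per-dimension scores plus written reasoning from its output.
    return {"model": model, "scores": {}, "reasoning": ""}

def evaluate(response: str, rubric: str) -> dict:
    try:
        return judge_with("claude-sonnet-4", response, rubric)
    except Exception:
        # Primary judge failed (timeout, refusal, unparseable output):
        # score the same response with the fallback judge instead.
        return judge_with("gemini-2.0-flash", response, rubric)
```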

Rolling Averages

The leaderboard shows 30-day rolling averages by default; 1-day, 7-day, 60-day, 90-day, 180-day, and all-time views are also available.
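A minimal sketch of a rolling-average computation, assuming per-day scores keyed by date (the real pipeline's storage and windowing are not shown here):

```python
from datetime import date, timedelta

# Illustrative N-day rolling average over per-day scores.
def rolling_average(daily_scores: dict[date, float],
                    window_days: int = 30,
                    as_of: date | None = None) -> float:
    as_of = as_of or date.today()
    cutoff = as_of - timedelta(days=window_days)
    # Keep only scores inside the trailing window, then average them.
    window = [s for d, s in daily_scores.items() if cutoff < d <= as_of]
    return sum(window) / len(window) if window else 0.0
```

Swapping window_days for 1, 7, 60, 90, or 180 yields the other leaderboard views.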

Category Weights

9 Benchmark Categories

Each category weights the dimensions differently: math tasks weight accuracy heavily, while creative writing prioritizes creativity. The full weight tables follow, with a configuration sketch after them.

Knowledge & Facts

Accuracy 35% · Completeness 25% · Clarity 15% · Relevance 10% · Depth of Explanation 15%
Critical error rule: factual error → max 40 accuracy

Mathematical Reasoning

Accuracy 40% · Completeness 20% · Clarity 15% · Relevance 5% · Mathematical Rigor 20%
Critical error rule: wrong final answer → max 25 accuracy

Code Generation

Accuracy 35% · Completeness 20% · Clarity 15% · Relevance 5% · Code Quality 25%
Critical error rule: bug in output → max 30 accuracy

Reasoning & Logic

Accuracy 35% · Completeness 20% · Clarity 15% · Relevance 10% · Logical Rigor 20%
Critical error rule: wrong conclusion → max 30 accuracy

Creative Writing

Accuracy 10% · Completeness 20% · Clarity 20% · Relevance 10% · Creativity & Voice 40%

Truthfulness

Accuracy 40% · Completeness 15% · Clarity 15% · Relevance 10% · Epistemic Honesty 20%
Critical error rule: affirms myth as true → max 20 accuracy

Summarization

Accuracy 30% · Completeness 25% · Clarity 15% · Relevance 10% · Conciseness 20%

Instruction Following

Accuracy 20% · Completeness 25% · Clarity 15% · Relevance 10% · Constraint Compliance 30%

Language Understanding

Accuracy 30% · Completeness 20% · Clarity 20% · Relevance 10% · Linguistic Sophistication 20%
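The weight tables and critical-error caps above could be encoded as plain data. The sketch below is an assumption about structure (names like CATEGORY_WEIGHTS and capped_accuracy are illustrative); the weights and caps themselves are taken directly from the tables:

```python
# Weights per category, in percent; each row must sum to 100.
CATEGORY_WEIGHTS = {
    "knowledge":     {"accuracy": 35, "completeness": 25, "clarity": 15,
                      "relevance": 10, "depth_of_explanation": 15},
    "math":          {"accuracy": 40, "completeness": 20, "clarity": 15,
                      "relevance": 5,  "mathematical_rigor": 20},
    "code":          {"accuracy": 35, "completeness": 20, "clarity": 15,
                      "relevance": 5,  "code_quality": 25},
    "reasoning":     {"accuracy": 35, "completeness": 20, "clarity": 15,
                      "relevance": 10, "logical_rigor": 20},
    "creative":      {"accuracy": 10, "completeness": 20, "clarity": 20,
                      "relevance": 10, "creativity_and_voice": 40},
    "truthfulness":  {"accuracy": 40, "completeness": 15, "clarity": 15,
                      "relevance": 10, "epistemic_honesty": 20},
    "summarization": {"accuracy": 30, "completeness": 25, "clarity": 15,
                      "relevance": 10, "conciseness": 20},
    "instruction":   {"accuracy": 20, "completeness": 25, "clarity": 15,
                      "relevance": 10, "constraint_compliance": 30},
    "language":      {"accuracy": 30, "completeness": 20, "clarity": 20,
                      "relevance": 10, "linguistic_sophistication": 20},
}

# Critical-error caps from the rubrics: a detected critical error
# clamps the accuracy score to the category's maximum.
ACCURACY_CAPS = {"knowledge": 40, "math": 25, "code": 30,
                 "reasoning": 30, "truthfulness": 20}

def capped_accuracy(category: str, accuracy: int,
                    critical_error: bool) -> int:
    cap = ACCURACY_CAPS.get(category)
    return min(accuracy, cap) if critical_error and cap else accuracy

# Sanity check: every category's weights sum to 100%.
assert all(sum(w.values()) == 100 for w in CATEGORY_WEIGHTS.values())
```

Keeping the rubrics as data makes the per-category sums trivial to validate, as the final assert shows.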

Score Guide

What Scores Mean

90-100 Exceptional: no meaningful flaws; excellent on all dimensions
80-89 Very Good: strong performance with minor issues only
70-79 Good: solid response with some noticeable weaknesses
60-69 Acceptable: meets minimum requirements but has clear limitations
50-59 Mediocre: significant issues in execution or completeness
30-49 Poor: major flaws; fails on core requirements
0-29 Failed: completely wrong answer or critical error
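A short sketch of mapping an overall score to these bands; the boundaries come straight from the guide above, while the function name is hypothetical:

```python
# Map an overall 0-100 score to its band from the score guide.
def score_band(score: float) -> str:
    bands = [(90, "Exceptional"), (80, "Very Good"), (70, "Good"),
             (60, "Acceptable"), (50, "Mediocre"), (30, "Poor")]
    for floor, label in bands:
        if score >= floor:
            return label
    return "Failed"

print(score_band(78.80))  # -> "Good"
```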

Why Choose Our Benchmarks

Dimension-Level Reasoning

Don't just see a number—understand exactly why each model scored the way it did with written reasoning for every dimension.

Task-Appropriate Weights

A math benchmark shouldn't weight creativity. Our category-specific weights ensure fair evaluation for each task type.

Critical Error Detection

Wrong math answers and factual errors automatically cap scores. No model can score high while making fundamental mistakes.

Consistent Daily Testing

50 benchmark prompts run on a 50-day rotation. Rolling averages show consistent performance, not cherry-picked results.

Live Benchmark Data

See How AI Models Compare

View the live leaderboard with transparent scores, dimension breakdowns, and rolling averages across all benchmark categories.

Read Full Documentation

Updated daily at midnight GMT