How We Score AI Models
Our benchmark leaderboard uses a multi-dimensional LLM-as-Judge evaluation system with weighted rubrics to produce objective, transparent scores you can trust.
Why Transparent Scoring Matters
Most AI leaderboards use opaque evaluation methods or single-metric comparisons. Our system provides dimension-level transparency so you understand exactly why each model scored the way it did.
Full Transparency
See reasoning for every dimension score, not just a final number
Category-Specific Weights
Math tasks weight accuracy differently than creative writing tasks do
Rolling Averages
30-day averages show consistent performance, not lucky one-offs
5 Dimensions of Quality
Every AI response is evaluated on exactly 5 dimensions, each scored from 0 to 100. The first four are shared across every category; the fifth is category-specific.
Accuracy
Factual correctness, no false claims
Completeness
All aspects covered, nothing missing
Clarity
Well-organized, easy to follow
Relevance
Directly addresses the question
Category-Specific
Varies by task type (see below)
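As a rough illustration of how a judged response might be represented, here is a minimal sketch; the class and field names are assumptions for this page, not the leaderboard's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DimensionScore:
    """One of the five rubric dimensions, scored 0-100 with written reasoning."""
    name: str       # "accuracy", "completeness", "clarity", "relevance",
                    # or the category-specific fifth dimension
    score: int      # 0-100
    reasoning: str  # the judge's written justification for this dimension

@dataclass
class JudgedResponse:
    model: str                        # which AI model produced the response
    category: str                     # one of the 9 benchmark categories
    dimensions: list[DimensionScore]  # always exactly five entries
```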
How Scores Are Calculated
Example: Knowledge Category
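The weights below are illustrative placeholders (this page does not publish the real Knowledge weights), but the arithmetic is a simple weighted sum, as in this minimal sketch:

```python
# Hypothetical weights for the Knowledge & Facts category -- illustrative only.
KNOWLEDGE_WEIGHTS = {
    "accuracy": 0.40,
    "completeness": 0.20,
    "clarity": 0.15,
    "relevance": 0.15,
    "category_specific": 0.10,  # e.g. sourcing quality, assumed for Knowledge
}

def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine five 0-100 dimension scores into one 0-100 overall score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(dimension_scores[d] * w for d, w in weights.items())

# A response scoring 92 / 85 / 88 / 95 / 80 on the five dimensions:
overall = weighted_score(
    {"accuracy": 92, "completeness": 85, "clarity": 88,
     "relevance": 95, "category_specific": 80},
    KNOWLEDGE_WEIGHTS,
)
print(overall)  # 89.25
```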
LLM-as-Judge
Claude Sonnet 4 evaluates each response against category-specific rubrics. If the primary judge call fails, Gemini 2.0 Flash is used as a fallback.
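In outline, the fallback is just try-the-primary-then-the-secondary; `judge_with` below is a hypothetical stand-in for whatever API client the pipeline actually uses:

```python
PRIMARY_JUDGE = "claude-sonnet-4"
FALLBACK_JUDGE = "gemini-2.0-flash"

def judge_with(model: str, response: str, rubric: str) -> dict:
    """Hypothetical stand-in: call the given judge model and return
    dimension scores plus reasoning. Wire up a real LLM client here."""
    raise NotImplementedError

def judge_response(response: str, rubric: str) -> dict:
    """Try the primary judge; on any failure (timeout, rate limit,
    malformed output), retry once with the fallback judge."""
    try:
        return judge_with(PRIMARY_JUDGE, response, rubric)
    except Exception:
        return judge_with(FALLBACK_JUDGE, response, rubric)
```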
Rolling Averages
The leaderboard shows 30-day rolling averages by default; you can also switch to 1-day, 7-day, 60-day, 90-day, 180-day, or all-time views.
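A trailing-window average over daily scores is enough to implement this; the sketch below assumes one overall score per model per day:

```python
from datetime import date, timedelta

def rolling_average(daily_scores: dict[date, float],
                    window_days: int = 30,
                    as_of: date | None = None) -> float | None:
    """Average the daily scores inside the trailing window.

    daily_scores maps test date -> that day's overall score for one model.
    Returns None if the model has no scores in the window.
    """
    as_of = as_of or date.today()
    cutoff = as_of - timedelta(days=window_days)
    in_window = [s for d, s in daily_scores.items() if cutoff < d <= as_of]
    return sum(in_window) / len(in_window) if in_window else None
```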
9 Benchmark Categories
Each category weights the five dimensions differently: math tasks weight accuracy heavily, while creative writing prioritizes creativity (see the sketch after this list).
Knowledge & Facts
Mathematical Reasoning
Code Generation
Reasoning & Logic
Creative Writing
Truthfulness
Summarization
Instruction Following
Language Understanding
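To make the contrast concrete, per-category weights can be stored as plain data. The two entries below are illustrative guesses only; the leaderboard's actual weight tables are not published on this page:

```python
# Illustrative only: two of the nine categories, showing how the same five
# dimensions can carry very different weights. "category_specific" is the
# fifth dimension (e.g. proof rigor for math, creativity for creative writing).
CATEGORY_WEIGHTS = {
    "Mathematical Reasoning": {
        "accuracy": 0.50, "completeness": 0.15, "clarity": 0.10,
        "relevance": 0.10, "category_specific": 0.15,
    },
    "Creative Writing": {
        "accuracy": 0.10, "completeness": 0.15, "clarity": 0.20,
        "relevance": 0.15, "category_specific": 0.40,  # creativity
    },
}
```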
What Scores Mean
Why Choose Our Benchmarks
Dimension-Level Reasoning
Don't just see a number: understand exactly why each model scored the way it did, with written reasoning for every dimension.
Task-Appropriate Weights
A math benchmark shouldn't weight creativity. Our category-specific weights ensure fair evaluation for each task type.
Critical Error Detection
Wrong math answers and factual errors automatically cap scores, so no model can score high while making fundamental mistakes (a capping sketch follows).
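Score capping can be a one-line guard; the cap value below is a hypothetical placeholder, since the actual ceiling isn't stated on this page:

```python
CRITICAL_ERROR_CAP = 40  # hypothetical ceiling for responses with critical errors

def apply_critical_error_cap(overall_score: float,
                             has_critical_error: bool) -> float:
    """Clamp the overall score when the judge flags a fundamental mistake,
    e.g. a wrong final answer in math or a clear factual error."""
    return min(overall_score, CRITICAL_ERROR_CAP) if has_critical_error else overall_score
```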
Consistent Daily Testing
A fixed pool of 50 benchmark prompts cycles on a 50-day rotation, with tests run every day. Rolling averages reflect consistent performance, not cherry-picked results.
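Read literally (one prompt per day, cycling through the pool), the rotation could be as simple as indexing by days since a fixed start date; this is an assumed scheme for illustration, not a description of the actual scheduler:

```python
from datetime import date

PROMPTS = [f"prompt_{i:02d}" for i in range(50)]  # placeholder prompt IDs
ROTATION_START = date(2025, 1, 1)  # hypothetical start of the rotation

def todays_prompt(today: date | None = None) -> str:
    """Deterministically pick one of the 50 prompts, cycling every 50 days."""
    today = today or date.today()
    return PROMPTS[(today - ROTATION_START).days % len(PROMPTS)]
```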
See How AI Models Compare
View the live leaderboard with transparent scores, dimension breakdowns, and rolling averages across all benchmark categories.
Updated daily at midnight GMT