# Writing Quality Benchmark
Measures AI model prose quality using the stop-slop detector — PromptReports.ai's production filter that identifies writing patterns that erode trust and readability in research outputs.
## What It Measures
The Writing Quality benchmark tests whether AI-generated text reads like analyst-grade prose or generic AI output. It runs 50 fixed text samples through the stop-slop scoring engine — the same system PromptReports.ai applies to every report section before delivery. Samples include intentionally bad prose across 7 violation categories and clean analyst-style controls. Models pass when their output avoids these patterns and scores above 70/100.
- **50 Fixed Prompts**: The same 50 text samples run every time, split between intentionally bad examples and clean controls. No model sees different prompts.
- **Consistent Scoring**: The stop-slop detector is deterministic; the same input always produces the same score. No LLM judge variance.
- **5 Dimensions**: Directness, Rhythm, Trust, Authenticity, and Information Density, each scored separately.
- **Pass/Fail by Category**: Each of the 50 prompts has a known correct answer (should pass or should fail). Accuracy is measured against ground truth.
## Why Prose Quality Matters
Research reports are only as valuable as the trust readers place in them. AI writing patterns like excessive hedging ("might arguably suggest"), filler phrases ("it is important to note that"), and generic adjectives ("significant", "notable", "key") are signals that the content was generated by a machine following a template — not by an expert who knows what they are talking about. These patterns reduce perceived credibility even when the underlying analysis is correct.
PromptReports.ai applies the stop-slop filter to every report section. Models that consistently fail Writing Quality benchmarks produce reports that need more editorial cleanup before delivery. This benchmark identifies which models generate the cleanest prose out of the box.
## The 50-Prompt Test Set
The prompt set is fixed: every benchmark run uses the same 50 samples. Each sample is labeled with a ground truth of shouldPass: true (clean text that should score ≥ 70) or shouldPass: false (text with deliberate violations that should score below 70).
| Category | Count | Ground Truth | What It Contains |
|---|---|---|---|
| Hedging violations | 5 | Should FAIL | "arguably", "might", "could potentially", "seems to suggest" |
| Filler phrases | 5 | Should FAIL | "It is important to note", "at the end of the day", "in today's landscape" |
| Generic adjectives | 3 | Should FAIL | "significant", "notable", "key", "groundbreaking", "major", "important" |
| Passive voice | 3 | Should FAIL | "was published by", "were confirmed by", "has been validated" |
| WH-openers | 3 | Should FAIL | Sentences starting with What/Where/When/Why/How/Who as rhetorical devices |
| Binary contrasts | 2 | Should FAIL | "Not a setback. A strategic pivot." sentence fragment pairs |
| Em-dash overuse | 2 | Should FAIL | Three or more em-dashes per sentence or paragraph |
| Mixed violations | 3 | Should FAIL | Multiple violation types combined in one sample |
| Analyst clean | 5 | Should PASS | Data-driven, direct, specific — Goldman upgrades, Fed decisions, M&A multiples |
| Narrative clean | 4 | Should PASS | Event-driven narrative without hedging — IPO prices, leadership changes |
| Edge cases | 3 | Should PASS | Short, numeric, or list-format text with no violations |
| Clean controls | 10 | Should PASS | Specific, quantified, analyst-grade sentences with no AI slop patterns |
| TOTAL | 50 | — | — |
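Internally, each sample pairs text with its ground-truth label. A minimal sketch of how a labeled sample might be represented (the field names and example texts here are illustrative, not the production schema):

```typescript
// Illustrative shape for one benchmark sample; field names are assumptions.
interface WritingSample {
  id: number;
  category: string;    // e.g. "hedging", "analyst-clean"
  text: string;
  shouldPass: boolean; // ground truth: true means expected score >= 70
}

const samples: WritingSample[] = [
  {
    id: 1,
    category: "hedging",
    text: "The results might arguably suggest a possible improvement.",
    shouldPass: false,
  },
  {
    id: 2,
    category: "analyst-clean",
    text: "Revenue grew 23% year over year on a 40% rise in cloud bookings.",
    shouldPass: true,
  },
];

// Ground-truth split used later when measuring accuracy.
const expectedFailures = samples.filter((s) => !s.shouldPass).length;
```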
## The 7 Violation Categories
The stop-slop detector scans for seven categories of problematic AI writing patterns:
| Category | Examples | Why It Fails |
|---|---|---|
| Hedging | "arguably", "might", "could potentially", "seems to", "may indicate" | Signals uncertainty in claims that should be stated directly. Undermines credibility. |
| Filler Phrases | "It is important to note", "Furthermore", "In conclusion", "At the end of the day", "Here's the thing" | Padding that adds length without adding information. Marks text as AI-generated. |
| Generic Adjectives | "significant", "notable", "key", "important", "critical", "major", "groundbreaking" | Vague intensifiers that substitute for specific data. "Revenue grew significantly" vs "Revenue grew 23%". |
| Passive Voice | "was published by", "were confirmed", "has been validated", "will be announced" | Obscures agency and reduces clarity. "The team published" beats "was published by the team". |
| WH-Openers | "What makes this different…", "Why does this matter?", "How did this happen?" | Rhetorical framing devices that avoid making direct claims. Common in generic AI explanations. |
| Binary Contrasts | "Not a setback. A strategic pivot.", "Not cost-cutting. Efficiency optimization." | Performative reframing that reads as spin, not analysis. |
| Em-Dash Overuse | Three or more em-dashes in a single response, often alongside other patterns. | Overusing em-dashes to create artificial rhythm is one of the most reliable markers of AI-generated prose. |
## The Stop-Slop Scoring Engine
The scoreContent() function from lib/utils/stop-slop-filter is a deterministic, rule-based scorer. It does not use an LLM. It runs regex and token-level pattern matching across the five dimensions below, returning scores on a 0–100 scale per dimension and an overall weighted composite.
Because the scorer is deterministic, Writing Quality benchmark results have zero variance — the same text always produces the same score. This makes the benchmark ideal for regression testing after model updates.
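As a sketch of what deterministic, regex-based checks look like in practice (the pattern lists and 10- and 15-point penalties below are illustrative assumptions, not the production rules in lib/utils/stop-slop-filter):

```typescript
// Illustrative rule-based scoring: count pattern hits, deduct from 100.
// Pattern lists and penalty sizes are assumptions.
const HEDGING = /\b(arguably|might|could potentially|seems to|may indicate)\b/gi;
const FILLER = /\b(it is important to note|at the end of the day|in conclusion)\b/gi;
const EM_DASH = /\u2014/g;

function countMatches(text: string, re: RegExp): number {
  return (text.match(re) ?? []).length;
}

// Directness: penalize hedging words and filler phrases.
function scoreDirectness(text: string): number {
  const hits = countMatches(text, HEDGING) + countMatches(text, FILLER);
  return Math.max(0, 100 - hits * 10);
}

// Rhythm: penalize em-dash overuse (three or more in one passage).
function scoreRhythm(text: string): number {
  const dashes = countMatches(text, EM_DASH);
  return dashes >= 3 ? Math.max(0, 100 - dashes * 15) : 100;
}
```

Because every rule is a pure pattern match, rerunning the scorer on the same text always returns the same numbers.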
## The 5-Dimension Rubric
| Dimension | What It Measures | Violations That Hurt It |
|---|---|---|
| Directness | Are claims stated without hedging, filler, or rhetorical padding? | Hedging words, WH-openers, filler phrases, binary contrasts |
| Rhythm | Is the sentence structure varied and natural, not formulaic? | Em-dash overuse, repetitive sentence structures, passive voice patterns |
| Trust | Does the prose read as if written by a knowledgeable expert, not a generic AI? | Generic adjectives, filler phrases, hedging, binary contrasts |
| Authenticity | Is the writing specific and grounded in real detail? | Generic adjectives instead of specific numbers, passive voice obscuring agency |
| Information Density | Is each sentence carrying real informational weight? | Filler phrases, WH-openers, binary contrasts that restate rather than inform |
**Overall Score Formula**
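The exact dimension weights are not published here. As a sketch, an equally weighted composite over the five dimensions would look like this (equal 20% weights are an assumption; the production formula may weight dimensions differently):

```typescript
// Equal weights per dimension are an assumption, not the production values.
interface DimensionScores {
  directness: number;
  rhythm: number;
  trust: number;
  authenticity: number;
  density: number;
}

const WEIGHTS: DimensionScores = {
  directness: 0.2,
  rhythm: 0.2,
  trust: 0.2,
  authenticity: 0.2,
  density: 0.2,
};

// Weighted composite, rounded to the 0-100 scale used elsewhere.
function overallScore(d: DimensionScores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof DimensionScores)[]) {
    total += d[key] * WEIGHTS[key];
  }
  return Math.round(total);
}
```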
## Pass/Fail Threshold
A text sample passes when its overall score is ≥ 70/100. The benchmark measures pass rate (the fraction of the 50 prompts that score at or above the threshold) and accuracy against ground truth (the fraction of intentionally bad samples that correctly score below 70, plus the fraction of clean controls that correctly score at or above it).
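Both metrics fall out directly from the scored samples. A sketch (the ScoredSample shape is illustrative):

```typescript
// Pass rate and ground-truth accuracy over scored samples; threshold is 70.
interface ScoredSample {
  shouldPass: boolean; // ground truth label
  overall: number;     // 0-100 composite score
}

const THRESHOLD = 70;

function passRate(results: ScoredSample[]): number {
  return results.filter((r) => r.overall >= THRESHOLD).length / results.length;
}

function accuracy(results: ScoredSample[]): number {
  // A result is correct when the score lands on the side the label predicts.
  return (
    results.filter((r) => (r.overall >= THRESHOLD) === r.shouldPass).length /
    results.length
  );
}
```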
**What 70 Means**
A score of 70 is the dividing line built into the test set's ground truth: clean samples are expected to land at or above it, and samples with deliberate violations are expected to land below it.
## Benchmark Metrics
| Metric | Definition | Range |
|---|---|---|
| Pass Rate | Fraction of the 50 prompts that score ≥ 70. Primary leaderboard metric. | 0–100% |
| Average Overall Score | Mean overall score across all 50 prompts. | 0–100 |
| Average Directness | Mean directness score across all 50 prompts. | 0–100 |
| Average Rhythm | Mean rhythm score across all 50 prompts. | 0–100 |
| Average Trust | Mean trust score across all 50 prompts. | 0–100 |
| Average Authenticity | Mean authenticity score across all 50 prompts. | 0–100 |
| Average Density | Mean information density score across all 50 prompts. | 0–100 |
| Average Violations | Mean violation count per prompt. | Count |
| Trend | Direction of change vs prior runs: IMPROVING, STABLE, or DECLINING. | Categorical |
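The Trend metric compares the current run against prior runs. A sketch of one way such a classification could work (the 2-point stability band is an assumption):

```typescript
// Trend classification against the prior run's average overall score.
// The 2-point tolerance band is an assumption.
type Trend = "IMPROVING" | "STABLE" | "DECLINING";

function classifyTrend(current: number, prior: number, tolerance = 2): Trend {
  if (current - prior > tolerance) return "IMPROVING";
  if (prior - current > tolerance) return "DECLINING";
  return "STABLE";
}
```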
## Learning Signals
When a prompt fails (overall score < 70), the benchmark emits a BenchmarkLearningSignal with benchmarkType = WRITING_QUALITY and failureType set to the violation category that caused the failure.
The nightly learning loop reads these signals and adds new StopSlopPattern records for failure types that are not already in the library. This expands the stop-slop detection vocabulary over time, improving detection of emerging AI writing patterns.
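A sketch of that merge step (the record shapes mirror the names used in the text, but the full schemas are assumptions):

```typescript
// Nightly merge: add a StopSlopPattern for each failure type that is not
// already in the library. Record shapes are illustrative assumptions.
interface BenchmarkLearningSignal {
  benchmarkType: "WRITING_QUALITY";
  failureType: string; // violation category that caused the failure
}

interface StopSlopPattern {
  failureType: string;
}

function mergeSignals(
  library: StopSlopPattern[],
  signals: BenchmarkLearningSignal[],
): StopSlopPattern[] {
  const known = new Set(library.map((p) => p.failureType));
  const merged = [...library];
  for (const s of signals) {
    if (!known.has(s.failureType)) {
      known.add(s.failureType); // dedupe repeated signals within one batch
      merged.push({ failureType: s.failureType });
    }
  }
  return merged;
}
```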
## Reading the Results
**Interpreting Writing Quality Scores**
- Pass Rate > 80%: Model consistently writes clean, direct prose. Low editing overhead for report delivery.
- Pass Rate 60–80%: Mixed. Analyst-style prompts may pass; research summaries may need review.
- Pass Rate < 60%: High editing overhead. Model output is dense with AI-slop patterns.
- Check the radar chart on the benchmark page to see which dimension is dragging the score down.
- High Trust + Low Directness: Model sounds knowledgeable but hedges too much. Reduce temperature or prompt more directly.
- Low Authenticity: Model uses generic adjectives instead of specific data. Prompt for numerical specificity.