Writing Quality Benchmark

Measures AI model prose quality using the stop-slop detector — PromptReports.ai's production filter that identifies writing patterns that erode trust and readability in research outputs.

What It Measures

The Writing Quality benchmark tests whether AI-generated text reads like analyst-grade prose or generic AI output. It runs 50 fixed text samples through the stop-slop scoring engine — the same system PromptReports.ai applies to every report section before delivery. Samples include intentionally bad prose across 7 violation categories and clean analyst-style controls. Models pass when their output avoids these patterns and scores above 70/100.

50 Fixed Prompts

The same 50 text samples run every time — split between intentionally bad examples and clean controls. No model sees different prompts.

Consistent Scoring

The stop-slop detector is deterministic — same input always produces the same score. No LLM judge variance.

5 Dimensions

Directness, Rhythm, Trust, Authenticity, and Information Density — each scored separately.

Pass/Fail by Category

Each of the 50 prompts has a known correct answer (should pass or should fail). Accuracy is measured against ground truth.

Why Prose Quality Matters

Research reports are only as valuable as the trust readers place in them. AI writing patterns like excessive hedging ("might arguably suggest"), filler phrases ("it is important to note that"), and generic adjectives ("significant", "notable", "key") are signals that the content was generated by a machine following a template — not by an expert who knows what they are talking about. These patterns reduce perceived credibility even when the underlying analysis is correct.

PromptReports.ai applies the stop-slop filter to every report section. Models that consistently fail Writing Quality benchmarks produce reports that need more editorial cleanup before delivery. This benchmark identifies which models generate the cleanest prose out of the box.

The 50-Prompt Test Set

The prompt set is fixed — every benchmark run uses the same 50 samples. Each sample is labeled with a ground truth: shouldPass: true (clean text that should score ≥ 70) or shouldPass: false (text with deliberate violations that should score below 70).
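
The labeling scheme can be sketched as a small record type. Only the shouldPass field name comes from the docs above; the other field names and the two example samples are illustrative:

```typescript
// Hypothetical shape of one benchmark sample. `shouldPass` is the
// documented ground-truth label; `id`, `category`, and `text` are
// illustrative field names, not the production schema.
interface WritingSample {
  id: string;
  category: string;    // e.g. "hedging", "analyst-clean"
  text: string;
  shouldPass: boolean; // true = clean text that should score >= 70
}

const samples: WritingSample[] = [
  {
    id: "hedging-01",
    category: "hedging",
    text: "The results might arguably suggest a possible slowdown.",
    shouldPass: false,
  },
  {
    id: "analyst-clean-01",
    category: "analyst-clean",
    text: "Goldman upgraded the stock to Buy after revenue grew 23% year over year.",
    shouldPass: true,
  },
];
```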

| Category | Count | Ground Truth | What It Contains |
| --- | --- | --- | --- |
| Hedging violations | 5 | Should FAIL | "arguably", "might", "could potentially", "seems to suggest" |
| Filler phrases | 5 | Should FAIL | "It is important to note", "at the end of the day", "in today's landscape" |
| Generic adjectives | 3 | Should FAIL | "significant", "notable", "key", "groundbreaking", "major", "important" |
| Passive voice | 3 | Should FAIL | "was published by", "were confirmed by", "has been validated" |
| WH-openers | 3 | Should FAIL | Sentences starting with What/Where/When/Why/How/Who as rhetorical devices |
| Binary contrasts | 2 | Should FAIL | "Not a setback. A strategic pivot." sentence-fragment pairs |
| Em-dash overuse | 2 | Should FAIL | Three or more em-dashes per sentence or paragraph |
| Mixed violations | 3 | Should FAIL | Multiple violation types combined in one sample |
| Analyst clean | 5 | Should PASS | Data-driven, direct, specific: Goldman upgrades, Fed decisions, M&A multiples |
| Narrative clean | 4 | Should PASS | Event-driven narrative without hedging: IPO prices, leadership changes |
| Edge cases | 3 | Should PASS | Short, numeric, or list-format text with no violations |
| Clean controls | 10 | Should PASS | Specific, quantified, analyst-grade sentences with no AI slop patterns |
| TOTAL | 50 | | |

The 7 Violation Categories

The stop-slop detector scans for seven categories of problematic AI writing patterns:

| Category | Examples | Why It Fails |
| --- | --- | --- |
| Hedging | "arguably", "might", "could potentially", "seems to", "may indicate" | Signals uncertainty in claims that should be stated directly. Undermines credibility. |
| Filler Phrases | "It is important to note", "Furthermore", "In conclusion", "At the end of the day", "Here's the thing" | Padding that adds length without adding information. Marks text as AI-generated. |
| Generic Adjectives | "significant", "notable", "key", "important", "critical", "major", "groundbreaking" | Vague intensifiers that substitute for specific data: "Revenue grew significantly" vs. "Revenue grew 23%". |
| Passive Voice | "was published by", "were confirmed", "has been validated", "will be announced" | Obscures agency and reduces clarity. "The team published" beats "was published by the team". |
| WH-Openers | "What makes this different…", "Why does this matter?", "How did this happen?" | Rhetorical framing devices that avoid making direct claims. Common in generic AI explanations. |
| Binary Contrasts | "Not a setback. A strategic pivot.", "Not cost-cutting. Efficiency optimization." | Performative reframing that reads as spin, not analysis. |
| Em-Dash Overuse | Three or more em-dashes in a single response (the most reliable AI tell), alongside other patterns | Overuse of em-dashes to create fake rhythm is one of the most consistent markers of AI-generated prose. |
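
A rule-based check for a category like this reduces to pattern matching. The sketch below covers two of the seven categories; the regexes and the countViolations function are illustrative, not the production pattern lists inside stop-slop-filter:

```typescript
// Illustrative pattern lists for two violation categories. The real
// filter's vocabulary is larger and grows via the nightly learning loop.
const HEDGING =
  /\b(arguably|might|could potentially|seems? to suggest|may indicate)\b/gi;
const FILLER =
  /\b(it is important to note|at the end of the day|in today's landscape)\b/gi;

// Count raw pattern hits across both categories for a single text sample.
function countViolations(text: string): number {
  const hedges = text.match(HEDGING) ?? [];
  const fillers = text.match(FILLER) ?? [];
  return hedges.length + fillers.length;
}
```

For example, "It is important to note that this might arguably suggest growth." trips one filler phrase and two hedging words, so a scorer built this way would count three violations in a single sentence.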

The Stop-Slop Scoring Engine

The scoreContent() function from lib/utils/stop-slop-filter is a deterministic, rule-based scorer. It does not use an LLM. It runs regex and token-level pattern matching across the five dimensions below, returning scores on a 0–100 scale per dimension and an overall weighted composite.

Because the scorer is deterministic, Writing Quality benchmark results have zero variance — the same text always produces the same score. This makes the benchmark ideal for regression testing after model updates.
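
A weighted composite over the five dimensions might look like the sketch below. The equal weights are a placeholder assumption; the production weighting inside lib/utils/stop-slop-filter is not documented here:

```typescript
// The five documented scoring dimensions, each on a 0-100 scale.
interface DimensionScores {
  directness: number;
  rhythm: number;
  trust: number;
  authenticity: number;
  density: number;
}

// Placeholder equal weights; the real values are internal to the filter.
const WEIGHTS: Record<keyof DimensionScores, number> = {
  directness: 0.2,
  rhythm: 0.2,
  trust: 0.2,
  authenticity: 0.2,
  density: 0.2,
};

// Deterministic weighted mean: the same inputs always yield the same score.
function compositeScore(d: DimensionScores): number {
  return (Object.keys(WEIGHTS) as (keyof DimensionScores)[]).reduce(
    (sum, k) => sum + WEIGHTS[k] * d[k],
    0,
  );
}

const score = compositeScore({
  directness: 80,
  rhythm: 70,
  trust: 90,
  authenticity: 60,
  density: 75,
});
```

Because there is no LLM call anywhere in this path, re-running the same sample can never move the score, which is what makes zero-variance regression testing possible.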

The 5-Dimension Rubric

| Dimension | What It Measures | Violations That Hurt It |
| --- | --- | --- |
| Directness | Are claims stated without hedging, filler, or rhetorical padding? | Hedging words, WH-openers, filler phrases, binary contrasts |
| Rhythm | Is the sentence structure varied and natural, not formulaic? | Em-dash overuse, repetitive sentence structures, passive voice patterns |
| Trust | Does the prose read as if written by a knowledgeable expert, not a generic AI? | Generic adjectives, filler phrases, hedging, binary contrasts |
| Authenticity | Is the writing specific and grounded in real detail? | Generic adjectives instead of specific numbers, passive voice obscuring agency |
| Information Density | Is each sentence carrying real informational weight? | Filler phrases, WH-openers, binary contrasts that restate rather than inform |

Pass/Fail Threshold

A text sample passes when its overall score is ≥ 70/100. The benchmark reports two headline numbers: pass rate (the fraction of the 50 prompts that meet the threshold) and accuracy against ground truth (the fraction of samples whose pass/fail outcome matches their label — intentionally bad samples should score below 70, and clean controls should score 70 or above).
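
Both numbers reduce to simple counts over the scored samples. A sketch, with an illustrative result shape and made-up scores:

```typescript
// One scored sample: its ground-truth label and the score it received.
interface Result {
  shouldPass: boolean;
  score: number;
}

const THRESHOLD = 70;

// Pass rate: fraction of samples at or above the threshold.
function passRate(results: Result[]): number {
  return results.filter((r) => r.score >= THRESHOLD).length / results.length;
}

// Accuracy: fraction of samples whose pass/fail outcome matches ground truth.
function accuracy(results: Result[]): number {
  const correct = results.filter(
    (r) => (r.score >= THRESHOLD) === r.shouldPass,
  ).length;
  return correct / results.length;
}

const results: Result[] = [
  { shouldPass: true, score: 84 },  // clean control, correctly passes
  { shouldPass: false, score: 41 }, // hedging sample, correctly fails
  { shouldPass: false, score: 72 }, // violation that slipped past the threshold
];
```

Note that the two metrics can diverge: the third sample above raises the pass rate while lowering accuracy, which is why both are tracked.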

Benchmark Metrics

| Metric | Definition | Range |
| --- | --- | --- |
| Pass Rate | Fraction of the 50 prompts that score ≥ 70. Primary leaderboard metric. | 0–100% |
| Average Overall Score | Mean overall score across all 50 prompts. | 0–100 |
| Average Directness | Mean directness score across all 50 prompts. | 0–100 |
| Average Rhythm | Mean rhythm score across all 50 prompts. | 0–100 |
| Average Trust | Mean trust score across all 50 prompts. | 0–100 |
| Average Authenticity | Mean authenticity score across all 50 prompts. | 0–100 |
| Average Density | Mean information density score across all 50 prompts. | 0–100 |
| Average Violations | Mean violation count per prompt. | Count |
| Trend | Direction of change vs. prior runs: IMPROVING, STABLE, or DECLINING. | Categorical |
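
One plausible way to derive the Trend label is to compare the current average overall score against the prior run with a small tolerance band; the ±2-point band below is an invented example, not a documented rule:

```typescript
type Trend = "IMPROVING" | "STABLE" | "DECLINING";

// Classify the run-over-run direction. The +/-2 point band is an
// illustrative choice to keep small fluctuations labeled STABLE.
function classifyTrend(current: number, prior: number, band = 2): Trend {
  if (current - prior > band) return "IMPROVING";
  if (prior - current > band) return "DECLINING";
  return "STABLE";
}
```

With a band like this, a run that moves from an average of 72 to 78 would be labeled IMPROVING, while a drift from 72 to 71 stays STABLE.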

Learning Signals

When a prompt fails (overall score < 70), the benchmark emits a BenchmarkLearningSignal with benchmarkType = WRITING_QUALITY and failureType set to the violation category that caused the failure.

The nightly learning loop reads these signals and adds new StopSlopPattern records for failure types that are not already in the library. This expands the stop-slop detection vocabulary over time, improving detection of emerging AI writing patterns.
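
A sketch of that loop's core step follows. The BenchmarkLearningSignal and StopSlopPattern names come from the docs above; every field shape and the ingest helper are assumptions:

```typescript
// Assumed shape of an emitted signal; only the type names and the
// benchmarkType / failureType fields are documented.
interface BenchmarkLearningSignal {
  benchmarkType: "WRITING_QUALITY";
  failureType: string; // violation category that caused the failure
  promptId: string;
  score: number;
}

// Assumed shape of a pattern-library record.
interface StopSlopPattern {
  failureType: string;
  addedAt: string;
}

const library: StopSlopPattern[] = [
  { failureType: "hedging", addedAt: "2024-01-01T00:00:00.000Z" },
];

// Add a pattern only when its failure type is not already covered,
// so the library grows without duplicating existing entries.
function ingest(
  signal: BenchmarkLearningSignal,
  lib: StopSlopPattern[],
): StopSlopPattern[] {
  if (lib.some((p) => p.failureType === signal.failureType)) return lib;
  return [...lib, { failureType: signal.failureType, addedAt: new Date().toISOString() }];
}

const updated = ingest(
  { benchmarkType: "WRITING_QUALITY", failureType: "binary-contrast", promptId: "bc-01", score: 55 },
  library,
);
```

The dedup check is what keeps the nightly loop idempotent: re-ingesting a failure type that is already in the library leaves it unchanged.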

Reading the Results

View the live Writing Quality benchmark →