Quality Metrics

Understand and configure the metrics used to evaluate prompt quality, from built-in standards to custom domain-specific criteria.

Understanding Quality Metrics#

Quality metrics are standardized measurements used to evaluate how well a prompt's output meets expectations. They transform subjective quality assessment into objective, comparable numbers—enabling data-driven prompt optimization.

PromptReports uses AI-powered evaluation to score outputs automatically, providing:

Consistent Measurement

Same criteria applied uniformly across all evaluations for fair comparison.

Intelligent Assessment

AI evaluator understands context and nuance, not just keyword matching.

Multi-Dimensional

Capture different aspects of quality with multiple complementary metrics.

Explainable Scores

Get written rationale for each score, not just numbers.

Built-in Metrics#

PromptReports includes these standard quality metrics that work across most use cases:

| Metric | What It Measures | Key Indicators |
| --- | --- | --- |
| Relevance | Does the response address what was asked? | On-topic, answers the question, doesn't go off on tangents |
| Coherence | Is the response well-organized and clear? | Logical flow, clear transitions, easy to follow |
| Completeness | Are all required elements present? | Covers all parts of the request, nothing important missing |
| Accuracy | Is the information factually correct? | True statements, correct calculations, valid reasoning |
| Tone | Does it match the expected style? | Appropriate formality, consistent voice, suitable sentiment |
| Overall | Aggregate quality assessment | Weighted combination of all metrics |

Relevance

How well does the response actually answer what was asked? Does it stay on topic and address the core request?

Coherence

Is the response well-organized, logically structured, and easy to follow? Do ideas flow naturally?

Completeness

Are all required elements present? Does it fully address all parts of the request without omitting key details?

Accuracy

Is the information factually correct? Are calculations right and reasoning valid?

Metric Scoring Scale#

Each metric is scored on a consistent 1-5 scale:

| Score | Rating | Description | Action Needed |
| --- | --- | --- | --- |
| 5 | Excellent | Exceeds expectations, exemplary quality, no issues | None - this is the goal |
| 4 | Good | Meets expectations with only minor issues | Minor refinements possible |
| 3 | Acceptable | Functional but with notable issues | Improvements recommended |
| 2 | Poor | Significant problems affecting usefulness | Prompt revision needed |
| 1 | Failing | Unacceptable, major issues present | Significant rework required |

Half-point scores (e.g., 3.5, 4.5) are used when quality falls between two levels. The evaluator provides specific rationale for each score, explaining exactly what pushed the score up or down.

How Scoring Works#

PromptReports uses AI-powered evaluation to score outputs. Here's how the process works:

1. Input Analysis: The evaluator examines the original input and understands what was requested.
2. Output Examination: The actual output is analyzed against what the input requested.
3. Expected Comparison: If an expected output is provided, the evaluator compares against it.
4. Metric Assessment: Each enabled metric is evaluated independently using consistent criteria.
5. Rationale Generation: Written explanations are generated for each score.
6. Aggregate Calculation: Overall score is computed using configured weights.

Example Scoring Output
json
{
  "inputSummary": "Customer asking about return policy for electronics",
  "outputLength": 245,
  "scores": {
    "relevance": {
      "score": 5,
      "rationale": "Response directly addresses the return policy question with specific details about electronics. No off-topic content."
    },
    "coherence": {
      "score": 4,
      "rationale": "Well-organized with clear steps. Minor issue: could benefit from numbered list format."
    },
    "completeness": {
      "score": 4.5,
      "rationale": "Covers return window, process, and conditions. Slight deduction for not mentioning receipt requirements."
    },
    "accuracy": {
      "score": 5,
      "rationale": "All policy details match expected reference. No factual errors."
    },
    "tone": {
      "score": 5,
      "rationale": "Friendly, helpful, and professional. Appropriate for customer support context."
    }
  },
  "overall": 4.7,
  "summary": "High-quality response that effectively addresses the customer's question with accurate policy information and appropriate tone."
}
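
If you consume these results programmatically (for example, to surface low-scoring metrics in a dashboard), the structure above is straightforward to work with. The snippet below is a minimal sketch that assumes the scoring output has been saved to a JSON file with the fields shown; the file name and the 4.0 cutoff are illustrative, not PromptReports conventions.

Reading a Scoring Result (sketch)
python
import json

# Load a scoring result shaped like the "Example Scoring Output" above.
# The file name is illustrative; adapt it to wherever you store results.
with open("scoring_result.json") as f:
    result = json.load(f)

# Collect metrics that scored below an illustrative 4.0 cutoff,
# together with the evaluator's written rationale for each.
needs_attention = {
    metric: detail["rationale"]
    for metric, detail in result["scores"].items()
    if detail["score"] < 4.0
}

print(f"Overall: {result['overall']}")
for metric, rationale in needs_attention.items():
    print(f"- {metric}: {rationale}")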

Custom Metrics#

Create custom evaluation criteria for domain-specific quality requirements that built-in metrics don't capture:

1. Define the Metric: Give it a clear name and description of what it measures.
2. Specify Scoring Criteria: List the specific characteristics the evaluator should look for at each score level.
3. Provide Examples: Include examples of outputs at different score levels to calibrate the evaluator.
4. Set Weight: Determine how much this metric contributes to the overall score.
5. Test & Calibrate: Run evaluations and adjust criteria based on scoring behavior.

Custom Metric Definition
json
{
  "name": "Brand Voice",
  "description": "Alignment with company brand guidelines and communication standards",
  "scoringCriteria": {
    "5": "Perfect brand voice: uses all approved terminology, maintains consistent tone, embodies company values throughout",
    "4": "Strong brand voice with minor deviations: mostly aligned with guidelines, occasional neutral language",
    "3": "Acceptable brand voice: recognizable as company content but lacks some brand elements",
    "2": "Weak brand voice: generic language, missing brand personality, some off-brand phrasing",
    "1": "Off-brand: uses prohibited terms, inappropriate tone, or contradicts brand values"
  },
  "evaluationGuidelines": [
    "Check for use of approved terminology (see brand guide)",
    "Verify consistent first-person plural ('we') usage",
    "Confirm positive, solution-oriented framing",
    "Watch for prohibited phrases from blocklist"
  ],
  "examples": {
    "5": "We're here to help! Our team will have this sorted for you within 24 hours.",
    "3": "Your issue has been noted. Someone will contact you.",
    "1": "Unfortunately, that's not possible. You'll have to wait."
  },
  "weight": 1.5
}

Common custom metrics include:

Safety & Compliance

Check for prohibited content, required disclosures, or regulatory compliance.

Creativity

Measure originality, engaging language, or innovative problem-solving.

Format Adherence

Verify output follows required structure, length limits, or formatting rules.

Technical Accuracy

Assess domain-specific correctness for technical or specialized content.

Weighted Scoring#

Not all metrics are equally important for every use case. Configure weights to emphasize what matters most:

| Metric | Default Weight | Example: Support Bot | Example: Content Writer |
| --- | --- | --- | --- |
| Relevance | 1.0 | 1.5 (critical to answer correctly) | 1.0 (standard) |
| Coherence | 1.0 | 0.8 (less critical) | 1.5 (very important for readability) |
| Completeness | 1.0 | 1.2 (need full answers) | 1.0 (standard) |
| Accuracy | 1.0 | 1.5 (must be correct) | 0.8 (creative latitude) |
| Tone | 1.0 | 1.3 (brand representation) | 1.2 (engagement) |

Weight Configuration
json
{
  "weights": {
    "relevance": 1.5,
    "coherence": 0.8,
    "completeness": 1.2,
    "accuracy": 1.5,
    "tone": 1.3,
    "brand_voice": 1.5
  },
  "overallCalculation": "weighted_average",
  "minimumThresholds": {
    "accuracy": 4.0,
    "brand_voice": 3.5
  }
}

The overall score is calculated as a weighted average: sum(score * weight) / sum(weights). You can also set minimum thresholds for critical metrics—if any threshold isn't met, the case is flagged regardless of overall score.
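
As a concrete illustration, here is a minimal sketch of that calculation using the weights and minimum thresholds from the configuration above; the helper names are illustrative and not part of the PromptReports API.

Weighted Average Calculation (sketch)
python
# Sketch of the weighted-average overall score and minimum-threshold check
# described above. Function names are illustrative only.

def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average: sum(score * weight) / sum(weights)."""
    total = sum(scores[m] * weights.get(m, 1.0) for m in scores)
    return total / sum(weights.get(m, 1.0) for m in scores)

def threshold_failures(scores: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Metrics that fall below their configured minimum threshold."""
    return [m for m, floor in minimums.items() if scores.get(m, 0.0) < floor]

scores = {"relevance": 5, "coherence": 4, "completeness": 4.5, "accuracy": 5, "tone": 5}
weights = {"relevance": 1.5, "coherence": 0.8, "completeness": 1.2, "accuracy": 1.5, "tone": 1.3}

print(round(overall_score(scores, weights), 2))      # weighted overall score
print(threshold_failures(scores, {"accuracy": 4.0})) # [] -> nothing flagged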

Interpreting Scores#

Use these guidelines to translate numeric scores into actionable insights:

| Score Range | Quality Level | Interpretation | Recommended Action |
| --- | --- | --- | --- |
| 4.5 - 5.0 | Production-ready | Excellent quality suitable for customer-facing use | Deploy with confidence |
| 4.0 - 4.4 | Good | Solid quality with minor improvement opportunities | Deployable, consider refinements |
| 3.5 - 3.9 | Acceptable | Functional but noticeable issues exist | OK for internal use, improve for external |
| 3.0 - 3.4 | Needs Work | Significant gaps affecting user experience | Address issues before deployment |
| Below 3.0 | Failing | Major problems, not ready for use | Substantial revision required |
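
If you want to apply these bands automatically (for example, when labeling cases in a report), a simple mapping like the one sketched below works; the band boundaries mirror the table above, and the function name is illustrative.

Score Band Mapping (sketch)
python
# Map a numeric overall score to the quality bands from the table above.
# The function name and returned labels are illustrative.

def quality_band(score: float) -> str:
    if score >= 4.5:
        return "Production-ready"
    if score >= 4.0:
        return "Good"
    if score >= 3.5:
        return "Acceptable"
    if score >= 3.0:
        return "Needs Work"
    return "Failing"

print(quality_band(4.7))  # "Production-ready"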

Track Trends

Monitor how scores change across versions to ensure continuous improvement.

Watch Variance

High score variance indicates inconsistent prompt behavior that needs attention.

Category Analysis

Break down scores by input category to find specific weak points.

Distribution Shape

Check if scores cluster high (good) or spread out (inconsistent).
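
To make the variance and distribution checks concrete, a sketch like the following could compute the spread of overall scores across an evaluation's cases; the scores shown are made up for illustration.

Score Variance Check (sketch)
python
import statistics

# Overall scores for the cases in a single evaluation (made-up values).
case_scores = [4.7, 4.5, 3.2, 4.8, 4.6, 2.9, 4.7]

mean = statistics.mean(case_scores)
spread = statistics.stdev(case_scores)

# A large standard deviation relative to the mean suggests inconsistent
# prompt behavior worth investigating case by case.
print(f"mean={mean:.2f}, stdev={spread:.2f}")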

Quality Thresholds#

Set quality thresholds to create automated quality gates and alerts:

| Use Case | Recommended Threshold | Rationale |
| --- | --- | --- |
| Customer Support | 4.5 | High stakes, brand reputation, user trust |
| Content Marketing | 4.0 | Quality matters but some variation acceptable |
| Internal Tools | 3.5 | Efficiency-focused, expert users can handle imperfection |
| Prototyping | 3.0 | Exploring ideas, not production-ready |
| Data Processing | 4.0 (Accuracy: 4.5) | Overall can be moderate but accuracy critical |

Thresholds are used in multiple contexts:

  • Regression Testing: Block promotions if quality drops below threshold
  • Evaluation Reports: Highlight cases failing to meet threshold
  • Quality Alerts: Notify when production prompts fall below threshold
  • CI/CD Integration: Fail builds if evaluations don't pass
  • Pass Rate Calculation: Count percentage of cases meeting threshold
Threshold Configuration
json
{
  "thresholds": {
    "overall": 4.0,
    "accuracy": 4.5,
    "brand_voice": 4.0
  },
  "passRequirements": {
    "overall": ">=",
    "accuracy": ">="
  },
  "actions": {
    "onThresholdFailed": ["alert_owner", "block_promotion"],
    "onAllThresholdsPassed": ["allow_promotion"]
  },
  "alerts": {
    "recipients": ["team@example.com"],
    "frequency": "immediate"
  }
}
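
For teams wiring thresholds into their own tooling, such as a CI check or a pass-rate report, the sketch below shows one way the gate could work; the threshold values come from the configuration above, while the function and variable names are illustrative and not part of the PromptReports API.

Threshold Gate & Pass Rate (sketch)
python
# Sketch of a threshold gate and pass-rate calculation.
# Thresholds mirror the configuration above; names are illustrative.

THRESHOLDS = {"overall": 4.0, "accuracy": 4.5, "brand_voice": 4.0}

def passes(case_scores: dict[str, float]) -> bool:
    """A case passes only if every thresholded metric meets its minimum."""
    return all(case_scores.get(metric, 0.0) >= minimum
               for metric, minimum in THRESHOLDS.items())

def pass_rate(cases: list[dict[str, float]]) -> float:
    """Percentage of cases meeting all thresholds."""
    return 100.0 * sum(passes(c) for c in cases) / len(cases)

cases = [
    {"overall": 4.7, "accuracy": 5.0, "brand_voice": 4.5},
    {"overall": 4.1, "accuracy": 4.2, "brand_voice": 4.0},  # fails the accuracy threshold
]
print(f"pass rate: {pass_rate(cases):.0f}%")  # 50%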

Tracking Over Time#

PromptReports tracks quality metrics over time, enabling trend analysis and early detection of degradation:

Score Trends

Visualize how quality changes across versions and over time.

Historical Comparison

Compare any evaluation to historical baselines.

Degradation Alerts

Get notified when quality trends downward over multiple evaluations.

Metric Breakdown

Track individual metrics to see which aspects improve or decline.

Quality tracking helps you:

  • Verify that prompt changes actually improve quality
  • Detect gradual degradation from model updates or data drift
  • Demonstrate quality improvements to stakeholders
  • Identify the best-performing version for rollback if needed
  • Build confidence in your prompt development process
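
As one way to picture degradation detection, the sketch below compares the most recent evaluations against a historical baseline; the window size, tolerance, and scores are illustrative assumptions rather than PromptReports defaults.

Degradation Check (sketch)
python
import statistics

# Overall scores from successive evaluations, oldest first (made-up values).
history = [4.6, 4.7, 4.6, 4.5, 4.3, 4.2, 4.1]

baseline = statistics.mean(history[:-3])  # earlier evaluations
recent = statistics.mean(history[-3:])    # last three evaluations
tolerance = 0.2                           # illustrative drop that warrants a closer look

if baseline - recent > tolerance:
    print(f"Possible degradation: baseline {baseline:.2f} -> recent {recent:.2f}")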

Advanced Scoring#

For specialized evaluation needs, PromptReports supports advanced scoring configurations:

Custom Evaluator Prompts

Write your own evaluation prompts for complete control over scoring logic.

Multi-Model Scoring

Use different models for generation and evaluation to reduce bias.

Human-in-the-Loop

Route edge cases to human reviewers for manual scoring.

Programmatic Scoring

Add code-based checks for format validation, length limits, or regex patterns.

Advanced Scoring Configuration
json
{
  "scoring": {
    "aiScoring": {
      "enabled": true,
      "model": "gpt-4-turbo",
      "customPrompt": "Evaluate this customer service response...",
      "temperature": 0.1
    },
    "programmaticChecks": [
      {
        "name": "length_check",
        "type": "range",
        "field": "output_length",
        "min": 50,
        "max": 500,
        "scoreImpact": -1
      },
      {
        "name": "prohibited_words",
        "type": "regex_absent",
        "pattern": "(?i)(unfortunately|cannot|impossible)",
        "scoreImpact": -0.5
      }
    ],
    "humanReview": {
      "enabled": true,
      "triggerCondition": "score < 3.5 OR variance > 1.0",
      "assignTo": "quality_team"
    }
  }
}
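
To illustrate what the programmaticChecks above might do, here is a minimal sketch of the two checks in plain code; how the penalties combine with the AI score, and the function names, are assumptions for illustration rather than PromptReports internals.

Programmatic Checks (sketch)
python
import re

# Sketch of the two programmatic checks from the configuration above.
# The penalty values mirror the scoreImpact settings; everything else is illustrative.

def length_check(output: str, min_len: int = 50, max_len: int = 500) -> float:
    """Return a -1 penalty when the output length falls outside the allowed range."""
    return 0.0 if min_len <= len(output) <= max_len else -1.0

def prohibited_words(output: str) -> float:
    """Return a -0.5 penalty when any prohibited phrase appears."""
    pattern = re.compile(r"(?i)(unfortunately|cannot|impossible)")
    return -0.5 if pattern.search(output) else 0.0

output = "Unfortunately, that's impossible."
penalty = length_check(output) + prohibited_words(output)
print(penalty)  # -1.5: under 50 characters and contains prohibited wording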

Quality metrics are the foundation of systematic prompt improvement. With consistent measurement, you can make confident decisions about which prompts to deploy and continuously raise the bar on output quality.