Running Evaluations
Execute your prompts against test datasets, analyze quality metrics, and ensure consistent performance before deployment.
Evaluation Basics#
An evaluation runs your prompt against every row in a test dataset, collecting outputs and quality metrics for analysis. This systematic approach replaces ad-hoc testing with reproducible, quantifiable quality measurement.
Systematic Testing
Every test case is executed with identical conditions for fair comparison.
Quality Metrics
Automatic scoring across relevance, coherence, completeness, and more.
Version Tracking
Compare results across prompt versions to measure improvement.
Fast Execution
Parallel processing enables evaluation of hundreds of cases in minutes.
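Conceptually, an evaluation is a loop over dataset rows: each row's variables are rendered into the prompt, the model is called under identical settings, and the output is scored on the same metrics. The sketch below illustrates that flow; `call_model` and `score_output` are hypothetical placeholders, not part of the PromptReports API.

```python
# Conceptual sketch of an evaluation run: every dataset row is rendered into
# the prompt, executed under identical settings, and scored on the same metrics.
# call_model() and score_output() are hypothetical stand-ins for illustration.

def call_model(prompt: str) -> str:
    # Placeholder: in practice this calls your model provider's API.
    return f"(model response to: {prompt[:40]}...)"

def score_output(output: str, expected: str, metric: str) -> float:
    # Placeholder: PromptReports scores outputs automatically when Auto-Score
    # is enabled; this stub just returns a dummy value.
    return 4.0

def run_evaluation(prompt_template: str, dataset: list[dict], metrics: list[str]) -> list[dict]:
    results = []
    for row in dataset:  # one test case per dataset row
        rendered = prompt_template.format(**row["variables"])
        output = call_model(rendered)
        scores = {m: score_output(output, row.get("expectedOutput", ""), m) for m in metrics}
        results.append({"caseId": row["id"], "output": output, "scores": scores})
    return results

# Example usage with a two-row dataset:
dataset = [
    {"id": "case_1", "variables": {"topic": "refund policy"}, "expectedOutput": "..."},
    {"id": "case_2", "variables": {"topic": "shipping times"}, "expectedOutput": "..."},
]
print(run_evaluation("Summarize our {topic}.", dataset, ["relevance", "coherence"]))
```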
Running an Evaluation#
Follow these steps to run a comprehensive evaluation:
- Select Your Prompt
- Choose a Test Dataset
- Configure Settings
- Start the Evaluation
- Review Results
Processing Time
Run time depends on dataset size, model latency, and the number of parallel requests; with parallel processing, even datasets with hundreds of cases typically finish in minutes.
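As a rough back-of-the-envelope estimate (not an exact PromptReports calculation), total run time is roughly the number of cases divided by the parallel request count, multiplied by the average per-request latency, plus scoring overhead:

```python
# Rough processing-time estimate for an evaluation run (an approximation,
# not an exact PromptReports calculation).
total_cases = 100          # rows in the test dataset
parallel_requests = 5      # concurrent API calls (see Evaluation Settings)
avg_latency_s = 2.3        # average response time per case, in seconds

estimated_seconds = (total_cases / parallel_requests) * avg_latency_s
print(f"~{estimated_seconds:.0f}s before scoring overhead")  # ~46s
```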
Evaluation Settings#
Configure these settings before running an evaluation to control execution behavior and scoring:
| Setting | Description | Recommendation |
|---|---|---|
| Model | AI model to use for this evaluation | Use the same model you'll use in production |
| Temperature | Response randomness (0-1) | Lower values (0.3-0.5) for more consistent, reproducible results |
| Max Tokens | Maximum response length | Set based on expected output length plus 20% buffer |
| Parallel Requests | Number of concurrent API calls | 3-5 balances speed and reliability; higher risks rate limits |
| Request Timeout | Maximum wait time per request | 30-60 seconds for complex prompts; increase for long outputs |
| Retry Failed | Automatically retry failed requests | Enable with 2-3 retries and exponential backoff |
| Auto-Score | Enable automatic quality scoring | Yes, unless you prefer manual review |
| Score Model | Model used for scoring (if different) | Use a capable model like GPT-4 for accurate scoring |
```json
{
  "model": "gpt-4-turbo",
  "temperature": 0.4,
  "maxTokens": 2000,
  "parallelRequests": 5,
  "timeoutMs": 45000,
  "retryConfig": {
    "maxRetries": 3,
    "backoffMultiplier": 2
  },
  "scoring": {
    "enabled": true,
    "model": "gpt-4-turbo",
    "metrics": ["relevance", "coherence", "completeness", "accuracy", "tone"]
  }
}
```

Parallel Processing#
PromptReports processes multiple test cases simultaneously to reduce evaluation time. Configure parallelization based on your API rate limits and reliability requirements.
Speed vs. Reliability
Higher parallelism is faster but increases risk of rate limiting or errors.
Automatic Retry
Failed requests are automatically retried with exponential backoff.
Batch Ordering
Cases are processed in parallel but results maintain original order.
Rate Limit Awareness
Built-in throttling respects API rate limits to prevent failures.
| Parallel Requests | Speed | Reliability | Best For |
|---|---|---|---|
| 1-2 | Slow | Highest | Testing settings, debugging issues |
| 3-5 | Medium | High | Standard evaluations, most use cases |
| 6-10 | Fast | Medium | Large datasets, high API limits |
| 10+ | Fastest | Lower | Enterprise accounts with high limits |
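The pattern behind parallel execution with automatic retries is a worker pool plus exponential backoff. Below is a minimal sketch using Python's standard library; `execute_case` is a hypothetical placeholder for a single model API call, and the concurrency and retry values mirror the recommended settings above.

```python
# Sketch of parallel case execution with exponential-backoff retries.
# execute_case() is a hypothetical placeholder for one model API call;
# the concurrency and retry values mirror the recommended settings above.
import random
import time
from concurrent.futures import ThreadPoolExecutor

PARALLEL_REQUESTS = 5
MAX_RETRIES = 3
BACKOFF_MULTIPLIER = 2

def execute_case(case: dict) -> str:
    # Placeholder: call the model with this case's rendered prompt.
    if random.random() < 0.1:             # simulate an occasional transient failure
        raise RuntimeError("rate limited")
    return f"output for {case['id']}"

def execute_with_retry(case: dict) -> str:
    delay = 1.0
    for attempt in range(MAX_RETRIES + 1):
        try:
            return execute_case(case)
        except RuntimeError:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(delay)              # wait before retrying
            delay *= BACKOFF_MULTIPLIER    # exponential backoff
    raise RuntimeError("unreachable")

cases = [{"id": f"case_{i}"} for i in range(1, 21)]
with ThreadPoolExecutor(max_workers=PARALLEL_REQUESTS) as pool:
    # map() yields results in input order even though execution is concurrent
    results = list(pool.map(execute_with_retry, cases))
print(results[:3])
```

Note that `ThreadPoolExecutor.map` returns results in input order even though the calls run concurrently, which is the same property that lets the results dashboard preserve the original case order.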
Understanding Results#
After an evaluation completes, you'll see a comprehensive results dashboard:
| Metric | Description | Healthy Range |
|---|---|---|
| Overall Score | Weighted average across all quality metrics | 4.0+ for production-ready prompts |
| Pass Rate | Percentage of cases meeting quality threshold (default: 4.0) | 90%+ for critical customer-facing prompts |
| Avg Response Time | Mean response latency across all cases | Under 5 seconds for interactive use |
| Token Usage | Total input/output tokens consumed | Monitor for cost optimization |
| Error Rate | Percentage of failed API requests | Under 1% with retries enabled |
| Score Variance | Standard deviation of quality scores | Low variance indicates consistent behavior |
Score Distribution
Histogram showing how scores are distributed across test cases.
Category Breakdown
See average scores per category to identify problem areas.
Low Score Alerts
Automatic highlighting of cases scoring below threshold.
Performance Metrics
Latency percentiles (p50, p90, p99) for understanding response times.
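If you want to reproduce the headline numbers yourself (for example from an exported results file), pass rate, score variance, and latency percentiles reduce to a few lines of arithmetic. A sketch with illustrative values, assuming per-case overall scores on a 1-5 scale and latencies in milliseconds:

```python
# Sketch: recomputing dashboard metrics from per-case results
# (illustrative values; in practice these come from an exported evaluation).
import statistics

scores = [4.5, 3.8, 4.2, 4.9, 3.2, 4.6, 4.1, 4.7, 3.9, 4.4]       # overall score per case
latencies_ms = [1800, 2100, 2600, 1900, 3400, 2200, 2050, 2750, 1950, 2300]
PASS_THRESHOLD = 4.0                                               # default quality threshold

pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
overall_score = statistics.mean(scores)
score_stddev = statistics.stdev(scores)       # low variance = consistent behavior

def percentile(values, pct):
    ordered = sorted(values)
    index = min(round(pct / 100 * (len(ordered) - 1)), len(ordered) - 1)
    return ordered[index]

print(f"pass rate:     {pass_rate:.0%}")
print(f"overall score: {overall_score:.2f} (stddev {score_stddev:.2f})")
print(f"latency p50/p90/p99: "
      f"{percentile(latencies_ms, 50)} / {percentile(latencies_ms, 90)} / {percentile(latencies_ms, 99)} ms")
```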
Focus on Weak Points
Concentrate your review time on the cases flagged below the quality threshold; they show most directly where the prompt needs adjustment.
Reviewing Individual Outputs#
Drill down into individual test case results to understand quality patterns and identify specific improvements:
Full Output View
See the complete response with syntax highlighting and formatting.
Score Breakdown
View individual metric scores with detailed scoring rationale.
Input Context
See the exact variable values and context used for each case.
Performance Data
Response time and token counts for each individual execution.
For each test case, you can:
- Compare to Expected: Side-by-side view of actual vs. expected output
- View Scoring Rationale: Detailed explanation of why each metric received its score
- Flag for Review: Mark cases that need human review or prompt adjustment
- Add Notes: Document observations for future reference
- Re-run Single Case: Execute just this case with different settings
Comparing Evaluations#
Compare evaluation results across different prompt versions, models, or settings to make data-driven decisions:
- Select Evaluations to Compare
- View Side-by-Side Metrics
- Analyze Case-by-Case Differences
- Generate Comparison Report
Statistical Significance
With small datasets, differences between evaluations may be noise rather than genuine improvement; use larger datasets or a paired significance check (see the sketch below) before drawing conclusions.
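When both evaluations ran against the same test dataset, the per-case scores are paired, so a paired t-test is a reasonable significance check. A minimal sketch using SciPy (a third-party dependency; the scores here are illustrative, not PromptReports output):

```python
# Sketch: paired comparison of two evaluations run on the same test dataset.
# Scores are illustrative; SciPy's paired t-test checks whether the mean
# per-case difference is distinguishable from zero.
from statistics import mean
from scipy.stats import ttest_rel

version_4_scores = [4.1, 3.8, 4.3, 3.9, 4.0, 4.2, 3.7, 4.4, 4.1, 3.9]
version_5_scores = [4.4, 4.0, 4.5, 4.1, 4.3, 4.2, 4.0, 4.6, 4.2, 4.1]

diff = mean(version_5_scores) - mean(version_4_scores)
stat, p_value = ttest_rel(version_5_scores, version_4_scores)

print(f"mean improvement: {diff:+.2f}")
print(f"p-value: {p_value:.3f}  "
      f"({'likely real' if p_value < 0.05 else 'could be noise'} at the 5% level)")
```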
Regression Testing#
Regression testing automatically compares new prompt versions against a baseline to prevent quality degradation:
Baseline Comparison
New versions are compared against the current production baseline.
Regression Detection
Automatic alerts when quality drops below the baseline threshold.
Quality Gates
Block promotions if quality regresses more than the allowed percentage.
Promotion Blocking
Prevent deployment of versions that fail regression tests.
| Threshold | Behavior | Use Case |
|---|---|---|
| 0% (No Regression) | Block if any metric is lower | Critical customer-facing prompts |
| 5% (Default) | Block if overall score drops more than 5% | Standard production prompts |
| 10% (Lenient) | Allow small regressions for faster iteration | Internal tools, experimentation |
| Custom per Metric | Different thresholds for each quality metric | Nuanced quality requirements |
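The quality gate itself is a simple comparison against the baseline. The sketch below illustrates the default 5% rule; the helper function is hypothetical, not the PromptReports implementation.

```python
# Sketch of a regression quality gate: block promotion if the candidate's
# overall score drops more than the allowed percentage below the baseline.
# Illustration of the rule only, not the PromptReports implementation.

def passes_regression_gate(baseline_score: float,
                           candidate_score: float,
                           allowed_regression_pct: float = 5.0) -> bool:
    floor = baseline_score * (1 - allowed_regression_pct / 100)
    return candidate_score >= floor

baseline = 4.23        # current production version
candidate = 4.05       # new version under test

if passes_regression_gate(baseline, candidate):   # 4.05 >= 4.0185 -> allowed
    print("within threshold: promotion allowed")
else:
    print("regression detected: promotion blocked")
```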
To set up regression testing:
- Navigate to your prompt's Settings tab
- Enable Regression Testing
- Select a baseline version (usually the current production version)
- Choose a test dataset to use for regression checks
- Set your regression threshold (default: 5%)
- Optionally enable automatic regression testing on version creation
Exporting Results#
Export evaluation results for further analysis, reporting, or integration with external tools:
| Format | Contents | Best For |
|---|---|---|
| CSV | All test cases with inputs, outputs, and scores | Spreadsheet analysis, sharing with stakeholders |
| JSON | Complete evaluation data including metadata | Integration with other tools, programmatic analysis |
| PDF Report | Formatted summary with charts and key metrics | Executive summaries, documentation, audits |
| HTML Report | Interactive report for web viewing | Sharing via email or internal portals |
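The JSON export is the most convenient format for programmatic analysis; a full example of its structure appears below. As a minimal sketch, once the file is downloaded (the file name here is hypothetical), you can flag cases that fall below the pass threshold:

```python
# Sketch: post-processing a JSON export to list low-scoring cases.
# The file name is hypothetical; the structure mirrors the example export below.
import json
from statistics import mean

with open("eval_abc123.json") as f:
    evaluation = json.load(f)

PASS_THRESHOLD = 4.0
for case in evaluation["results"]:
    case_score = mean(case["scores"].values())
    if case_score < PASS_THRESHOLD:
        print(f"{case['caseId']}: avg score {case_score:.2f} "
              f"({case['latencyMs']} ms, {case['tokensUsed']} tokens)")
```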
```json
{
  "evaluationId": "eval_abc123",
  "promptId": "prompt_xyz",
  "promptVersion": 5,
  "datasetId": "dataset_456",
  "datasetVersion": 2,
  "runAt": "2024-01-15T10:30:00Z",
  "config": { /* evaluation settings */ },
  "summary": {
    "totalCases": 100,
    "passRate": 0.92,
    "overallScore": 4.23,
    "avgLatencyMs": 2340,
    "totalTokens": 125000
  },
  "results": [
    {
      "caseId": "case_1",
      "input": { /* input variables */ },
      "output": "...",
      "expectedOutput": "...",
      "scores": { "relevance": 4, "coherence": 5, /* ... */ },
      "latencyMs": 2100,
      "tokensUsed": 1250
    }
    /* ... more cases */
  ]
}
```

Scheduling Evaluations#
Schedule evaluations to run automatically for continuous quality monitoring:
Periodic Runs
Schedule daily, weekly, or monthly evaluations on production prompts.
Quality Alerts
Get notified if scheduled evaluations detect quality degradation.
Trend Analysis
Track quality metrics over time to spot gradual degradation.
Auto-Baseline
Automatically update baselines when quality improves.
Scheduled evaluations are especially useful for:
- Model Updates: Detect if upstream model changes affect your prompt's quality
- Long-Term Monitoring: Track quality trends over weeks or months
- Compliance: Maintain audit trails of quality measurements
- SLA Monitoring: Ensure prompts consistently meet quality requirements
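For long-term monitoring, the summaries from scheduled runs can be compared over time to catch gradual degradation that no single run would flag. A sketch with illustrative data (in practice the summaries would come from exported evaluation results, and the alert would go to your notification channel):

```python
# Sketch: spotting gradual quality degradation across scheduled evaluation runs.
# The summaries are illustrative; in practice they would come from exported
# evaluation data, and the alert would go to your notification channel.
weekly_summaries = [
    {"runAt": "2024-01-01", "overallScore": 4.31},
    {"runAt": "2024-01-08", "overallScore": 4.28},
    {"runAt": "2024-01-15", "overallScore": 4.23},
    {"runAt": "2024-01-22", "overallScore": 4.12},
    {"runAt": "2024-01-29", "overallScore": 4.02},
]

DRIFT_THRESHOLD = 0.2   # alert if the latest run has slipped this far from the first

baseline = weekly_summaries[0]["overallScore"]
latest = weekly_summaries[-1]["overallScore"]

if baseline - latest > DRIFT_THRESHOLD:
    print(f"Quality drift detected: {baseline:.2f} -> {latest:.2f} "
          f"({weekly_summaries[0]['runAt']} to {weekly_summaries[-1]['runAt']})")
```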