Running Evaluations
Execute your prompts against test datasets, analyze quality metrics, and ensure consistent performance before deployment.
Evaluation Basics#
An evaluation runs your prompt against every row in a test dataset, collecting outputs and quality metrics for analysis. This systematic approach replaces ad-hoc testing with reproducible, quantifiable quality measurement.
Systematic Testing
Every test case is executed with identical conditions for fair comparison.
Quality Metrics
Automatic scoring across relevance, coherence, completeness, and more.
Version Tracking
Compare results across prompt versions to measure improvement.
Fast Execution
Parallel processing enables evaluation of hundreds of cases in minutes.
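Conceptually, an evaluation is a loop over dataset rows: each row's variables are rendered into the prompt, the model is called under identical settings, and the output is scored on the same metrics. The sketch below illustrates that flow; `call_model` and `score_output` are hypothetical placeholders, not part of the PromptReports API.

```python
# Conceptual sketch of an evaluation run: every dataset row is rendered into
# the prompt, executed under identical settings, and scored on the same metrics.
# call_model() and score_output() are hypothetical stand-ins for illustration.

def call_model(prompt: str) -> str:
    # Placeholder: in practice this calls your model provider's API.
    return f"(model response to: {prompt[:40]}...)"

def score_output(output: str, expected: str, metric: str) -> float:
    # Placeholder: PromptReports scores outputs automatically when Auto-Score
    # is enabled; this stub just returns a dummy value.
    return 4.0

def run_evaluation(prompt_template: str, dataset: list[dict], metrics: list[str]) -> list[dict]:
    results = []
    for row in dataset:  # one test case per dataset row
        rendered = prompt_template.format(**row["variables"])
        output = call_model(rendered)
        scores = {m: score_output(output, row.get("expectedOutput", ""), m) for m in metrics}
        results.append({"caseId": row["id"], "output": output, "scores": scores})
    return results

# Example usage with a two-row dataset:
dataset = [
    {"id": "case_1", "variables": {"topic": "refund policy"}, "expectedOutput": "..."},
    {"id": "case_2", "variables": {"topic": "shipping times"}, "expectedOutput": "..."},
]
print(run_evaluation("Summarize our {topic}.", dataset, ["relevance", "coherence"]))
```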
Running an Evaluation#
Follow these steps to run a comprehensive evaluation:
- Select Your Prompt
- Choose a Test Dataset
- Configure Settings
- Start the Evaluation
- Review Results
Processing Time
Run time depends on dataset size, model latency, and the number of parallel requests; with parallel processing, even datasets with hundreds of cases typically finish in minutes.
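As a rough back-of-the-envelope estimate (not an exact PromptReports calculation), total run time is roughly the number of cases divided by the parallel request count, multiplied by the average per-request latency, plus scoring overhead:

```python
# Rough processing-time estimate for an evaluation run (an approximation,
# not an exact PromptReports calculation).
total_cases = 100          # rows in the test dataset
parallel_requests = 5      # concurrent API calls (see Evaluation Settings)
avg_latency_s = 2.3        # average response time per case, in seconds

estimated_seconds = (total_cases / parallel_requests) * avg_latency_s
print(f"~{estimated_seconds:.0f}s before scoring overhead")  # ~46s
```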
Evaluation Settings#
Configure these settings before running an evaluation to control execution behavior and scoring:
| Setting | Description | Recommendation |
|---|---|---|
| Model | AI model to use for this evaluation | Use the same model you'll use in production |
| Temperature | Response randomness (0-1) | Lower values (0.3-0.5) for more consistent, reproducible results |
| Max Tokens | Maximum response length | Set based on expected output length plus 20% buffer |
| Parallel Requests | Number of concurrent API calls | 3-5 balances speed and reliability; higher risks rate limits |
| Request Timeout | Maximum wait time per request | 30-60 seconds for complex prompts; increase for long outputs |
| Retry Failed | Automatically retry failed requests | Enable with 2-3 retries and exponential backoff |
| Auto-Score | Enable automatic quality scoring | Yes, unless you prefer manual review |
| Score Model | Model used for scoring (if different) | Use a capable model like GPT-4 for accurate scoring |
```json
{
  "model": "gpt-4-turbo",
  "temperature": 0.4,
  "maxTokens": 2000,
  "parallelRequests": 5,
  "timeoutMs": 45000,
  "retryConfig": {
    "maxRetries": 3,
    "backoffMultiplier": 2
  },
  "scoring": {
    "enabled": true,
    "model": "gpt-4-turbo",
    "metrics": ["relevance", "coherence", "completeness", "accuracy", "tone"]
  }
}
```

Parallel Processing#
PromptReports processes multiple test cases simultaneously to reduce evaluation time. Configure parallelization based on your API rate limits and reliability requirements.
Speed vs. Reliability
Higher parallelism is faster but increases risk of rate limiting or errors.
Automatic Retry
Failed requests are automatically retried with exponential backoff.
Batch Ordering
Cases are processed in parallel but results maintain original order.
Rate Limit Awareness
Built-in throttling respects API rate limits to prevent failures.
| Parallel Requests | Speed | Reliability | Best For |
|---|---|---|---|
| 1-2 | Slow | Highest | Testing settings, debugging issues |
| 3-5 | Medium | High | Standard evaluations, most use cases |
| 6-10 | Fast | Medium | Large datasets, high API limits |
| 10+ | Fastest | Lower | Enterprise accounts with high limits |
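The pattern behind parallel execution with automatic retries is a worker pool plus exponential backoff. Below is a minimal sketch using Python's standard library; `execute_case` is a hypothetical placeholder for a single model API call, and the concurrency and retry values mirror the recommended settings above.

```python
# Sketch of parallel case execution with exponential-backoff retries.
# execute_case() is a hypothetical placeholder for one model API call;
# the concurrency and retry values mirror the recommended settings above.
import random
import time
from concurrent.futures import ThreadPoolExecutor

PARALLEL_REQUESTS = 5
MAX_RETRIES = 3
BACKOFF_MULTIPLIER = 2

def execute_case(case: dict) -> str:
    # Placeholder: call the model with this case's rendered prompt.
    if random.random() < 0.1:             # simulate an occasional transient failure
        raise RuntimeError("rate limited")
    return f"output for {case['id']}"

def execute_with_retry(case: dict) -> str:
    delay = 1.0
    for attempt in range(MAX_RETRIES + 1):
        try:
            return execute_case(case)
        except RuntimeError:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(delay)              # wait before retrying
            delay *= BACKOFF_MULTIPLIER    # exponential backoff
    raise RuntimeError("unreachable")

cases = [{"id": f"case_{i}"} for i in range(1, 21)]
with ThreadPoolExecutor(max_workers=PARALLEL_REQUESTS) as pool:
    # map() yields results in input order even though execution is concurrent
    results = list(pool.map(execute_with_retry, cases))
print(results[:3])
```

Note that `ThreadPoolExecutor.map` returns results in input order even though the calls run concurrently, which is the same property that lets the results dashboard preserve the original case order.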
Understanding Results#
After an evaluation completes, you'll see a comprehensive results dashboard:
| Metric | Description | Healthy Range |
|---|---|---|
| Overall Score | Weighted average across all quality metrics | 4.0+ for production-ready prompts |
| Pass Rate | Percentage of cases meeting quality threshold (default: 4.0) | 90%+ for critical customer-facing prompts |
| Avg Response Time | Mean response latency across all cases | Under 5 seconds for interactive use |
| Token Usage | Total input/output tokens consumed | Monitor for cost optimization |
| Error Rate | Percentage of failed API requests | Under 1% with retries enabled |
| Score Variance | Standard deviation of quality scores | Low variance indicates consistent behavior |
Score Distribution
Histogram showing how scores are distributed across test cases.
Category Breakdown
See average scores per category to identify problem areas.
Low Score Alerts
Automatic highlighting of cases scoring below threshold.
Performance Metrics
Latency percentiles (p50, p90, p99) for understanding response times.
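If you want to reproduce the headline numbers yourself (for example from an exported results file), pass rate, score variance, and latency percentiles reduce to a few lines of arithmetic. A sketch with illustrative values, assuming per-case overall scores on a 1-5 scale and latencies in milliseconds:

```python
# Sketch: recomputing dashboard metrics from per-case results
# (illustrative values; in practice these come from an exported evaluation).
import statistics

scores = [4.5, 3.8, 4.2, 4.9, 3.2, 4.6, 4.1, 4.7, 3.9, 4.4]       # overall score per case
latencies_ms = [1800, 2100, 2600, 1900, 3400, 2200, 2050, 2750, 1950, 2300]
PASS_THRESHOLD = 4.0                                               # default quality threshold

pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
overall_score = statistics.mean(scores)
score_stddev = statistics.stdev(scores)       # low variance = consistent behavior

def percentile(values, pct):
    ordered = sorted(values)
    index = min(round(pct / 100 * (len(ordered) - 1)), len(ordered) - 1)
    return ordered[index]

print(f"pass rate:     {pass_rate:.0%}")
print(f"overall score: {overall_score:.2f} (stddev {score_stddev:.2f})")
print(f"latency p50/p90/p99: "
      f"{percentile(latencies_ms, 50)} / {percentile(latencies_ms, 90)} / {percentile(latencies_ms, 99)} ms")
```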
Focus on Weak Points
Concentrate your review time on the cases flagged below the quality threshold; they show most directly where the prompt needs adjustment.
Reviewing Individual Outputs#
Drill down into individual test case results to understand quality patterns and identify specific improvements:
Full Output View
See the complete response with syntax highlighting and formatting.
Score Breakdown
View individual metric scores with detailed scoring rationale.
Input Context
See the exact variable values and context used for each case.
Performance Data
Response time and token counts for each individual execution.
For each test case, you can:
- Compare to Expected: Side-by-side view of actual vs. expected output
- View Scoring Rationale: Detailed explanation of why each metric received its score
- Flag for Review: Mark cases that need human review or prompt adjustment
- Add Notes: Document observations for future reference
- Re-run Single Case: Execute just this case with different settings
Comparing Evaluations#
Compare evaluation results across different prompt versions, models, or settings to make data-driven decisions:
- Select Evaluations to Compare
- View Side-by-Side Metrics
- Analyze Case-by-Case Differences
- Generate Comparison Report
Statistical Significance
With small datasets, differences between evaluations may be noise rather than genuine improvement; use larger datasets or a paired significance check (see the sketch below) before drawing conclusions.
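When both evaluations ran against the same test dataset, the per-case scores are paired, so a paired t-test is a reasonable significance check. A minimal sketch using SciPy (a third-party dependency; the scores here are illustrative, not PromptReports output):

```python
# Sketch: paired comparison of two evaluations run on the same test dataset.
# Scores are illustrative; SciPy's paired t-test checks whether the mean
# per-case difference is distinguishable from zero.
from statistics import mean
from scipy.stats import ttest_rel

version_4_scores = [4.1, 3.8, 4.3, 3.9, 4.0, 4.2, 3.7, 4.4, 4.1, 3.9]
version_5_scores = [4.4, 4.0, 4.5, 4.1, 4.3, 4.2, 4.0, 4.6, 4.2, 4.1]

diff = mean(version_5_scores) - mean(version_4_scores)
stat, p_value = ttest_rel(version_5_scores, version_4_scores)

print(f"mean improvement: {diff:+.2f}")
print(f"p-value: {p_value:.3f}  "
      f"({'likely real' if p_value < 0.05 else 'could be noise'} at the 5% level)")
```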
Regression Testing#
Regression testing automatically compares new prompt versions against a baseline to prevent quality degradation:
Baseline Comparison
New versions are compared against the current production baseline.
Regression Detection
Automatic alerts when quality drops below the baseline threshold.
Quality Gates
Block promotions if quality regresses more than the allowed percentage.
Promotion Blocking
Prevent deployment of versions that fail regression tests.
| Threshold | Behavior | Use Case |
|---|---|---|
| 0% (No Regression) | Block if any metric is lower | Critical customer-facing prompts |
| 5% (Default) | Block if overall score drops more than 5% | Standard production prompts |
| 10% (Lenient) | Allow small regressions for faster iteration | Internal tools, experimentation |
| Custom per Metric | Different thresholds for each quality metric | Nuanced quality requirements |
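The quality gate itself is a simple comparison against the baseline. The sketch below illustrates the default 5% rule; the helper function is hypothetical, not the PromptReports implementation.

```python
# Sketch of a regression quality gate: block promotion if the candidate's
# overall score drops more than the allowed percentage below the baseline.
# Illustration of the rule only, not the PromptReports implementation.

def passes_regression_gate(baseline_score: float,
                           candidate_score: float,
                           allowed_regression_pct: float = 5.0) -> bool:
    floor = baseline_score * (1 - allowed_regression_pct / 100)
    return candidate_score >= floor

baseline = 4.23        # current production version
candidate = 4.05       # new version under test

if passes_regression_gate(baseline, candidate):   # 4.05 >= 4.0185 -> allowed
    print("within threshold: promotion allowed")
else:
    print("regression detected: promotion blocked")
```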
To set up regression testing:
- Navigate to your prompt's Settings tab
- Enable Regression Testing
- Select a baseline version (usually the current production version)
- Choose a test dataset to use for regression checks
- Set your regression threshold (default: 5%)
- Optionally enable automatic regression testing on version creation
Exporting Results#
Export evaluation results for further analysis, reporting, or integration with external tools:
| Format | Contents | Best For |
|---|---|---|
| CSV | All test cases with inputs, outputs, and scores | Spreadsheet analysis, sharing with stakeholders |
| JSON | Complete evaluation data including metadata | Integration with other tools, programmatic analysis |
| PDF Report | Formatted summary with charts and key metrics | Executive summaries, documentation, audits |
| HTML Report | Interactive report for web viewing | Sharing via email or internal portals |
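The JSON export is the most convenient format for programmatic analysis; a full example of its structure appears below. As a minimal sketch, once the file is downloaded (the file name here is hypothetical), you can flag cases that fall below the pass threshold:

```python
# Sketch: post-processing a JSON export to list low-scoring cases.
# The file name is hypothetical; the structure mirrors the example export below.
import json
from statistics import mean

with open("eval_abc123.json") as f:
    evaluation = json.load(f)

PASS_THRESHOLD = 4.0
for case in evaluation["results"]:
    case_score = mean(case["scores"].values())
    if case_score < PASS_THRESHOLD:
        print(f"{case['caseId']}: avg score {case_score:.2f} "
              f"({case['latencyMs']} ms, {case['tokensUsed']} tokens)")
```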
```json
{
  "evaluationId": "eval_abc123",
  "promptId": "prompt_xyz",
  "promptVersion": 5,
  "datasetId": "dataset_456",
  "datasetVersion": 2,
  "runAt": "2024-01-15T10:30:00Z",
  "config": { /* evaluation settings */ },
  "summary": {
    "totalCases": 100,
    "passRate": 0.92,
    "overallScore": 4.23,
    "avgLatencyMs": 2340,
    "totalTokens": 125000
  },
  "results": [
    {
      "caseId": "case_1",
      "input": { /* input variables */ },
      "output": "...",
      "expectedOutput": "...",
      "scores": { "relevance": 4, "coherence": 5, /* ... */ },
      "latencyMs": 2100,
      "tokensUsed": 1250
    }
    /* ... more cases */
  ]
}
```

Scheduling Evaluations#
Schedule evaluations to run automatically for continuous quality monitoring:
Periodic Runs
Schedule daily, weekly, or monthly evaluations on production prompts.
Quality Alerts
Get notified if scheduled evaluations detect quality degradation.
Trend Analysis
Track quality metrics over time to spot gradual degradation.
Auto-Baseline
Automatically update baselines when quality improves.
Scheduled evaluations are especially useful for:
- Model Updates: Detect if upstream model changes affect your prompt's quality
- Long-Term Monitoring: Track quality trends over weeks or months
- Compliance: Maintain audit trails of quality measurements
- SLA Monitoring: Ensure prompts consistently meet quality requirements
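For long-term monitoring, the summaries from scheduled runs can be compared over time to catch gradual degradation that no single run would flag. A sketch with illustrative data (in practice the summaries would come from exported evaluation results, and the alert would go to your notification channel):

```python
# Sketch: spotting gradual quality degradation across scheduled evaluation runs.
# The summaries are illustrative; in practice they would come from exported
# evaluation data, and the alert would go to your notification channel.
weekly_summaries = [
    {"runAt": "2024-01-01", "overallScore": 4.31},
    {"runAt": "2024-01-08", "overallScore": 4.28},
    {"runAt": "2024-01-15", "overallScore": 4.23},
    {"runAt": "2024-01-22", "overallScore": 4.12},
    {"runAt": "2024-01-29", "overallScore": 4.02},
]

DRIFT_THRESHOLD = 0.2   # alert if the latest run has slipped this far from the first

baseline = weekly_summaries[0]["overallScore"]
latest = weekly_summaries[-1]["overallScore"]

if baseline - latest > DRIFT_THRESHOLD:
    print(f"Quality drift detected: {baseline:.2f} -> {latest:.2f} "
          f"({weekly_summaries[0]['runAt']} to {weekly_summaries[-1]['runAt']})")
```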