# Test Datasets
Create and manage test datasets for systematic prompt evaluation, regression testing, and quality assurance in PromptReports.
## What are Test Datasets?
Test datasets are collections of input-output pairs used to systematically evaluate prompt quality. They form the foundation of professional prompt testing in PromptReports, enabling reproducible evaluations, regression detection, and data-driven prompt optimization.
A well-designed test dataset provides:
- **Structured Testing**: Run prompts against consistent, predefined inputs for reliable comparison across versions.
- **Regression Prevention**: Detect quality drops before they reach production by testing against established baselines.
- **Categorized Analysis**: Filter and analyze results by test case categories, difficulty levels, or custom tags.
- **Batch Processing**: Evaluate hundreds or thousands of cases in a single automated run.
**Datasets vs. Playgrounds:** Playgrounds are for quick, one-off experimentation; datasets give you the consistent, repeatable inputs needed for systematic comparison over time.
## Creating Datasets
PromptReports offers multiple ways to create test datasets, each suited to different workflows:
- **Manual Entry**: Add test cases one at a time; best for small, carefully curated datasets.
- **CSV Import**: Bulk-load cases from spreadsheet exports (see Importing Data below).
- **From Execution History**: Sample real production inputs (see From Execution History below).
- **AI-Generated**: Have AI draft diverse cases for you to review (see AI-Generated Datasets below).
- **Public Templates**: Start from shared templates for common use cases.

**Start Small, Grow Intentionally:** Begin with a small set of representative cases and expand deliberately as you discover gaps; a compact, high-quality dataset beats a large, noisy one.
## Dataset Structure
Each dataset consists of rows (test cases) with the following fields:
| Field | Required | Description |
|---|---|---|
| Input Variables | Yes | JSON object with values for each prompt variable (must match your template) |
| Expected Output | No | Reference output for automated comparison and scoring |
| Category | No | Classification label for filtered analysis (e.g., "complaints", "inquiries") |
| Tags | No | Additional labels for fine-grained filtering (e.g., "edge-case", "high-priority") |
| Difficulty | No | Easy, Medium, or Hard classification for weighted analysis |
| Priority | No | Numeric weight (1-5) for importance in aggregate scoring |
| Notes | No | Free-form text for context or special handling instructions |
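If you prepare rows outside PromptReports, a quick schema check catches malformed cases before upload. Here is a minimal Python sketch (not a PromptReports API; the field names simply follow the table above and the JSON example below):

```python
# Sketch: validate one test-case row against the schema in the table above.
# Plain Python; field names mirror the documentation's JSON example.

ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}

def validate_row(row: dict, prompt_variables: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []

    # "Input Variables" is the only required field and must cover the template.
    input_vars = row.get("inputVars")
    if not isinstance(input_vars, dict):
        problems.append("inputVars must be a JSON object")
    else:
        missing = prompt_variables - input_vars.keys()
        if missing:
            problems.append(f"missing prompt variables: {sorted(missing)}")

    # Optional fields are checked only when present.
    if "difficulty" in row and row["difficulty"] not in ALLOWED_DIFFICULTIES:
        problems.append("difficulty must be easy, medium, or hard")
    if "priority" in row and row["priority"] not in (1, 2, 3, 4, 5):
        problems.append("priority must be an integer from 1 to 5")
    if "tags" in row and not (isinstance(row["tags"], list)
                              and all(isinstance(t, str) for t in row["tags"])):
        problems.append("tags must be a list of strings")

    return problems
```

A complete example row using these fields: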
```json
{
  "inputVars": {
    "customer_message": "I need to return this product, it's been 3 weeks",
    "customer_type": "premium",
    "product_category": "electronics",
    "order_id": "ORD-2024-98765"
  },
  "expectedOutput": "I understand you'd like to return your electronics item. As a premium customer, you have a 90-day return window, so you're still well within the return period. I'll process this return for order ORD-2024-98765 right away...",
  "category": "returns",
  "tags": ["premium", "within-policy", "electronics"],
  "difficulty": "easy",
  "priority": 3,
  "notes": "Standard return request, should reference return policy timeframe"
}
```

## Importing Data
Import datasets from external files for bulk test case creation. PromptReports supports multiple formats:
- **CSV Files**: Standard comma-separated values with headers; the most common format.
- **JSON**: An array of objects with key-value pairs; best for complex nested data.
- **Excel**: Upload .xlsx files directly, with automatic sheet detection.
```csv
customer_message,customer_type,product_category,expected_output,category,difficulty
"I need a refund",standard,clothing,"We can process your refund within 3-5 business days...",refunds,easy
"When will my order arrive?",premium,electronics,"Let me check your order status. As a premium member...",shipping,easy
"This item is broken and I want to speak to a manager",standard,furniture,"I'm so sorry about the damage. I'll connect you with...",complaints,hard
"Can I change my delivery address?",standard,clothing,"Yes, I can update your delivery address...",modifications,medium
```

Column names should match your prompt variables. The system will automatically map them during import, and you can manually adjust mappings if column names don't exactly match your variable names.
**CSV Encoding:** Save CSV files as UTF-8. Other encodings can corrupt non-ASCII characters (accents, currency symbols, emoji) in your test inputs.
## AI-Generated Datasets
PromptReports can automatically generate diverse test cases using AI. This is especially useful when you need to quickly build out test coverage or explore edge cases you might not have considered.
Generation follows four steps:

- **Describe Your Use Case**: Tell the generator what your prompt does and who its users are.
- **Specify Categories**: Choose which kinds of cases to generate (typical, edge cases, adversarial, multi-language, and so on).
- **Set Distribution**: Decide how many cases each category should contain.
- **Review and Curate**: Inspect the generated cases and keep only those worth testing.
An example generation configuration:

```json
{
  "promptDescription": "Customer support chatbot for e-commerce returns and refunds",
  "categories": {
    "typical": 20,
    "edge_cases": 10,
    "adversarial": 5,
    "multi_language": 5
  },
  "constraints": [
    "Include various customer sentiment levels",
    "Mix premium and standard customers",
    "Cover all product categories",
    "Include some requests that should be declined"
  ]
}
```

**Combine Methods:** AI generation works best alongside the other creation methods: seed the dataset with real cases from execution history, use generation to fill coverage gaps, and curate everything manually before it lands in the dataset.
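Generated cases often overlap, which is one reason the review step matters. As an illustration (a local sketch, not a PromptReports feature), you could flag near-duplicate generated inputs before adding them to a dataset:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivially reworded inputs collide."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()

def flag_near_duplicates(cases: list[dict]) -> list[dict]:
    """Return the cases whose normalized inputs were already seen earlier."""
    seen: set[str] = set()
    duplicates = []
    for case in cases:
        key = normalize(" ".join(str(v) for v in case["inputVars"].values()))
        if key in seen:
            duplicates.append(case)
        else:
            seen.add(key)
    return duplicates
```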
## From Execution History
Build test datasets from real production data by sampling from your prompt's execution history. This ensures your test cases reflect actual usage patterns.
- **Real-World Coverage**: Test cases based on actual user inputs, not hypothetical scenarios.
- **Smart Sampling**: Automatically sample diverse inputs across categories and time periods.
- **Capture Good Examples**: Mark high-quality outputs as expected results for future testing.
- **Learn from Failures**: Include past failures as test cases to prevent regression.
To create a dataset from history:
- Navigate to your prompt's Evaluation tab
- Click New Dataset and select From History
- Choose a date range and sampling strategy (random, stratified by category, or focused on specific outcomes); stratified sampling is sketched after this list
- Review the selected executions and optionally mark outputs as expected results
- Save the dataset with a descriptive name
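Of the sampling strategies in step 3, stratified sampling is worth a closer look: it keeps low-volume categories represented instead of letting high-traffic ones dominate. A minimal sketch of the idea (plain Python, not the PromptReports implementation; `executions` is assumed to be a list of dicts carrying a `category` key):

```python
import random
from collections import defaultdict

def stratified_sample(executions: list[dict], per_category: int) -> list[dict]:
    """Sample up to `per_category` executions from each category."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for ex in executions:
        by_category[ex.get("category", "uncategorized")].append(ex)

    sample = []
    for category, items in by_category.items():
        # Take everything if the category has fewer items than requested.
        sample.extend(random.sample(items, min(per_category, len(items))))
    return sample
```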
## Dataset Versioning
Like prompts, datasets are versioned to maintain reproducibility and enable historical comparisons. When you modify a dataset, PromptReports creates a new version while preserving the original.
| Action | Creates New Version | Notes |
|---|---|---|
| Add test cases | Yes | New rows appear in the latest version only |
| Edit existing cases | Yes | Modified rows are tracked with change history |
| Delete cases | Yes | Deleted rows are soft-deleted and retained in version history |
| Reorder cases | No | Order changes don't affect reproducibility |
| Update tags/metadata | No | Metadata changes are applied immediately across versions |
| Change settings | No | Configuration changes apply to all future evaluations |
- **Version History**: View all versions of a dataset and see exactly what changed between them.
- **Pin to Version**: Lock evaluations to a specific dataset version for reproducible results.
**Pinning for Regression Testing:** Pin regression suites to a fixed dataset version so that score changes reflect prompt changes, not dataset drift. Re-pin to a newer version only when you deliberately update the baseline.
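Conceptually, a pinned evaluation stores a (dataset, version) pair instead of a bare dataset reference. A small sketch of that shape (names are illustrative, not the PromptReports API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationConfig:
    prompt_id: str
    dataset_id: str
    # None means "always use the latest version" (convenient, but not
    # reproducible); a concrete version number pins the evaluation.
    dataset_version: int | None = None

# A pinned regression suite: re-running it always sees the same test cases.
regression_suite = EvaluationConfig(
    prompt_id="support-bot-v3",
    dataset_id="returns-and-refunds",
    dataset_version=7,
)
```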
## Managing Datasets
PromptReports provides comprehensive tools for organizing and maintaining your test datasets:
Filter & Search
Find specific test cases using tags, categories, or full-text search across inputs and outputs.
Clone Datasets
Duplicate existing datasets as a starting point for variations or experiments.
Bulk Operations
Add, remove, tag, or update multiple test cases at once with batch actions.
Dataset Settings
Configure default metrics, scoring weights, and evaluation parameters per dataset.
Export Datasets
Download datasets as CSV or JSON for backup, sharing, or external analysis.
Sync with Source
Re-import updated data from connected external sources automatically.
Each dataset can be associated with one or more prompts. You can also create "shared" datasets that apply across multiple prompts—useful when testing prompts that should handle similar types of inputs consistently.
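Scoring weights interact with the per-case `priority` field from the structure table: higher-priority cases should move the aggregate score more. As an illustration of priority-weighted aggregation (a local sketch, not necessarily how PromptReports computes it):

```python
def weighted_average(results: list[dict]) -> float:
    """Aggregate per-case scores, weighting each case by its priority (1-5).

    Each result is assumed to look like {"score": 0.0-1.0, "priority": 1-5};
    cases without a priority default to 1.
    """
    total_weight = sum(r.get("priority", 1) for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r["score"] * r.get("priority", 1) for r in results) / total_weight

# A high-priority failure drags the aggregate down more than a low-priority one.
print(weighted_average([
    {"score": 1.0, "priority": 1},
    {"score": 0.2, "priority": 5},
]))  # -> 0.333...
```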
## Best Practices
Follow these guidelines for effective test datasets:
- **Cover the Full Spectrum**: Include typical cases, edge cases, and rare-but-important scenarios, not just the happy path.
- **Use Representative Data**: Base test cases on real usage, such as sampled execution history, rather than purely invented inputs.
- **Include Adversarial Cases**: Add prompt-injection attempts, off-topic requests, and inputs your prompt should decline.
- **Categorize Thoroughly**: Assign categories, tags, and difficulty levels so results can be sliced during analysis.
- **Document Expected Outputs**: Provide reference outputs wherever practical to enable automated comparison and scoring.
- **Maintain Actively**: Prune stale cases, add new failures as test cases, and keep the dataset aligned with actual usage.
- **Avoid Overfitting**: Don't tune a prompt until it merely memorizes your test set; hold out some cases to check generalization (see the sketch after this list).
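A simple guard against overfitting is a holdout split: iterate on the prompt against one portion of the dataset and score the rest only at milestones. A sketch in plain Python (not a PromptReports feature):

```python
import random

def split_holdout(rows: list[dict], holdout_fraction: float = 0.2,
                  seed: int = 42) -> tuple[list[dict], list[dict]]:
    """Deterministically shuffle rows and split into (dev, holdout) sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

# Example: iterate on your prompt against `dev`, and score against
# `holdout` only at milestones to check generalization.
rows = [{"inputVars": {"customer_message": f"case {i}"}} for i in range(50)]
dev, holdout = split_holdout(rows)
print(len(dev), len(holdout))  # 40 10
```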
With well-structured datasets in place, you're ready to start running evaluations to measure and improve prompt quality systematically.