Test Datasets

Create and manage test datasets for systematic prompt evaluation, regression testing, and quality assurance in PromptReports.

What are Test Datasets?

Test datasets are collections of input-output pairs used to systematically evaluate prompt quality. They form the foundation of professional prompt testing in PromptReports, enabling reproducible evaluations, regression detection, and data-driven prompt optimization.

A well-designed test dataset provides:

Structured Testing

Run prompts against consistent, predefined inputs for reliable comparison across versions.

Regression Prevention

Detect quality drops before they reach production by testing against established baselines.

Categorized Analysis

Filter and analyze results by test case categories, difficulty levels, or custom tags.

Batch Processing

Evaluate hundreds or thousands of cases in a single automated run.

Creating Datasets

PromptReports offers multiple ways to create test datasets, each suited to different workflows:

1. Manual Entry

Add test cases one by one through the UI. Navigate to your prompt, select the Evaluation tab, click New Dataset, then add rows individually. Best for small, carefully curated datasets where each case is deliberately chosen.

2. CSV Import

Upload test cases from spreadsheets or external sources. Simply drag and drop a CSV file or click to upload. Column names are automatically mapped to prompt variables. Best for bulk data migration or when data exists in other systems.

3. From Execution History

Convert past Playground executions or production runs into test cases. PromptReports can sample from your execution history, including both inputs and actual outputs. Great for capturing real-world examples and building regression test suites.

4. AI-Generated

Use AI to generate synthetic test cases based on your prompt description and requirements. Specify the types of inputs you want (edge cases, typical usage, adversarial inputs) and let the system create diverse, representative test data automatically.

5. Public Templates

Start from community-created dataset templates and customize for your needs. Browse templates by category (customer support, content generation, analysis, etc.) and copy them to your workspace.

Dataset Structure

Each dataset consists of rows (test cases) with the following fields:

| Field | Required | Description |
| --- | --- | --- |
| Input Variables | Yes | JSON object with values for each prompt variable (must match your template) |
| Expected Output | No | Reference output for automated comparison and scoring |
| Category | No | Classification label for filtered analysis (e.g., "complaints", "inquiries") |
| Tags | No | Additional labels for fine-grained filtering (e.g., "edge-case", "high-priority") |
| Difficulty | No | Easy, Medium, or Hard classification for weighted analysis |
| Priority | No | Numeric weight (1-5) for importance in aggregate scoring |
| Notes | No | Free-form text for context or special handling instructions |
Example Dataset Row (JSON):

```json
{
  "inputVars": {
    "customer_message": "I need to return this product, it's been 3 weeks",
    "customer_type": "premium",
    "product_category": "electronics",
    "order_id": "ORD-2024-98765"
  },
  "expectedOutput": "I understand you'd like to return your electronics item. As a premium customer, you have a 90-day return window, so you're still well within the return period. I'll process this return for order ORD-2024-98765 right away...",
  "category": "returns",
  "tags": ["premium", "within-policy", "electronics"],
  "difficulty": "easy",
  "priority": 3,
  "notes": "Standard return request, should reference return policy timeframe"
}
```

Importing Data

Import datasets from external files for bulk test case creation. PromptReports supports multiple formats:

CSV Files

Standard comma-separated values with headers. Most common format.

JSON

Array of objects with key-value pairs. Best for complex nested data; a sample appears after the CSV example below.

Excel

Upload .xlsx files directly with automatic sheet detection.

Example CSV Format:

```csv
customer_message,customer_type,product_category,expected_output,category,difficulty
"I need a refund",standard,clothing,"We can process your refund within 3-5 business days...",refunds,easy
"When will my order arrive?",premium,electronics,"Let me check your order status. As a premium member...",shipping,easy
"This item is broken and I want to speak to a manager",standard,furniture,"I'm so sorry about the damage. I'll connect you with...",complaints,hard
"Can I change my delivery address?",standard,clothing,"Yes, I can update your delivery address...",modifications,medium
```

Column names should match your prompt variables. The system will automatically map them during import. You can also manually adjust mappings if column names don't exactly match your variable names.
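For JSON imports, the file is an array of objects whose keys play the role of the CSV columns above. As a reference point, the first two CSV rows could be expressed like this (the keys shown are assumed to match the column names; adjust them if your variable names or mappings differ):

```json
[
  {
    "customer_message": "I need a refund",
    "customer_type": "standard",
    "product_category": "clothing",
    "expected_output": "We can process your refund within 3-5 business days...",
    "category": "refunds",
    "difficulty": "easy"
  },
  {
    "customer_message": "When will my order arrive?",
    "customer_type": "premium",
    "product_category": "electronics",
    "expected_output": "Let me check your order status. As a premium member...",
    "category": "shipping",
    "difficulty": "easy"
  }
]
```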

AI-Generated Datasets

PromptReports can automatically generate diverse test cases using AI. This is especially useful when you need to quickly build out test coverage or explore edge cases you might not have considered.

1. Describe Your Use Case

Provide a description of what your prompt does and the types of inputs it handles.

2. Specify Categories

Define the categories of test cases you want: typical usage, edge cases, adversarial inputs, etc.

3. Set Distribution

Choose how many test cases to generate and the distribution across categories.

4. Review and Curate

Review generated cases, remove irrelevant ones, and edit as needed before saving.
AI Generation Request:

```json
{
  "promptDescription": "Customer support chatbot for e-commerce returns and refunds",
  "categories": {
    "typical": 20,
    "edge_cases": 10,
    "adversarial": 5,
    "multi_language": 5
  },
  "constraints": [
    "Include various customer sentiment levels",
    "Mix premium and standard customers",
    "Cover all product categories",
    "Include some requests that should be declined"
  ]
}
```

From Execution History

Build test datasets from real production data by sampling from your prompt's execution history. This ensures your test cases reflect actual usage patterns.

Real-World Coverage

Test cases based on actual user inputs, not hypothetical scenarios.

Smart Sampling

Automatically sample diverse inputs across categories and time periods.

Capture Good Examples

Mark high-quality outputs as expected results for future testing.

Learn from Failures

Include past failures as test cases to prevent regression.

To create a dataset from history:

  • Navigate to your prompt's Evaluation tab
  • Click New Dataset and select From History
  • Choose a date range and sampling strategy (random, stratified by category, or focused on specific outcomes), as sketched after this list
  • Review the selected executions and optionally mark outputs as expected results
  • Save the dataset with a descriptive name
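The exact interface may vary, but conceptually those choices boil down to a small selection spec. The sketch below is purely illustrative; the field names are hypothetical and not a documented PromptReports schema:

```jsonc
{
  // Hypothetical sketch of a "From History" selection; all field names are illustrative.
  "source": "execution_history",
  "dateRange": { "from": "2024-01-01", "to": "2024-03-31" },
  "samplingStrategy": "stratified_by_category",  // alternatives: "random", "focused_on_outcomes"
  "sampleSize": 200,
  "markOutputsAsExpected": true                  // keep high-quality outputs as expected results
}
```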

Dataset Versioning

Like prompts, datasets are versioned to maintain reproducibility and enable historical comparisons. When you modify a dataset, PromptReports creates a new version while preserving the original.

| Action | Creates New Version | Notes |
| --- | --- | --- |
| Add test cases | Yes | New rows appear in the latest version only |
| Edit existing cases | Yes | Modified rows are tracked with change history |
| Delete cases | Yes | Deleted rows are soft-deleted and retained in version history |
| Reorder cases | No | Order changes don't affect reproducibility |
| Update tags/metadata | No | Metadata changes are applied immediately across versions |
| Change settings | No | Configuration changes apply to all future evaluations |

Version History

View all versions of a dataset and see exactly what changed between them.

Pin to Version

Lock evaluations to a specific dataset version for reproducible results.
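To make the idea concrete, a pinned evaluation conceptually records both the dataset and the exact version it should run against. The fields below are hypothetical and shown only to illustrate the principle:

```jsonc
{
  // Hypothetical evaluation configuration; field names are illustrative only.
  "promptId": "customer-support-returns",
  "promptVersion": 12,
  "datasetId": "returns-regression-suite",
  "datasetVersion": 4,   // pinned: every re-run uses the same rows from version 4
  "metrics": ["accuracy", "tone"]
}
```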

Managing Datasets

PromptReports provides comprehensive tools for organizing and maintaining your test datasets:

Filter & Search

Find specific test cases using tags, categories, or full-text search across inputs and outputs.

Clone Datasets

Duplicate existing datasets as a starting point for variations or experiments.

Bulk Operations

Add, remove, tag, or update multiple test cases at once with batch actions.

Dataset Settings

Configure default metrics, scoring weights, and evaluation parameters per dataset. A sketch appears at the end of this section.

Export Datasets

Download datasets as CSV or JSON for backup, sharing, or external analysis.

Sync with Source

Re-import updated data from connected external sources automatically.

Each dataset can be associated with one or more prompts. You can also create "shared" datasets that apply across multiple prompts—useful when testing prompts that should handle similar types of inputs consistently.
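As a loose sketch of what per-dataset settings can capture (the field names and values here are hypothetical, not a documented schema), default metrics and weights might look like this:

```jsonc
{
  // Hypothetical per-dataset settings; field names and values are illustrative only.
  "defaultMetrics": ["relevance", "accuracy", "tone"],
  "scoringWeights": {
    "usePriority": true,                              // weight aggregate scores by each row's 1-5 priority
    "difficulty": { "easy": 1.0, "medium": 1.5, "hard": 2.0 }
  },
  "passThreshold": 0.8                                // minimum aggregate score for a passing run
}
```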

Best Practices

Follow these guidelines for effective test datasets:

1. Cover the Full Spectrum

Include easy, medium, and hard cases. Don't just test happy paths; include edge cases, boundary conditions, and potentially problematic inputs that might cause unexpected behavior.

2. Use Representative Data

Your test data should reflect real-world usage patterns. If 60% of production queries are simple inquiries, your dataset should have similar proportions. Use execution history to ensure representativeness.

3. Include Adversarial Cases

Add test cases specifically designed to break your prompt: ambiguous inputs, contradictory requests, attempts to bypass instructions, and inputs in unexpected formats.

4. Categorize Thoroughly

Use categories and tags liberally to enable filtered analysis. Common dimensions: category, difficulty, source, language, customer type, and domain-specific labels.

5. Document Expected Outputs

When possible, include expected outputs or at least expected characteristics. This enables automated scoring and makes it clear what "correct" looks like for each case.

6. Maintain Actively

Review and update datasets regularly. Remove outdated cases, update expected outputs when requirements change, and add new cases when you discover failure modes.

With well-structured datasets in place, you're ready to start running evaluations to measure and improve prompt quality systematically.