130 changes: 83 additions & 47 deletions src/pages/docs/dataset/features/experiments.mdx
---
title: "Experiments in Dataset"
description: "To test, validate, and compare different prompt configurations"
description: "Test, validate, and compare prompt and agent configurations side by side"
---

## About

Experiments give you a structured way to answer questions like: *Which prompt performs better? Which model gives the best results? Does my agent beat my prompt for this task?* You import prompts and agents, run them across multiple model and parameter configurations on the same dataset, score the outputs with evals, and compare results side by side so you can make data-driven decisions instead of guessing.

## When to use

- **Compare prompts and agents**: Pull prompts from the [Prompt](/docs/prompt) section and agents from the [Agent Playground](/docs/agent-playground) into the same experiment and see which produces better outputs.
- **Compare models and parameters**: Add the same prompt with multiple models, temperatures, or tool configs to compare quality, latency, and cost across configurations.
- **Validate before rollout**: Test a prompt or agent change on a dataset before promoting it to production.
- **Optimize with evals**: Attach built-in or custom evals and use scores to rank prompt/agent-model combinations and pick a winner.
- **Iterate fast**: Stop a long run, edit a single config, or rerun just the failed cells without restarting the whole experiment.

## How to

Experiment creation is a guided three-step flow: **Basic Info → Configuration → Evaluations**. Each step validates before you can move forward, and you can jump back to any completed step to edit it.

<Steps>
<Step title="Navigate to Experiments">
Click the "Experiments" button (e.g. in the top-right on the dataset dashboard) to open experiments for this dataset.
Open the dataset and click the **Experiments** button in the top-right of the dataset dashboard.
![Experiments](/screenshot/product/dataset/how-to/experiments-in-dataset/1.png)
</Step>

<Step title="Create a new experiment">
Give the experiment a name and select the **base column** – the column whose generated responses you want to compare (e.g. an existing run-prompt column). All experiment runs will be evaluated and compared against this baseline.
<Step title="Step 1: Basic Info">
Give the experiment a **name** and pick the **experiment type**.

The name is pre-filled with an auto-suggested name based on your dataset. Accept it as-is or overwrite it with your own. Names must be unique within the dataset.

Pick the experiment type that matches the task you're testing:

<Tabs>
<Tab title="LLM" icon="robot">
Use **LLM** for text generation. You can import prompts *and* agents in the same experiment.
</Tab>
<Tab title="Text-to-Speech (TTS)" icon="microphone">
Use **TTS** to generate audio from text. Add prompts with different voices, models, and parameters to compare.
</Tab>
<Tab title="Speech-to-Text (STT)" icon="page">
Use **STT** to transcribe audio. Each prompt configuration must point at a dataset column containing the input audio.
</Tab>
<Tab title="Image Generation" icon="image">
Use **Image Generation** to create images from text (or text + image). Compare image models and prompts side by side.
</Tab>
</Tabs>

![Create Experiment](/screenshot/product/dataset/how-to/experiments-in-dataset/2.png)
</Step>

<Step title="Prompt template">
In the prompt template section, define the prompts and models for the experiment. You can add multiple prompt templates; each can use one or more models so you compare many combinations.
![Prompt Template](/screenshot/product/dataset/how-to/experiments-in-dataset/3.png)
<Step title="Step 2: Configuration">
Set up the prompt and model configurations you want to compare. Each configuration becomes a separate column in the experiment grid.
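As a rough mental model, the configurations are the cross product of your prompts and the models attached to each one. The sketch below is illustrative only; the names and parameter fields are assumptions, not the product's API.

```python
from itertools import product

# Illustrative sketch: one imported prompt with three attached models.
# Names and parameter fields here are assumptions, not the product's API.
prompts = ["support-reply-v3"]
models = ["gpt-4o", "claude-sonnet", "llama-3-70b"]
params = {"temperature": 0.2, "max_tokens": 512, "top_p": 1.0}

# Each (prompt, model) pair becomes its own configuration, i.e. one grid column.
configurations = [
    {"prompt": p, "model": m, **params} for p, m in product(prompts, models)
]

print(len(configurations))  # 3 -> one prompt with three models yields three columns
```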


<Tabs>
<Tab title="LLM" icon="robot">
For LLM experiments, click **Add Prompt/Agents** to import a prompt or agent. You can mix prompts and agents in the same experiment and score them against the same evals.

- **Prompts**: pick a prompt from the [Prompt](/docs/prompt) section, select a published version, then attach **one or more models**. Each (prompt, model) pair becomes its own configuration, so adding three models to one prompt creates three columns to compare. For each model you can tune temperature, max tokens, top-p, response format, and tool config.
- **Agents**: pick an agent from the [Agent Playground](/docs/agent-playground) and select a published version. The agent's model, tools, and graph are captured at that version, so the run stays reproducible even if the agent is edited later. You don't pick a model again here.
![LLM](/screenshot/product/dataset/how-to/experiments-in-dataset/3.png)
</Tab>
<Tab title="Text-to-Speech (TTS)" icon="microphone">
For each prompt, write the instructions inline (use `{{column_name}}` to reference dataset columns) and attach one or more **TTS models** (with voice and format settings). Click **+ Add Prompt** to add more prompt entries. Each (prompt, model) pair becomes its own column. Output format is fixed to Audio.
![TTS](/screenshot/product/dataset/how-to/experiments-in-dataset/4.png)
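The `{{column_name}}` placeholders are filled in per row from the dataset before the model is called. Here is a minimal sketch of that substitution, with hypothetical column names (the product's own renderer may differ):

```python
import re

# Hypothetical dataset row and prompt template; column names are made up for illustration.
row = {"product_name": "Acme Kettle", "script": "Welcome to the Acme Kettle setup guide."}
template = "Read this script for {{product_name}} in a friendly tone: {{script}}"

def render(template: str, row: dict) -> str:
    # Replace each {{column_name}} with the value from the current dataset row.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row.get(m.group(1), "")), template)

print(render(template, row))
# Read this script for Acme Kettle in a friendly tone: Welcome to the Acme Kettle setup guide.
```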
</Tab>
<Tab title="Speech-to-Text (STT)" icon="page">
For each prompt, write the instructions inline (use `{{column_name}}` to reference dataset columns), pick the dataset column containing the input audio, and attach one or more **STT models**. Click **+ Add Prompt** to add more entries to compare transcription quality.
![STT](/screenshot/product/dataset/how-to/experiments-in-dataset/5.png)
</Tab>
<Tab title="Image Generation" icon="image">
For each prompt, write the instructions inline (use `{{column_name}}` to reference dataset columns) and attach one or more **image models**. Click **+ Add Prompt** to add more entries and compare output quality across models and parameters.
![Image Generation](/screenshot/product/dataset/how-to/experiments-in-dataset/6.png)
</Tab>
<Tab title="Custom models" icon="gear">
Models you've added through Custom Models show up in the model picker for prompt configurations across all experiment types.
<Tip>
See [Custom Models](/docs/evaluation/features/custom-models) for how to register a custom or self-hosted model.
</Tip>
</Tab>
</Tabs>

For prompts, you can also configure **tool calling** with **Auto**, **Required**, or **None**, and add tool definitions the model can invoke.
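To make the tool-calling options concrete, here is an illustrative configuration using the widely used OpenAI-style function schema; the `lookup_order` tool is hypothetical and the exact shape the UI stores may differ.

```python
# Illustrative tool-calling config; the lookup_order tool is hypothetical and the
# schema follows the common OpenAI-style function format, not necessarily the product's.
tool_config = {
    "tool_choice": "auto",  # "auto" | "required" | "none" correspond to Auto / Required / None
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "lookup_order",
                "description": "Fetch an order by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
        }
    ],
}
```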
</Step>

<Step title="Choosing evals">
Experiments compare prompt–model performance using evals. Add the evals you want to run on the generated responses.
<Step title="Step 3: Evaluations">
The final step has two parts: an optional **base column** and the **evals** you want to score outputs with.

**Compare against baseline (optional)**: pick a column from the dataset to compare model outputs against (typically a ground-truth or existing run-prompt column). Skip it if you don't have a reference output yet; you can still run the experiment, attach evals that don't need a baseline, and add a base column later by editing the experiment.

**Add evaluations**: click **Add Evaluation** and pick from the [built-in eval](/docs/evaluation/builtin) catalog or [create a custom eval](/docs/evaluation/features/custom). Add as many as you need. Every eval runs on every configuration so the results are directly comparable.
![Choosing Evals](/screenshot/product/dataset/how-to/experiments-in-dataset/7.png)

For each eval, map its inputs (e.g. `output`, `input`, `expected`) to the model output or to dataset columns. Mapping is required before the experiment can run.
![Choosing Evals](/screenshot/product/dataset/how-to/experiments-in-dataset/8.png)
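Conceptually, the mapping tells each eval where to read its inputs from: the generated output, a dataset column, or the base column. A minimal sketch with hypothetical names (not the product's schema):

```python
# Hypothetical mapping: eval input names on the left, data sources on the right.
eval_mapping = {
    "output": "{{model_output}}",        # the generated response from the configuration
    "input": "{{question}}",             # a dataset column used as the eval's input
    "expected": "{{reference_answer}}",  # the base column, if you chose one
}

def resolve_inputs(mapping: dict, row: dict, model_output: str) -> dict:
    # Substitute the model output and dataset column values into the eval's inputs.
    values = {**row, "model_output": model_output}
    return {key: values[ref.strip("{}")] for key, ref in mapping.items()}

row = {"question": "What is 2 + 2?", "reference_answer": "4"}
print(resolve_inputs(eval_mapping, row, model_output="4"))
# {'output': '4', 'input': 'What is 2 + 2?', 'expected': '4'}
```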
Click "Add Evaluation" and pick from [existing eval](/docs/evaluation/builtin) templates or [create a custom eval](/docs/evaluation/features/custom). You can add as many evals as you want.
![Choosing Evals](/screenshot/product/dataset/how-to/experiments-in-dataset/9.png)
</Step>

<Step title="Run experiment">
After configuring prompts, models, and evals, click "Run" to start the experiment. The system will generate responses for each prompt–model pair, run the evals, and show results and comparisons when complete.
<Step title="Run the experiment">
Click **Run** to start. The experiment processes every row across every prompt/agent-model configuration in parallel, running the evals on each output as it arrives. The grid streams results live so you can watch progress without waiting for the whole run to finish.
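A rough sketch of how that fan-out works: every (row, configuration) cell is generated concurrently and evaluated as soon as its output arrives. The `generate` call and the eval below are stand-in stubs, not the product's implementation.

```python
import asyncio

async def generate(config: dict, row: dict) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real model call
    return f"{config['name']} answer for row {row['id']}"

async def length_eval(output: str, row: dict) -> float:
    return float(len(output))  # trivial stand-in eval

EVALS = [length_eval]

async def run_cell(row: dict, config: dict) -> dict:
    output = await generate(config, row)
    scores = await asyncio.gather(*(ev(output, row) for ev in EVALS))
    return {"row": row["id"], "config": config["name"], "scores": list(scores)}

async def run_experiment(rows: list[dict], configs: list[dict]) -> list[dict]:
    # One task per (row, configuration) cell; results stream in as each cell finishes.
    cells = [run_cell(r, c) for r in rows for c in configs]
    return [await task for task in asyncio.as_completed(cells)]

rows = [{"id": 1}, {"id": 2}]
configs = [{"name": "prompt-A/gpt-4o"}, {"name": "prompt-A/claude"}]
print(asyncio.run(run_experiment(rows, configs)))
```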
</Step>

<Step title="Stop a running experiment">
If you spot a misconfiguration or want to abort, click **Stop** on a running experiment from the Experiments tab. Any in-flight cells are marked as errored, and you can then edit the experiment and rerun without waiting for the full run to complete.
</Step>

<Step title="Update and re-run">
You can change the experiment at any time: edit the name, base column, prompt templates, models, or evals, then save. Use **Re-run** to run the experiment again with the same or updated config (e.g. after adding rows to the dataset or changing a prompt). Re-run processes all rows again and refreshes the experiment dataset results.
![Update](/screenshot/product/dataset/how-to/experiments-in-dataset/10.png)
<Step title="Edit and rerun">
Use **Rerun Experiment** to run the experiment again after editing prompts, models, evals, or the base column. Rerunning is granular: only the configurations you actually changed are re-executed, and results from untouched configurations are preserved.

For more targeted reruns:

- **Rerun a single cell**: hover any output or eval cell in the grid and click the rerun icon. Useful when one row failed transiently or you've tweaked a single configuration.
- **Rerun a column**: from the column header, choose **Run all cells in the column** or **Run only failed cells in the column**. Failed-only is the fastest way to recover from API hiccups without redoing successful work.
- **Rerun an eval**: re-execute a single eval across all rows after changing its config or mapping, without re-generating any model outputs.

![Update](/screenshot/product/dataset/how-to/experiments-in-dataset/9.png)
</Step>

<Step title="Compare results">
When the experiment has finished, use the **Compare** (or comparison) view to see how each prompt–model combination performed. Set weights for eval scores and metrics (e.g. response time, token usage) to compute an overall ranking. The comparison shows which combination ranks best so you can choose a winner.
<Step title="Compare results and choose a winner">
Open the **Compare** view to see how every configuration performed. Set weights (0-10) for each eval score and for response time, completion tokens, and total tokens. The system normalizes the metrics, computes an overall rating per configuration, and ranks them so the winner is clear. Adjust the weights to match what matters for your use case (e.g. prioritize quality over cost) and the ranking updates in place.
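The overall rating is easiest to think of as a weighted sum over normalized metrics, where quality metrics count up and cost-like metrics (latency, tokens) count down. Here is a rough sketch of the idea; the exact normalization the product uses may differ.

```python
# Rough sketch of weighted ranking; the product's exact normalization may differ.
results = {
    "prompt-A/gpt-4o": {"accuracy": 0.92, "response_time": 2.1, "total_tokens": 950},
    "prompt-A/claude": {"accuracy": 0.88, "response_time": 1.4, "total_tokens": 700},
}
weights = {"accuracy": 8, "response_time": 3, "total_tokens": 2}  # weights from 0-10
lower_is_better = {"response_time", "total_tokens"}

def normalize(metric: str, values: dict) -> dict:
    # Min-max scale to [0, 1]; flip cost-like metrics so 1.0 always means "best".
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    scaled = {name: (v - lo) / span for name, v in values.items()}
    return {n: 1 - s for n, s in scaled.items()} if metric in lower_is_better else scaled

def rank(results: dict, weights: dict) -> dict:
    ratings = {name: 0.0 for name in results}
    for metric, weight in weights.items():
        norm = normalize(metric, {name: r[metric] for name, r in results.items()})
        for name in ratings:
            ratings[name] += weight * norm[name]
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))

print(rank(results, weights))  # highest overall rating first
```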
</Step>
</Steps>

## Tips

- **Use published versions**: experiments only run published prompt and agent versions. Publish the version you want to test before importing it.
- **Mix prompts and agents**: an **LLM** experiment can contain prompts and agents side by side, scored against the same evals. Useful when you're deciding whether an agent is worth the extra complexity over a prompt. TTS, STT, and Image experiments accept prompts only.
- **Failed-only rerun**: when transient failures (rate limits, network blips) leave a few cells errored, use the failed-only rerun on the column to recover them without redoing successful rows.

## Next Steps

<CardGroup cols={2}>