diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/2.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/2.png
index 73210f02..58e3c804 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/2.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/2.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/3.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/3.png
index 19ae84cb..a4cd484e 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/3.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/3.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/4.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/4.png
index a5e7602b..c617c25c 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/4.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/4.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/5.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/5.png
index 12ddf231..a239538d 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/5.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/5.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/6.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/6.png
index c10714ab..2eec4f4b 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/6.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/6.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/7.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/7.png
index 72fe6ad0..29510fed 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/7.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/7.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/8.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/8.png
index 1366a68e..c2b950b2 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/8.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/8.png differ
diff --git a/public/screenshot/product/dataset/how-to/experiments-in-dataset/9.png b/public/screenshot/product/dataset/how-to/experiments-in-dataset/9.png
index eebc3ab3..9952a0ff 100644
Binary files a/public/screenshot/product/dataset/how-to/experiments-in-dataset/9.png and b/public/screenshot/product/dataset/how-to/experiments-in-dataset/9.png differ
diff --git a/src/pages/docs/dataset/features/experiments.mdx b/src/pages/docs/dataset/features/experiments.mdx
index 5d81367f..c4387b1c 100644
--- a/src/pages/docs/dataset/features/experiments.mdx
+++ b/src/pages/docs/dataset/features/experiments.mdx
@@ -1,97 +1,133 @@
 ---
 title: "Experiments in Dataset"
-description: "To test, validate, and compare different prompt configurations"
+description: "Test, validate, and compare prompt and agent configurations side by side"
 ---
 
 ## About
 
-Experiments give you a structured way to answer questions like: *Which prompt performs better? Which model gives the best results for my use case?* You test different prompt and model combinations on the same dataset, score the outputs with evals, and compare results side by side so you can make data-driven decisions instead of guessing.
+Experiments give you a structured way to answer questions like: *Which prompt performs better? Which model gives the best results?
+Does my agent beat my prompt for this task?* You import prompts and agents, run them across multiple model and parameter configurations on the same dataset, score the outputs with evals, and compare results side by side so you can make data-driven decisions instead of guessing.
 
 ## When to use
 
-- **Compare prompts**: Run different prompt templates on the same rows and see which produces better answers or scores.
-- **Compare models**: Run the same prompt with multiple models (or custom models) and compare quality, speed, or cost.
-- **Validate before rollout**: Test prompt and model changes on a dataset before using them in production.
-- **Optimize with evals**: Add built-in or custom evals and use scores to rank prompt/model combinations and pick a winner.
+- **Compare prompts and agents**: Pull prompts from the [Prompt](/docs/prompt) section and agents from the [Agent Playground](/docs/agent-playground) into the same experiment and see which produces better outputs.
+- **Compare models and parameters**: Add the same prompt with multiple models, temperatures, or tool configs to compare quality, latency, and cost across configurations.
+- **Validate before rollout**: Test a prompt or agent change on a dataset before promoting it to production.
+- **Optimize with evals**: Attach built-in or custom evals and use scores to rank prompt/agent-model combinations and pick a winner.
+- **Iterate fast**: Stop a long run, edit a single config, or rerun just the failed cells without restarting the whole experiment.
 
 ## How to
 
-You pick a **base column** (the generated responses you want to compare against), add one or more **prompt templates** (each with one or more models), attach **evals**, and run. The system generates responses for each prompt–model pair, runs the evals, and surfaces scores and comparisons so you can choose the best setup.
+Experiment creation is a guided three-step flow: **Basic Info → Configuration → Evaluations**.
+Each step validates before you can move forward, and you can jump back to any completed step to edit it.
 
-  Click the "Experiments" button (e.g. in the top-right on the dataset dashboard) to open experiments for this dataset.
+  Open the dataset and click the **Experiments** button in the top-right of the dataset dashboard.
   ![Experiments](/screenshot/product/dataset/how-to/experiments-in-dataset/1.png)
 
-  Give the experiment a name and select the **base column** – the column whose generated responses you want to compare (e.g. an existing run-prompt column). All experiment runs will be evaluated and compared against this baseline.
+  Give the experiment a **name** and pick the **experiment type**.
+
+  The name is pre-filled with an auto-suggested value based on your dataset. Accept it as-is or overwrite it with your own. Names must be unique within the dataset.
+
+  Pick the experiment type that matches the task you're testing:
+
+  Use **LLM** for text generation. You can import prompts *and* agents in the same experiment.
+
+  Use **TTS** to generate audio from text. Add prompts with different voices, models, and parameters to compare.
+
+  Use **STT** to transcribe audio. Each prompt configuration must point at a dataset column containing the input audio.
+
+  Use **Image Generation** to create images from text (or text + image). Compare image models and prompts side by side.
+
+  ![Create Experiment](/screenshot/product/dataset/how-to/experiments-in-dataset/2.png)
 
-  In the prompt template section, define the prompts and models for the experiment. You can add multiple prompt templates; each can use one or more models so you compare many combinations.
-  ![Prompt Template](/screenshot/product/dataset/how-to/experiments-in-dataset/3.png)
+  Set up the prompt and model configurations you want to compare.
+Each configuration becomes a separate column in the experiment grid.
 
-  Choose the model type and model(s) you want for the experiment. You can select multiple models to compare. You can also create a custom model via "Create Custom Model".
-  Select **LLM** for text generation (chat). Choose one or more chat models to compare prompt performance.
-  ![LLM](/screenshot/product/dataset/how-to/experiments-in-dataset/4.png)
-  Click [here](/docs/evaluation/features/custom-models) to learn how to create a custom model.
+  For LLM experiments, click **Add Prompt/Agents** to import a prompt or agent. You can mix prompts and agents in the same experiment and score them against the same evals.
+
+  - **Prompts**: pick a prompt from the [Prompt](/docs/prompt) section, select a published version, then attach **one or more models**. Each (prompt, model) pair becomes its own configuration, so adding three models to one prompt creates three columns to compare. For each model you can tune temperature, max tokens, top-p, response format, and tool config.
+  - **Agents**: pick an agent from the [Agent Playground](/docs/agent-playground) and select a published version. The agent's model, tools, and graph are captured at that version, so the run stays reproducible even if the agent is edited later. You don't pick a model again here.
+  ![LLM](/screenshot/product/dataset/how-to/experiments-in-dataset/3.png)
 
-  Select **Text-to-Speech** to generate audio from text. Choose TTS models to compare voice output across prompts.
-  ![TTS](/screenshot/product/dataset/how-to/experiments-in-dataset/5.png)
-  Click [here](/docs/evaluation/features/custom-models) to learn how to create a custom model.
+  For each prompt, write the instructions inline (use `{{column_name}}` to reference dataset columns) and attach one or more **TTS models** (with voice and format settings). Click **+ Add Prompt** to add more prompt entries. Each (prompt, model) pair becomes its own column.
+  Output format is fixed to Audio.
+  ![TTS](/screenshot/product/dataset/how-to/experiments-in-dataset/4.png)
 
-  Select **Speech-to-Text** to transcribe audio into text. Choose STT models to compare transcription quality.
-  ![STT](/screenshot/product/dataset/how-to/experiments-in-dataset/6.png)
-  Click [here](/docs/evaluation/features/custom-models) to learn how to create a custom model.
+  For each prompt, write the instructions inline (use `{{column_name}}` to reference dataset columns), pick the dataset column containing the input audio, and attach one or more **STT models**. Click **+ Add Prompt** to add more entries to compare transcription quality.
+  ![STT](/screenshot/product/dataset/how-to/experiments-in-dataset/5.png)
 
-  Select **Image Generation** to create images from text (or image + text). Choose image models to compare output quality.
-  ![Image Generation](/screenshot/product/dataset/how-to/experiments-in-dataset/7.png)
+  For each prompt, write the instructions inline (use `{{column_name}}` to reference dataset columns) and attach one or more **image models**. Click **+ Add Prompt** to add more entries and compare output quality across models and parameters.
+  ![Image Generation](/screenshot/product/dataset/how-to/experiments-in-dataset/6.png)
 
+  Models you've added through Custom Models show up in the model picker for prompt configurations across all experiment types.
-  Click [here](/docs/evaluation/features/custom-models) to learn how to create a custom model.
+  See [Custom Models](/docs/evaluation/features/custom-models) for how to register a custom or self-hosted model.
 
-  Use an existing prompt template or create a new one. You can add as many prompt templates as you need.
-  Click [here](/docs/prompt-workbench) to learn more about prompts.
+  For prompts, you can also configure **tool calling** with **Auto**, **Required**, or **None**, and add tool definitions the model can invoke.
 
-  Experiments compare prompt–model performance using evals.
-  Add the evals you want to run on the generated responses.
+  The final step has two parts: an optional **base column** and the **evals** you want to score outputs with.
+
+  **Compare against baseline (optional)**: pick a column from the dataset to compare model outputs against (typically a ground-truth or existing run-prompt column). Skip it if you don't have a reference output yet; you can still run the experiment, attach evals that don't need a baseline, and add a base column later by editing the experiment.
+
+  **Add evaluations**: click **Add Evaluation** and pick from the [built-in eval](/docs/evaluation/builtin) catalog or [create a custom eval](/docs/evaluation/features/custom). Add as many as you need. Every eval runs on every configuration so the results are directly comparable.
+  ![Choosing Evals](/screenshot/product/dataset/how-to/experiments-in-dataset/7.png)
+
+  For each eval, map its inputs (e.g. `output`, `input`, `expected`) to the model output or to dataset columns. Mapping is required before the experiment can run.
+  ![Choosing Evals](/screenshot/product/dataset/how-to/experiments-in-dataset/8.png)
 
-  Click "Add Evaluation" and pick from [existing eval](/docs/evaluation/builtin) templates or [create a custom eval](/docs/evaluation/features/custom). You can add as many evals as you want.
-  ![Choosing Evals](/screenshot/product/dataset/how-to/experiments-in-dataset/9.png)
 
-  After configuring prompts, models, and evals, click "Run" to start the experiment. The system will generate responses for each prompt–model pair, run the evals, and show results and comparisons when complete.
+  Click **Run** to start. The experiment processes every row across every prompt/agent-model configuration in parallel, running the evals on each output as it arrives. The grid streams results live so you can watch progress without waiting for the whole run to finish.
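The run described above fans out over every (row, configuration) pair and scores every output with every eval. A minimal sketch of that fan-out, including `{{column_name}}` substitution; all names here (`render_prompt`, `run_experiment`, the config and eval shapes) are illustrative, not the platform's actual API:

```python
import re

def render_prompt(template, row):
    # Replace {{column_name}} placeholders with values from one dataset row.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

def run_experiment(rows, configs, evals, generate):
    # Every (row, configuration) pair produces one output cell in the grid;
    # every eval then scores every cell, so results stay directly comparable.
    results = []
    for config in configs:
        for i, row in enumerate(rows):
            prompt = render_prompt(config["template"], row)
            output = generate(config["model"], prompt)  # hypothetical model call
            scores = {name: fn(output, row) for name, fn in evals.items()}
            results.append({"config": config["name"], "row": i,
                            "output": output, "scores": scores})
    return results
```

Two configurations over two rows yield four cells, which is why adding models multiplies columns rather than rows.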
 
+  If you spot a misconfiguration or want to abort, click **Stop** on a running experiment from the Experiments tab. Any in-flight cells are marked as errored, and you can then edit the experiment and rerun without waiting for the full run to complete.
 
-  You can change the experiment at any time: edit the name, base column, prompt templates, models, or evals, then save. Use **Re-run** to run the experiment again with the same or updated config (e.g. after adding rows to the dataset or changing a prompt). Re-run processes all rows again and refreshes the experiment dataset results.
-  ![Update](/screenshot/product/dataset/how-to/experiments-in-dataset/10.png)
+  Use **Rerun Experiment** to re-execute the experiment after editing prompts, models, evals, or the base column. Reruns are granular: only the configurations you actually changed are re-executed, and results from untouched configurations are preserved.
+
+  For more targeted reruns:
+
+  - **Rerun a single cell**: hover any output or eval cell in the grid and click the rerun icon. Useful when one row failed transiently or you've tweaked a single configuration.
+  - **Rerun a column**: from the column header, choose **Run all cells in the column** or **Run only failed cells in the column**. Failed-only is the fastest way to recover from API hiccups without redoing successful work.
+  - **Rerun an eval**: re-execute a single eval across all rows after changing its config or mapping, without re-generating any model outputs.
+
+  ![Update](/screenshot/product/dataset/how-to/experiments-in-dataset/9.png)
 
-  When the experiment has finished, use the **Compare** (or comparison) view to see how each prompt–model combination performed. Set weights for eval scores and metrics (e.g. response time, token usage) to compute an overall ranking. The comparison shows which combination ranks best so you can choose a winner.
+  Open the **Compare** view to see how every configuration performed.
+  Set weights (0-10) for each eval score and for response time, completion tokens, and total tokens. The system normalizes the metrics, computes an overall rating per configuration, and ranks them so the winner is clear. Adjust the weights to match what matters for your use case (e.g. prioritize quality over cost) and the ranking updates in place.
 
+## Tips
+
+- **Use published versions**: experiments only run published prompt and agent versions. Publish the version you want to test before importing it.
+- **Mix prompts and agents**: an **LLM** experiment can contain prompts and agents side by side, scored against the same evals. Useful when you're deciding whether an agent is worth the extra complexity over a prompt. TTS, STT, and Image experiments accept prompts only.
+- **Failed-only rerun**: when transient failures (rate limits, network blips) leave a few cells errored, use the failed-only rerun on the column to recover them without redoing successful rows.
+
 ## Next Steps
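The Compare view's weighted ranking described above can be sketched as min-max normalization followed by a weighted sum. This is an illustrative reconstruction, not the platform's exact formula; `rank_configurations` and the metric layout are hypothetical, and metrics where lower is better (latency, tokens) are assumed to be negated before normalizing:

```python
def rank_configurations(metrics, weights):
    # metrics: {config_name: {metric_name: raw_value}} with higher = better
    # weights: {metric_name: weight in the 0-10 range}
    names = list(metrics)
    ratings = {n: 0.0 for n in names}
    for metric, w in weights.items():
        values = [metrics[n][metric] for n in names]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all values tie
        for n in names:
            # Min-max normalize each metric to [0, 1], then apply its weight.
            ratings[n] += w * (metrics[n][metric] - lo) / span
    # Highest overall rating ranks first: that configuration is the "winner".
    return sorted(names, key=lambda n: ratings[n], reverse=True)
```

Raising a metric's weight shifts the ranking toward configurations that excel at it, which mirrors how adjusting the sliders reorders the comparison in place.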