- Open a dataset and start a new benchmark run.
- Type in a name for the benchmark run in the pop-up modal.
- Select the test cases you want to use for the run.
- Select the prompts you want to benchmark. The prompt template variables should match those of the dataset.
- Select the evaluation metrics you want to use to score the prompts' outputs. If an evaluation metric relies on `log.inputs`, the prompt templates' variable names should match those the metric expects.
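To illustrate why the variable names must line up, here is a minimal sketch of the relationship between a test case's inputs, a prompt template, and an evaluator that reads `log.inputs`. All names here (`render_prompt`, `exact_match_evaluator`, the `log` dict shape) are illustrative assumptions, not the platform's actual API:

```python
# Hypothetical sketch: the dataset's test-case inputs, the prompt
# template's variables, and the evaluator's expected keys must agree.

def render_prompt(template: str, inputs: dict) -> str:
    # Raises KeyError if the template references a variable
    # that is missing from the test case's inputs.
    return template.format(**inputs)

def exact_match_evaluator(log: dict) -> float:
    # This evaluator relies on log["inputs"]; it expects an
    # "expected" key, so every test case must provide one.
    return 1.0 if log["output"] == log["inputs"]["expected"] else 0.0

# A test case whose variable names match the template's.
test_case = {"question": "2+2?", "expected": "4"}
template = "Answer the question: {question}"

prompt = render_prompt(template, test_case)
output = "4"  # stand-in for the model's response

log = {"inputs": test_case, "output": output}
score = exact_match_evaluator(log)
print(prompt)  # Answer the question: 2+2?
print(score)   # 1.0
```

If the template instead used `{query}`, or the evaluator looked for a key the test cases don't supply, the run would fail or score incorrectly; this is why the selection steps above check that names match.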
After the benchmark run finishes, you can see the aggregated results in the "Benchmark" tab of the dataset.
You can also view and manually rate each individual inference result on the Benchmark - All Results tab.
To evaluate your entire LLM application rather than individual prompts, see Experiments for more details.