Getting Started

  1. Open a dataset and click Benchmark prompts.
  2. Type in a name for the benchmark run in the pop-up modal.
  3. Select the test cases you want to use for the run.
  4. Select the prompts you want to benchmark. The prompt template variables should match those of the dataset.
  5. Select the evaluation metrics you want to use to score the outputs of the prompts. If the evaluation metrics rely on log.inputs, the prompt templates’ variable names should match those expected by the evaluation metric (see the sketch after this list).
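
To illustrate why the variable names need to line up, the sketch below shows what an evaluation metric that reads log.inputs might look like. This is a hypothetical example: the Log class, the topic_mentioned function, and the "topic" variable name are assumptions for illustration, and the exact shape of the log object your platform passes to evaluators may differ.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the log object an evaluator receives; the real
# object is supplied by the platform. `inputs` holds the prompt template
# variables, `output` holds the model's generation.
@dataclass
class Log:
    inputs: dict
    output: str

def topic_mentioned(log: Log) -> float:
    """Score 1.0 if the model output mentions the `topic` template variable.

    The key looked up in `log.inputs` ("topic" here) must match a variable
    name used in the prompt template, which is why the template variables
    need to match what the evaluation metric expects.
    """
    topic = (log.inputs.get("topic") or "").lower()
    output = (log.output or "").lower()
    return 1.0 if topic and topic in output else 0.0

# Example usage with a fabricated log, purely for illustration:
example = Log(
    inputs={"topic": "photosynthesis"},
    output="Photosynthesis converts light into chemical energy.",
)
print(topic_mentioned(example))  # -> 1.0
```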

After the benchmarking finishes, you can see the aggregated results in the “Benchmark” tab of the dataset.

You can also view and manually rate all inference results on the Benchmark - All Results tab.

Evaluate your entire LLM application

See Experiments for more details.