You can evaluate your prompts at scale on a test case collection across multiple evaluation metrics while also measuring latency and cost. To do so:

  1. Open a test case collection and click “Benchmark prompts”.
  2. Type in a name for the run in the modal that pops up.
  3. Select the test cases you want to use for the run.
  4. Select the prompts you want to benchmark.
  5. Select the evaluation metrics you want to use to score the outputs of the prompts.

After the benchmarking is done, you can see the aggregated results in the “Benchmark” tab of the test case collection:

You can also view and manually rate all inference results in the “Benchmark - All Results” tab:


Chains & Agents

You can benchmark your LLM app across many inputs by using the benchmark command of our Python SDK. This runs the entry point of your app with the specified inputs and creates a report with the results.

parea benchmark --func app:main --csv_path benchmark.csv

The CSV file is used to fill in the arguments to your function, and the report is a CSV file containing all the traces. If you set your Parea API key, the traces are also logged to the Parea dashboard. Note that this feature requires a running Redis cache. Contact us if you would like to use this feature without a Redis cache.
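To illustrate how the CSV drives the benchmark, here is a minimal sketch of the mechanism. It assumes the CSV's column headers correspond to the keyword arguments of the entry-point function; the file name `app.py`, the function `main`, and its parameters are illustrative, not prescribed by the SDK.

```python
# app.py — a hypothetical entry point (names are illustrative).
# Each CSV row supplies one set of keyword arguments; the column headers
# are assumed to match the parameter names of the entry-point function.
import csv
import io

def main(topic: str, tone: str) -> str:
    # In a real app this would call an LLM chain or agent;
    # here we just format the prompt so the sketch is runnable.
    return f"Write a {tone} paragraph about {topic}."

# A benchmark.csv would contain a header row matching main()'s parameters:
CSV_TEXT = """topic,tone
compilers,formal
surfing,casual
"""

# Simulate how each CSV row becomes one invocation of the entry point.
results = [main(**row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
```

With a file like this, `parea benchmark --func app:main --csv_path benchmark.csv` would invoke `main` once per row, producing one traced result per input.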