A key part of successful prompt engineering is robust evaluation. Parea enables you to define customizable evaluation functions that can be used to score the outputs of LLMs.

Evaluation functions in Python

You can create an evaluation function in Python by going to the Test Hub tab and clicking the “Create function eval” button. You are then prompted to implement the function eval_fun, which receives these arguments:

  • inputs: key-value pairs of inputs fed into the prompt template, chain or agent
  • output: output of the LLM
  • target: an optional expected output for the given inputs

The function should return a float between 0 (bad) and 1 (good) inclusive.
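For example, a minimal eval function with the signature above might score an exact match against the target. The case- and whitespace-insensitive normalization here is illustrative, not a Parea requirement:

```python
from typing import Optional


def eval_fun(inputs: dict, output: str, target: Optional[str]) -> float:
    # Illustrative exact-match scorer: 1.0 if the output matches the
    # target (case-insensitive, ignoring surrounding whitespace).
    if target is None:
        return 0.0  # no target to compare against
    return 1.0 if output.strip().lower() == target.strip().lower() else 0.0
```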

Evaluation function

Supported packages

In addition to the built-in Python packages, we currently support:

  • numpy
  • openai
  • anthropic
  • marvin

Debug an evaluation metric

You can run an evaluation function on the results of a test case collection and edit the inputs to debug its behavior. To do so:
  1. Click on the “Debug on data” button in the bottom left of the function modal
  2. Select a test case collection
  3. Select the rows on which you want to run the function
  4. Click on the “Run” button

You can edit any input, output, or target by clicking on the respective entry in the table.

Debug evaluation function


Evaluation Functions via Webhooks

To work with our webhook evaluation metrics, you will have to define a function that accepts a JSON payload (defined below) and returns a float between 0 and 1 inclusive; 1.0 means a good result and 0.0 means a bad result.

Your webhook function should:

  1. Be exposed via an unauthenticated endpoint that Parea can send an HTTP POST request to.

  2. Encode business logic that interprets the provided JSON payload and returns a numerical score representing the quality of the LLM output.

  3. Accept the following JSON payload:

    	{
    		"inputs": {                // Key/value pairs for each variable in the test case / prompt
    			"var_1": "var_1_value",
    			"var_2": "var_2_value"
    		},
    		"target": string | null,   // An optional expected output value
    		"output": string           // The LLM's output
    	}

  4. Return a float between 0 ("bad") and 1 ("good") inclusive.
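To make the contract concrete, here is a minimal sketch of the scoring logic behind such a webhook. The payload shape is the one defined above; the containment check and any web-framework wiring around it are assumptions for illustration:

```python
import json


def score(payload: dict) -> float:
    # Interpret the webhook payload described above and return a score in [0, 1].
    target = payload.get("target")
    output = payload.get("output", "")
    if target is None:
        return 0.0  # no expected value to compare against
    # Illustrative business logic: full credit if the target string
    # appears in the LLM output, otherwise no credit.
    return 1.0 if target in output else 0.0


# Simulate the POST body Parea would send:
body = json.loads(
    '{"inputs": {"question": "What is the capital of France?"},'
    ' "target": "Paris", "output": "The capital of France is Paris."}'
)
print(score(body))  # 1.0
```

In production this function would sit behind an HTTP handler (e.g. a Flask or FastAPI route) that parses the POST body and returns the float in the response.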

Using Evaluation Functions in the Lab

You can use evaluation functions in the Lab by clicking the “Evaluation metrics” button in a lab session and then selecting one or more evaluation functions. The scores of the evaluation functions will be displayed on every inference cell, and the average score will be displayed for every column.

Using Evaluation Functions on Production Traffic

You can use evaluation functions on production traffic as outlined in the Track Metrics section.