Evaluation functions
A key part of successful prompt engineering is robust evaluation. Parea enables you to define customizable evaluation functions that can be used to score the outputs of LLMs.
Evaluation functions in Python
You can create an evaluation function in Python by going to the Test Hub tab and clicking the “Create function eval” button.
You are then prompted to implement the function eval_fun, which receives these arguments:
- inputs: key-value pairs of inputs fed into the prompt template, chain, or agent
- output: output of the LLM
- target: optional target for the inputs
The function should return a float between 0 (bad) and 1 (good) inclusive.
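For example, a minimal exact-match evaluation might look like the following sketch (the argument names follow the description above; the exact signature, normalization, and fallback behavior are illustrative choices, not requirements):

```python
def eval_fun(inputs: dict, output: str, target: str = None) -> float:
    # Compare the LLM output against the optional target (case-insensitive exact match).
    if target is None:
        # Without a target, fall back to a trivial non-empty check.
        return 1.0 if output.strip() else 0.0
    return 1.0 if output.strip().lower() == target.strip().lower() else 0.0
```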
Supported packages
In addition to the built-in Python packages, we currently support:
- numpy
- openai
- anthropic
- marvin
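As a sketch of how one of these packages can be used, the following eval scores the output by bag-of-words cosine similarity to the target using numpy (the similarity metric itself is just an illustrative choice):

```python
import numpy as np

def eval_fun(inputs: dict, output: str, target: str = None) -> float:
    # Score the output by cosine similarity between bag-of-words vectors
    # of the LLM output and the target; 1.0 means identical word counts.
    if not target:
        return 0.0
    out_tokens, tgt_tokens = output.lower().split(), target.lower().split()
    vocab = sorted(set(out_tokens) | set(tgt_tokens))
    out_vec = np.array([out_tokens.count(w) for w in vocab], dtype=float)
    tgt_vec = np.array([tgt_tokens.count(w) for w in vocab], dtype=float)
    denom = np.linalg.norm(out_vec) * np.linalg.norm(tgt_vec)
    return float(out_vec @ tgt_vec / denom) if denom else 0.0
```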
Debug an evaluation metric
You can run an evaluation function on the results of a test case collection and edit the inputs to debug its behavior. To do so:
- Click on the “Debug on data” button in the bottom left of the function modal
- Select a test case collection
- Select the rows on which you want to run the function
- Click on the “Run” button
You can edit any inputs, output, or target by clicking on the respective entries in the table.
Webhook
To work with our webhook evaluation metrics, you will need to define a function that accepts a JSON payload (defined below) and returns a float between 0 and 1 inclusive; 1.0 means a good result and 0.0 means a bad result.
Your webhook function should:
- Be exposed via an unauthenticated API endpoint that Parea can send an HTTP POST request to.
- Encode business logic that interprets the provided JSON payload and returns a numerical score representing the quality of the LLM output.
- Accept the following JSON payload:
```json
{
  "inputs": {
    // Key/value pairs for each variable in the test case / prompt
    "var_1": "var_1_value",
    "var_2": "var_2_value",
    ...
  },
  "target": string | null, // An optional specified expected output value
  "output": string // The LLM's output
}
```
- Return a float between 0 ("bad") and 1 ("good") inclusive.
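For instance, such a webhook could be hosted as a small FastAPI service along these lines (a sketch: the route path, the model names, and the exact-match scoring are assumptions you would replace with your own setup):

```python
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvalPayload(BaseModel):
    # Mirrors the JSON payload described above.
    inputs: dict
    output: str
    target: Optional[str] = None

@app.post("/eval")
def evaluate(payload: EvalPayload) -> float:
    # Business logic: here, a simple exact match against the optional target.
    if payload.target is None:
        return 1.0 if payload.output.strip() else 0.0
    return 1.0 if payload.output.strip() == payload.target.strip() else 0.0
```

You could then serve this with, for example, uvicorn and point your webhook evaluation metric at the resulting public URL.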
Using Evaluation Functions in the Lab
You can use evaluation functions in the Lab by clicking the “Evaluation metrics” button in a lab session and then selecting one or more evaluation functions to use. The score of each evaluation function will be displayed on every inference cell, and the average score will be displayed for every column.
Using Evaluation Functions on Production Traffic
You can use evaluation functions on production traffic as outlined in the Track Metrics section.