Evaluation metrics
Add evaluation metrics to the playground.
You can use evaluation functions in the playground by clicking the **Evaluation metrics** button in a prompt session.
Here, you will have the option to select an existing metric or create a new one.
Registering an auto-evaluation metric
Parea provides use-case-specific evaluation metrics that you can use out of the box.
To get started, click **Register new auto-eval metric**. This will allow you to create a metric based on your specific inputs.
Next, find the metric you want to use based on your use case. Each metric has its required and optional variables.
Your prompt template must have a variable for any required inputs. For example, the **LLM Grader** metric expects your prompt to have a `{{question}}` variable. If your variable is named something else, you can select which variable to associate with the `question` field from the drop-down menu.
Click **Register** once you are done, and that metric will be enabled.
Using a custom eval metric
You can select any previously created metrics you want in the **Evaluation metrics** modal and then click **Set eval metric(s)** to attach them to your current session.
To create a new custom evaluation function, click **Create new custom metric**.
The editor will be pre-populated with a template for you to get started.
You can delete all the code as long as you retain the `eval_fun` signature: `def eval_fun(log: Log) -> float:`.
To ensure that your evaluation metrics are reusable across the entire Parea ecosystem, with any LLM model or use case, we introduced the `log` parameter.
All evaluation functions accept the `log` parameter, which provides all the information needed to perform an evaluation.
Evaluation functions are expected to return a floating-point score or a boolean. As long as your function keeps this signature and returns a float or boolean, your new metric will be valid.
A simple example could be:
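The sketch below returns 1.0 when the model produced a non-empty response and 0.0 otherwise. It assumes the model output is exposed as `log.output` and that `Log` can be imported from the Parea SDK (the playground template typically provides this for you; adjust the import path to your SDK version):

```python
import json  # only needed if your metric parses structured output

# Assumed import path for the Log type; the playground template provides it.
from parea.schemas.log import Log


def eval_fun(log: Log) -> float:
    # Score 1.0 if the model produced a non-empty response, 0.0 otherwise.
    output = (log.output or "").strip()
    return 1.0 if output else 0.0
```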
Testing function calling with evaluation functions
If you are using function calling in your prompt, you can still use evaluation metrics. When LLMs use function calling, they respond with a stringified list of JSON objects.
The list will have at least one dictionary with the key `function`, and that dictionary will always have a `name` field and an `arguments` field.
If you want to validate that the function call has the correct arguments in your evaluation function, you can access it as shown in the sketch after this list:
- First, strip the backticks
- Then, parse the JSON string
- Then, access the fields
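Here is a minimal sketch of such a check. The expected function name (`get_weather`) and argument (`city`) are hypothetical placeholders, and the `log.output` field and `Log` import path are assumptions that may differ in your SDK version:

```python
import json

# Assumed import path for the Log type; the playground template provides it.
from parea.schemas.log import Log


def eval_fun(log: Log) -> float:
    raw = (log.output or "").strip()

    # 1. Strip the surrounding backticks (and an optional "json" language tag).
    raw = raw.strip("`").removeprefix("json").strip()

    try:
        # 2. Parse the JSON string into a list of function-call objects.
        calls = json.loads(raw)

        # 3. Access the fields of the first function call.
        function = calls[0]["function"]
        name = function["name"]
        arguments = function["arguments"]
        # "arguments" is often itself a JSON-encoded string; parse it if so.
        if isinstance(arguments, str):
            arguments = json.loads(arguments)
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return 0.0

    # Example check: the call targets the hypothetical get_weather function
    # and includes the required city argument.
    return 1.0 if name == "get_weather" and "city" in arguments else 0.0
```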