You can use evaluation functions in the playground by clicking the Evaluation metrics button in a prompt session. Here, you will have the option to select an existing metric or create a new one.

Registering an auto-evaluation metric

Parea provides use-case-specific evaluation metrics that you can use out of the box. To get started, click Register new auto-eval metric; this lets you configure a metric against your prompt's specific inputs. Next, find the metric that matches your use case. Each metric has its own required and optional variables.

Your prompt template must include a variable for each required input. For example, the LLM Grader metric expects your prompt to have a {{question}} variable (see the illustration below). If your variable is named something else, you can use the drop-down menu to select which variable to associate with the question field. Click Register once you are done, and the metric will be enabled.
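
For illustration, a hypothetical prompt template that satisfies the LLM Grader requirement could be as simple as the line below; the wording is a placeholder, and the important part is that it contains the {{question}} variable:

Answer the following question as accurately and concisely as you can: {{question}}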

Using a custom eval metric

You can select any previously created metrics in the Evaluation metrics modal and then click Set eval metric(s) to attach them to your current session. To create a new custom evaluation function, click Create new custom metric.

Evaluation function

The editor will be pre-populated with a template to get you started. You can delete all of the code as long as you retain the eval_fun signature, def eval_fun(log: Log) -> float:. To keep evaluation metrics reusable across the entire Parea ecosystem, with any LLM model or use case, we introduced the log parameter. Every evaluation function accepts the log parameter, which provides all the information needed to perform an evaluation. Evaluation functions are expected to return a floating-point score or a boolean. As long as your function has this signature and returns a float or boolean, your new metric will be valid.

A simple example could be:

def eval_fun(log: Log) -> float:
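    # Exact match: 1.0 if the model output equals the target, else 0.0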
    return float(log.output == log.target)
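
Booleans are accepted as well. For example, a minimal sketch of a case-insensitive match, assuming log.output and log.target are plain strings (the same fields used above):

def eval_fun(log: Log) -> bool:
    # Case-insensitive comparison between the model output and the target
    return log.output.strip().lower() == log.target.strip().lower()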

Testing function calling with evaluation functions

If you are using function calling in your prompt, you can still use evaluation metrics. When an LLM uses function calling, it responds with a stringified list of JSON objects.

The list will have at least one dictionary with the key function, and that dictionary will always have a name field and an arguments field.

To display code snippets in the UI, Parea wraps the JSON string in triple backticks (```).

If you want to validate that the function call has the correct arguments in your evaluation function, you can access it by:

  1. First stripping the backticks
  2. Then parsing the JSON string
  3. Then accessing the fields
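
Putting these steps together, a minimal sketch could look like the one below. It reuses the Log parameter from the earlier examples, assumes the arguments field is itself a JSON-encoded string (as in the OpenAI function-calling format), and uses get_weather and location purely as placeholder names for whatever your own function definition expects:

import json

def eval_fun(log: Log) -> float:
    # 1. Strip the wrapping backticks (and a possible language tag)
    #    that Parea adds around the stringified function call.
    raw = log.output.strip().strip("`")
    if raw.startswith("json"):
        raw = raw[len("json"):]
    raw = raw.strip()

    # 2. Parse the JSON string into a list of function calls.
    calls = json.loads(raw)
    function = calls[0]["function"]

    # 3. Access the fields. The arguments field is assumed to be a
    #    JSON-encoded string, so parse it as well if needed.
    arguments = function["arguments"]
    if isinstance(arguments, str):
        arguments = json.loads(arguments)

    # Placeholder check: the expected function was called with the expected argument.
    return float(function["name"] == "get_weather" and "location" in arguments)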