Creating Evaluation Metrics

From the Evaluations tab, you can create an evaluation metric from a function or via a webhook.

From a Python function

You can create an evaluation metric from a Python function in two ways:

Debugging your evaluation function

You can run an evaluation function on the results of a dataset and edit the inputs to debug its behavior. To do so:

  1. Click on the Debug on data button in the bottom left of the function modal
  2. Select a dataset whose prompt template inputs match those of your evaluation metric
  3. Select the test case rows on which you want to run the function
  4. Click on the Run button

You can edit any inputs, the output, or the target by clicking on the respective entries in the table.

Debug evaluation function

Supported packages

When building evaluation functions on the platform, the following Python packages are supported:

  • numpy
  • openai
  • anthropic
  • marvin
  • spacy
  • nltk
  • parea-ai
  • tiktoken
  • built-in Python packages

However, you can always request additional packages. Click the Request new package button in the UI, or reach out.
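
For example, here is a minimal sketch of an evaluation function that uses one of the supported packages (tiktoken). The Log type and its output field are the same ones used in the examples below; the 100-token budget is purely illustrative:

example_token_budget_eval
import tiktoken

def token_budget_eval(log: Log) -> float:  # Log is provided by the platform (see the example below)
    # Illustrative check: score 1.0 if the LLM output stays within a 100-token budget, else 0.0
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(log.output))
    return 1.0 if num_tokens <= 100 else 0.0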

Using function calling with evaluation functions

If you are using function calling in your prompt, you can still use evaluation metrics. When LLMs use function calling, they respond with a stringified list of JSON objects.

example_function_call_response
[
    {
        "function": {
            "arguments": {
                "location": "New York",
                "unit": "celsius"
            },
            "name": "get_current_weather_EDITED"
        }
    }
]

The list will have at least one object with a function parameter, and the function object will always have a name field and an arguments field.

To display code snippets in the UI, Parea wraps the JSON string in triple backticks (```).

If you want to validate that the function call has the correct arguments in your evaluation function, you can access them by:

  1. First stripping the backticks
  2. Then parsing the JSON string
  3. Then accessing the fields
example_eval_function
import json
from typing import Any

def eval_fun(log: Log) -> float:
    # Strip the triple backticks Parea adds, then parse the stringified function call
    tool_calls: list[dict[str, Any]] = json.loads(log.output.strip('`'))
    first_function = tool_calls[0]["function"]
    # access the function call name
    name = first_function["name"]
    # access the function call arguments
    arguments = first_function["arguments"]
    ...
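
For instance, here is a hedged sketch that completes the check above. The expected function name and argument keys are taken from the sample response; the pass/fail scoring is an assumption:

example_function_call_eval
import json
from typing import Any

def eval_function_call(log: Log) -> float:
    # Strip the triple backticks and parse the stringified function call
    tool_calls: list[dict[str, Any]] = json.loads(log.output.strip('`'))
    first_function = tool_calls[0]["function"]
    # Check that the model called the expected function (name taken from the sample response above)
    if first_function["name"] != "get_current_weather_EDITED":
        return 0.0
    # Check that the expected argument keys (illustrative) are all present
    expected_keys = {"location", "unit"}
    return 1.0 if expected_keys <= set(first_function["arguments"]) else 0.0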

From a Webhook

To work with our webhook evaluation metrics, you will have to define a function that accepts a JSON payload (described below) and returns a float or a boolean. A minimal sketch of such a webhook is shown after the list below.

Your webhook function should:

  1. Be exposed via an unauthenticated API to which Parea can make an HTTP POST request.
  2. Encode business logic that interprets the provided JSON payload and returns a numerical score representing the quality of the LLM output.
  3. Accept the JSON payload:
{
	"inputs": {       // Key/value pairs for each variable in the test case / prompt
		"var_1": "var_1_value",
		"var_2": "var_2_value",
		...
	},
	"target": string | null,   // An optional specified expected output value
	"output": string           // The LLM's output
}
  4. Return a float.
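
Below is a minimal sketch of such a webhook built with FastAPI. The framework, the /evaluate route, and the exact-match scoring are illustrative assumptions, not requirements:

example_webhook_eval
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvalPayload(BaseModel):
    # Mirrors the JSON payload Parea sends: test case inputs, optional target, and the LLM output
    inputs: dict[str, str]
    target: Optional[str] = None
    output: str

@app.post("/evaluate")  # expose this route publicly so Parea can make an unauthenticated POST request
def evaluate(payload: EvalPayload) -> float:
    # Illustrative scoring: exact match against the target when one is provided
    if payload.target is None:
        return 0.0
    return 1.0 if payload.output.strip() == payload.target.strip() else 0.0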