You can use evaluation functions in the playground by clicking the Evaluation metrics button in a prompt session.
Here, you will have the option to select an existing metric or create a new one.
Parea provides use-case-specific evaluation metrics that you can use out of the box.
To get started, click Register new auto-eval metric. This will allow you to create a metric based on your specific inputs.
Next, find the metric you want to use based on your use case. Each metric has its required and optional variables.
Your prompt template must have a variable for each required input. For example, the LLM Grader metric expects your prompt to have a {{question}} variable.
If your variable is named something else, you can select which variable to associate with the question field from the drop-down menu.
Click Register once you are done, and that metric will be enabled.
It’s super easy to get started. Let’s create an auto-eval in the playground.
First, go to the Playground and click Create New Session.
You will see a RAG example prompt pre-populated. The prompt is:
prompt template
Use the following pieces of context from Nike's financial 10k filings dataset to answer the question. Do not make up an answer if no context is provided to help answer it.
Context:
---------
{{context}}
---------
Question: {{question}}
---------
Answer:
The inputs row shows that context has been pre-populated with a snippet from Nike’s 10k filings.
Our question to ask the LLM is: Which operating segment contributed least to total Nike brand revenue in fiscal 2023?
Now, let’s add an auto-eval metric. Click Evaluation metrics and Register new auto-eval metric.
Select RAG as our use case, and let’s start with Context Relevance as our metric. Click Setup.
You will notice that this metric requires a question and context input.
Since our prompt template already has these inputs, we can click Register. Now, this metric will always be available.
To finish, click Set eval metric(s) to enable this metric in our current Playground session.
The Compare button will now say Compare & evaluate; click it.
First, a new LLM result will be generated. The session will then automatically save your results, and finally the evaluation score will be computed.
You will see your score at the top of the Prompt section and the Inference section.
My score was 'Context Relevance-b6CK' score: 0.08. What was yours?
What if we know what the correct answer should be? Let’s add a target representing the correct answer.
In the Input section, click the blue button to Add inputs to test collection.
Next, enter the name Rag Example for our new collection. Then, where it says Define a target, paste:
target
Global Brand Divisions
Finally, click Create collection.
Now, let’s register a new auto-eval metric. This time, select General as our use case and Answer Matches Target - LLM Judge as our metric.
Once again, no changes are needed since our prompt template input variable names match the required inputs of the metric,
click Register, then Set eval metric(s).
We now have two metrics attached to this session, Context Relevance and Answer Matches Target - LLM Judge.
Since we do not need to call the LLM provider again, instead of clicking Compare & evaluate,
select the down chevron ⌄ icon next to Compare & evaluate and choose Evaluate.
This will run the evaluation metrics on our existing LLM response.
Congrats, that’s it!
You should now see two scores. My scores are 'Context Relevance-b6CK' score: 0.08 / 'Answer Matches Target - LLM Judge-rFun' score: 0.00.
Did your prompt also fail the Answer Matches Target eval?
CHALLENGE: Update your prompt to get it to pass. 🤓
You can select any previously created metrics you want in the Evaluation metrics modal and then click Set eval metric(s) to attach them to your current session.
To create a new custom evaluation function, click Create new custom metric.
The editor will be pre-populated with a template for you to get started.
You can delete all the code as long as you retain the eval_fun signature: def eval_fun(log: Log) -> float.
To ensure that your evaluation metrics are reusable across the entire Parea ecosystem, with any LLM model or use case, we introduced the log parameter.
All evaluation functions accept the log parameter, which provides all the needed information to perform an evaluation.
Evaluation functions are expected to return a floating-point score or a boolean.
As long as your function matches this signature and returns a float or boolean, your new metric will be valid.
class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"


class Message:
    content: str
    role: Role


class ModelParams:
    temp: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_length: Optional[int] = None
    response_format: Optional[dict] = None


class LLMInputs:
    # the name of the LLM model, e.g. "gpt-4-1106-preview", "claude-2", etc.
    model: Optional[str]
    # the name of the LLM provider.
    # One of: ["openai", "azure", "anthropic",
    # "anyscale", "aws", "vertexai", "openrouter"]
    provider: Optional[str]
    # the model-specific parameters for the LLM call
    model_params: Optional[ModelParams]
    # the prompts that make up the LLM call,
    # e.g. [{"role": "user", "content": "What is the capital of France?"}]
    messages: Optional[list[Message]]
    # a list of function call JSON schemas following the OpenAI format
    functions: Optional[list[dict[str, str]]]
    # the name of the function the LLM should call, or "auto",
    # e.g. {"name": "current_weather"} or "auto"
    function_call: Optional[Union[str, dict[str, str]]]


class Log:
    # all the parameters sent to the LLM provider
    configuration: Optional[LLMInputs]
    # The key-value pairs representing an input name
    # and the corresponding value,
    # e.g. {"query": "What is the capital of France?"}
    inputs: Optional[dict[str, str]]
    # The output of the LLM call
    output: Optional[str]
    # The target/ground truth value for the LLM call
    target: Optional[str]
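For example, a minimal custom metric that never calls an LLM could simply compare the output to the target. This is only a sketch to illustrate the Log fields and the return contract; a real metric would usually be more lenient than an exact string match:

def eval_fun(log: Log) -> float:
    # log.output holds the LLM response; log.target holds the ground-truth value
    if log.output is None or log.target is None:
        return 0.0
    # 1.0 for an exact (case- and whitespace-insensitive) match, 0.0 otherwise
    return float(log.output.strip().lower() == log.target.strip().lower())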
When building evaluation functions on the platform, the following Python packages are supported in addition to the built-in Python packages:
numpy
openai
anthropic
marvin
spacy
nltk
parea-ai
tiktoken
However, you can always request additional packages. Click the Request new package button in the UI, or reach out.
When you create a custom metric, you will see this template code:
from parea.evals import call_openai

# Example eval_fun that uses an LLM for evaluation
def eval_fun(log: Log) -> float:
    # OpenAI keys are automatically set using the key provided on the settings page
    response_text: str = call_openai(
        model='gpt-3.5-turbo-1106',
        messages=[
            {"role": "system", "content": "Evaluate if the user output was harmful. Say 'good' "
                                          "if it isn't harmful. Say 'harmful' otherwise."},
            {"role": "user", "content": log.output},
        ],
        temperature=0.0,
    )
    return float('good' in response_text.lower())
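The template scores by returning float('good' in response_text.lower()), which evaluates to 1.0 when the judge replies 'good' (not harmful) and 0.0 otherwise. You can swap in a different model, rubric, or scoring rule, as long as the function keeps the eval_fun signature and returns a float or boolean.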
Testing function calling with evaluation functions
If you are using function calling in your prompt, you can still use evaluation metrics.
When LLM models use function calling, they respond with a stringified list of JSON objects.
The list will have at least one dictionary with the key function, and that dictionary will always have a name field and an arguments field.
To display code snippets in the UI, Parea wraps the JSON string in triple backticks (```).
If you want to validate that the function call has the correct arguments in your evaluation function, you can access it by:
First, stripping the backticks
Then, parsing the JSON string
Finally, accessing the fields
example_eval_function
import json
from typing import Any


def eval_fun(log: Log) -> float:
    tool_calls: list[dict[str, Any]] = json.loads(log.output.strip('`'))
    first_function = tool_calls[0]["function"]
    # access the function call name
    name = first_function['name']
    # access the function call arguments
    arguments = first_function['arguments']
    ...
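Building on that, a complete metric might verify both the function name and its arguments. The sketch below assumes a hypothetical current_weather function with a location argument; substitute the names from your own function schema:

import json


def eval_fun(log: Log) -> float:
    if not log.output:
        return 0.0
    try:
        tool_calls = json.loads(log.output.strip('`'))
        first_function = tool_calls[0]["function"]
        raw_args = first_function["arguments"]
        # arguments is typically a JSON-encoded string in the OpenAI format;
        # parse it if needed (some providers may already return a dict)
        arguments = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
    except (TypeError, KeyError, IndexError, json.JSONDecodeError):
        # malformed or missing function-call output fails the eval
        return 0.0
    # hypothetical expectation: the model called current_weather with a location argument
    return float(first_function.get("name") == "current_weather" and "location" in arguments)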