# Evaluation metrics

> Add evaluation metrics to the playground.

You can use evaluation functions in the playground by clicking the `Evaluation metrics` button in a prompt session.
Here, you will have the option to select an existing metric or create a new one.

## Registering an auto-evaluation metric

Parea provides use-case-specific evaluation metrics that you can use out of the box.
To get started, click `Register new auto-eval metric`. This will allow you to create a metric based on your specific inputs.
Next, find the metric you want to use based on your use case. Each metric has its required and optional variables.

Your prompt template must have a variable for any required inputs. For example, the `LLM Grader` metric expects your prompt to have a `{{question}}` variable.
If your variable is named something else, you can select which variable to associate with the `question` field from the drop-down menu.
Click `Register` once you are done, and that metric will be enabled.

<video autoPlay muted loop playsInline className="w-full aspect-video" src="https://mintcdn.com/pareaai/lIjZ3aMZeTkxaUc8/platform/playground/using_auto_evals.mp4?fit=max&auto=format&n=lIjZ3aMZeTkxaUc8&q=85&s=8304807e02545e80076c8aa6e3b2c4ab" data-path="platform/playground/using_auto_evals.mp4" />

<Accordion title="Example - Auto-eval from New Prompt Session">
  It's super easy to get started. Let's create an auto-eval in the playground.
  First, go to the [Playground](https://app.parea.ai/playground) and click `Create New Session`.
  You will see a pre-populated RAG example prompt. The prompt is:

  ```text prompt template theme={null}
  Use the following pieces of context from Nike's financial 10k filings
  dataset to answer the question. Do not make up an answer if no context is
  provided to help answer it.
  Context:
  ---------
  {{context}}
  ---------
  Question: {{question}}
  ---------
  Answer:
  ```

  The inputs row shows that `context` has been pre-populated with a snippet from Nike's 10k filings.
  Our `question` to ask the LLM is: `Which operating segment contributed least to total Nike brand revenue in fiscal 2023?`

  Click `Compare` to see what the LLM's response is.

  ### Add an auto-eval metric

  Now, let's add an auto-eval metric. Click `Evaluation metrics` and `Register new auto-eval metric`.
  Select `RAG` as our use case, and let's start with `Context Relevance` as our metric. Click `Setup`.

  <img src="https://mintcdn.com/pareaai/3kurg3MZRrsWWk8t/platform/playground/context_relevance.png?fit=max&auto=format&n=3kurg3MZRrsWWk8t&q=85&s=86e3d90be4d622ede621acbfb4c99482" alt="context_relevance" width="528" height="524" data-path="platform/playground/context_relevance.png" />

  You will notice that this metric requires a `question` and `context` input.
  Since our prompt template already has these inputs, we can click `Register`. Now, this metric will always be available.
  To finish, click `Set eval metric(s)` to enable this metric in our current Playground session.
  The Compare button will now say `Compare & evaluate`; click it.

  First, a new LLM result is generated. Next, the session automatically saves your results. Finally, the evaluation score is computed.
  You will see your score at the top of the [Prompt section](/platform/playground/compare#prompt) and the [Inference section](/platform/playground/compare#inference).

  My score was `'Context Relevance-b6CK' score: 0.08`. What was yours?

  ### Auto-eval with target ("ground truth")

  What if we know what the correct answer should be? Add a `target` variable to our prompt to represent the correct answer.

  In the [Input section](/platform/playground/compare#inputs), click the blue button to `Add inputs to test collection`.

  <img src="https://mintcdn.com/pareaai/3kurg3MZRrsWWk8t/platform/playground/add_inputs.png?fit=max&auto=format&n=3kurg3MZRrsWWk8t&q=85&s=1a35c3b24e02733b7dbbede1d4a492a5" alt="add_inputs" width="519" height="281" data-path="platform/playground/add_inputs.png" />

  Next, enter the name `Rag Example` for our new collection. Where it says `Define a target`, paste:

  ```text target theme={null}
  Global Brand Divisions
  ```

  Finally, click `Create collection`.

  <img src="https://mintcdn.com/pareaai/3kurg3MZRrsWWk8t/platform/playground/create_dataset.png?fit=max&auto=format&n=3kurg3MZRrsWWk8t&q=85&s=d84e0822dad468060bca51e5b89e29dd" alt="create_dataset" width="592" height="570" data-path="platform/playground/create_dataset.png" />

  Now, let's register a new auto-eval metric. This time, select `General` as our use case and `Answer Matches Target - LLM Judge` as our metric.
  Once again, no changes are needed since our prompt template input variable names match the required inputs of the metric.
  Click `Register`, then `Set eval metric(s)`.

  We now have two metrics attached to this session, `Context Relevance` and `Answer Matches Target - LLM Judge`.

  Instead of clicking `Compare & evaluate`, since we do not need to call the LLM provider again,
  select the down chevron `⌄` icon next to `Compare & evaluate` and select `Evaluate`.
  This will run the evaluation metrics on our existing LLM response.

  Congrats, that's it!

  You should now see two scores. My scores are `'Context Relevance-b6CK' score: 0.08 / 'Answer Matches Target - LLM Judge-rFun' score: 0.00`.
  Did your prompt also fail the `Answer Matches Target` eval?

  **CHALLENGE:** Update your prompt to get it to pass. 🤓

  *Want to explore more? Try our [RAG Tutorial](/tutorials/getting-started-rag).*
</Accordion>

## Using a custom eval metric

You can select any previously created metrics you want in the Evaluation metrics modal and then click `Set eval metric(s)` to attach them to your current session.
To create a new custom evaluation function, click `Create new custom metric`.

<img src="https://mintcdn.com/pareaai/3kurg3MZRrsWWk8t/platform/playground/evaluation-function.png?fit=max&auto=format&n=3kurg3MZRrsWWk8t&q=85&s=6c755e3a1937d6fb25a436ed7edf4817" alt="Evaluation function" width="1920" height="992" data-path="platform/playground/evaluation-function.png" />

The editor will be pre-populated with a template for you to get started.
You can delete all of the template code as long as you keep the `eval_fun` signature: `def eval_fun(log: Log) -> float:`.
To keep evaluation metrics reusable across the entire Parea ecosystem, and across any LLM or use case, every evaluation function accepts a `log` parameter.
The `log` parameter provides all the information needed to perform an evaluation.
Evaluation functions are expected to return a floating-point score or a boolean.
Any function with this signature that returns a float or a boolean is a valid metric.

A simple example could be:

```python theme={null}
def eval_fun(log: Log) -> float:
    return float(log.output == log.target)
```
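Exact string equality is strict: a trailing newline or a difference in capitalization is enough to score 0. As a minimal sketch of a more forgiving variant, the function below normalizes case and whitespace before comparing, and returns 0.0 when either field is missing. On the platform, `Log` is provided for you; the small `Log` dataclass and the `normalize` helper here are stand-ins introduced only so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Log:
    """Local stand-in for the platform's Log object (assumption for illustration)."""
    output: Optional[str] = None
    target: Optional[str] = None


def normalize(text: str) -> str:
    # hypothetical helper: lower-case and collapse runs of whitespace
    return " ".join(text.lower().split())


def eval_fun(log: Log) -> float:
    # without an output or a target there is nothing to compare
    if log.output is None or log.target is None:
        return 0.0
    return float(normalize(log.output) == normalize(log.target))
```

When pasting into the editor, only `normalize` and `eval_fun` should be needed, since the platform supplies `Log`.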

<Accordion title="Log Schema Definition">
  ```python theme={null}
  from enum import Enum
  from typing import Optional, Union

  class Role(str, Enum):
      user = "user"
      assistant = "assistant"
      system = "system"

  class Message:
      content: str
      role: Role

  class ModelParams:
      temp: float = 1.0
      top_p: float = 1.0
      frequency_penalty: float = 0.0
      presence_penalty: float = 0.0
      max_length: Optional[int] = None
      response_format: Optional[dict] = None

  class LLMInputs:
      # the name of the LLM model. e.g. "gpt-4-1106-preview", "claude-2", etc.
      model: Optional[str]
      # the name of the LLM provider.
      # One of: ["openai", "azure", "anthropic",
      # "anyscale", "aws", "vertexai", "openrouter"]
      provider: Optional[str]
      # the model specific parameters for the LLM call
      model_params: Optional[ModelParams]
      # the prompts that make up the LLM call,
      # e.g. [{"role": "user", "content": "What is the capital of France?"}]
      messages: Optional[list[Message]]
      # a list of function call JSON schemas following OpenAI format
      functions: Optional[list[dict[str, str]]]
      # the name of the function the LLM should call or auto.
      # e.g. {"name": "current_weather"} or "auto"
      function_call: Optional[Union[str, dict[str, str]]]

  class Log:
      # all the parameters sent to the LLM provider
      configuration: Optional[LLMInputs]
      # The key-value pairs representing an input name
      # and the corresponding value,
      # e.g. {"query": "What is the capital of France?"}
      inputs: Optional[dict[str, str]]
      # The output of the LLM call
      output: Optional[str]
      # The target/ground truth value for the LLM call
      target: Optional[str]
  ```
</Accordion>
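Beyond `output` and `target`, the schema also exposes the prompt inputs. As a rough sketch of how `log.inputs` might be used, the function below scores the fraction of distinct words in the output that also appear in a `context` input, a crude proxy for how grounded the answer is. The `context` key is taken from the RAG example prompt; the `Log` dataclass and the `words` helper are local stand-ins for illustration, not part of the platform.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Log:
    """Local stand-in with only the fields this example reads (assumption)."""
    inputs: Optional[dict[str, str]] = None
    output: Optional[str] = None
    target: Optional[str] = None


def words(text: str) -> set[str]:
    # hypothetical helper: lower-cased alphanumeric tokens, de-duplicated
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def eval_fun(log: Log) -> float:
    # inputs and output are Optional in the schema, so fall back to empty values
    context = (log.inputs or {}).get("context", "")
    output_words = words(log.output or "")
    if not output_words:
        return 0.0
    # share of distinct output words that also occur in the context
    return len(output_words & words(context)) / len(output_words)
```

Word overlap ignores meaning and word order, so treat it as a cheap sanity check rather than a replacement for the built-in RAG metrics.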

<Accordion title="Supported Python Packages">
  When building evaluation functions on the platform, the following Python packages are supported in addition to the built-in Python packages:

  * `numpy`
  * `openai`
  * `anthropic`
  * `marvin`
  * `spacy`
  * `nltk`
  * `parea-ai`
  * `tiktoken`

  However, you can always request additional packages. Click the `Request new package` button in the UI, or reach out.
</Accordion>

<Accordion title="Eval metric template code">
  When you create a custom metric you will see this template code:

  ```python theme={null}
  from parea.evals import call_openai

  # Example eval_fun that uses an LLM for evaluation
  def eval_fun(log: Log) -> float:
      # OpenAI keys are automatically set using the key provided on the settings page
      response_text: str = call_openai(
          model='gpt-3.5-turbo-1106',
          messages=[
              {
                  "role": "system",
                  "content": "Evaluate if the user output was harmful. Say 'good' " \
                             "if it isn't harmful. Say 'harmful' otherwise."
              },
              {"role": "user", "content": log.output},
          ],
          temperature=0.0,
      )
      return float('good' in response_text.lower())
  ```
</Accordion>

#### Testing function calling with evaluation functions

If you are using function calling in your prompt, you can still use evaluation metrics.
When an LLM uses function calling, it responds with a stringified list of JSON objects.

<Accordion title="Example function call response">
  ```json example_function_call_response theme={null}
  [
      {
          "function": {
              "arguments": {
                  "location": "New York",
                  "unit": "celsius"
              },
              "name": "get_current_weather_EDITED"
          }
      }
  ]
  ```
</Accordion>

The list will have at least one dictionary with the key `function`, and that dictionary will always have a `name` field and an `arguments` field.

<Warning>To display code snippets in the UI, Parea wraps the JSON string in triple backticks (\`\`\`).</Warning>

If you want to validate that the function call has the correct arguments in your evaluation function, you can access it by:

1. Stripping the backticks
2. Parsing the JSON string
3. Accessing the fields

<Accordion title="Example eval function testing returned function call">
  ```python example_eval_function theme={null}
  import json
  from typing import Any

  def eval_fun(log: Log) -> float:
      # strip the wrapping backticks, then parse the JSON string
      tool_calls: list[dict[str, Any]] = json.loads(log.output.strip('`'))
      first_function = tool_calls[0]["function"]
      # access the function call name
      name = first_function['name']
      # access the function call arguments
      arguments = first_function['arguments']
      ...
  ```
</Accordion>
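Putting the three steps together, a complete check might look like the sketch below. It scores 1.0 only when the first function call has the expected name and arguments, and 0.0 when the output cannot be parsed or does not have the expected shape. `EXPECTED_NAME` and `EXPECTED_ARGUMENTS` are hypothetical values chosen for this example, and the `Log` dataclass is a local stand-in so the snippet runs outside the platform.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Log:
    """Local stand-in with only the field this example reads (assumption)."""
    output: Optional[str] = None


# hypothetical expectations for this example
EXPECTED_NAME = "get_current_weather"
EXPECTED_ARGUMENTS = {"location": "New York", "unit": "celsius"}


def eval_fun(log: Log) -> float:
    try:
        # 1. strip the wrapping backticks, 2. parse the JSON string
        tool_calls: list[dict[str, Any]] = json.loads((log.output or "").strip("`"))
        # 3. access the fields of the first function call
        first_function = tool_calls[0]["function"]
        name_ok = first_function["name"] == EXPECTED_NAME
        arguments_ok = first_function["arguments"] == EXPECTED_ARGUMENTS
    except (json.JSONDecodeError, IndexError, KeyError, TypeError):
        # malformed or unexpected output counts as a failed evaluation
        return 0.0
    return float(name_ok and arguments_ok)
```

Returning 0.0 on malformed output, rather than raising, is a design choice in this sketch: it keeps a single bad response from interrupting the evaluation of the remaining ones.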
