This guide walks you through evaluating your LLM application step by step. We assume that you are comfortable running experiments in Python or TypeScript and have an API key. If you haven’t already, make sure to read the evaluation quickstart first.

Conceptualizing Testing & Evaluating LLM Applications

Because LLMs have non-deterministic behavior, it’s critical to understand the accuracy of your LLM app while you’re developing, and before you ship. It’s also important to track performance over time. It’s unlikely that you’ll ever hit 100% correctness, but for the metrics you care about, you should be steadily improving. Parea’s experiments capture this in a simple, effective workflow that enables you to ship more reliable, higher-quality products. To evaluate your LLM app, you need some data (10 samples are sufficient) and a function that executes a task. Your data needs to contain the inputs to the function as key-value pairs and can optionally contain a ground-truth value for that sample (indicated by the key target). While 10 samples are sufficient to get started, it’s useful to keep adding samples whenever you identify new failure cases in your production traffic. The goal of evaluation is to assess each component and improve it over time. In practice, it’s best to assume that your data is noisy, the LLM is imperfect, and the evaluation methods are a little bit wrong.
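For example, if your task function takes a single name argument, the test data could look like this (a minimal sketch; the target key is optional):
# each sample maps the function's input parameter names to values;
# "target" optionally holds the ground truth for that sample
data = [
    {"name": "Foo", "target": "Hi Foo"},
    {"name": "Bar", "target": "Hello Bar"},
]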

Evaluation Metrics

You can either use pre-built SOTA evaluation metrics or define your own custom evaluation metrics. You attach evaluation metrics to your function via the trace decorator; the evaluation metric is then executed automatically on the output of the function, non-blocking in the background, after the function execution has finished. You can apply evaluations at any “level” of your application by attaching them to the corresponding function decorated with trace. This is useful for understanding quality at different levels of granularity (“Was the right context extracted?” vs. “Was the answer correct?”).

Using Pre-built SOTA Evaluations

Parea provides a set of state-of-the-art evaluation metrics that you can plug into your evaluation process. The motivation and research behind them are discussed in the blog post on reference-free and reference-based evaluation metrics. Here is an overview:
  • levenshtein: calculates the number of character edits needed to turn the generated output into the target and normalizes it by the length of the output; more here
  • llm_grader: leverages a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1-10; more here
  • answer_relevancy: measures how relevant the generated response is to the given question; more here
  • self_check: measures how well the LLM call is self consistent when generating multiple responses; more here
  • lm_vs_lm_factuality: uses another LLM to examine original LLM response for factuality; more here
  • semantic_similarity: calculates the cosine similarity between output and ground truth; more here
  • context_query_relevancy: calculates the percentage of sentences in the context that are relevant to the query; more here
  • context_ranking_pointwise: measures how well the retrieved contexts are ranked by relevancy to the given query by pointwise estimation; more here
  • context_ranking_listwise: measures how well the retrieved contexts are ranked by relevancy to the given query by listwise estimation; more here
  • context_has_answer: classifies if the retrieved context contains the answer to the query; more here
  • answer_context_faithfulness_binary: classifies if the answer is faithful to the context; more here
  • answer_context_faithfulness_precision: calculates how many tokens in the generated answer are also present in the retrieved context; more here
  • answer_context_faithfulness_statement_level: calculates the percentage of statements from the generated answer that can be inferred from the context; more here
  • goal_success_ratio: measures how many turns a user has to converse on average with your AI assistant to achieve a goal; more here
  • factual_inconsistency_binary: classifies if a summary is factually inconsistent with the original text; more here
  • factual_inconsistency_scale: grades the factual consistency of a summary with the article on a scale from 1 to 10; more here
  • likert_scale: grades the quality of a summary on a Likert scale from 1-5 along the dimensions of relevance, consistency, fluency, and coherence; more here
Sometimes the exposed evaluation metrics are actually factory methods that return the evaluation function; this is needed to configure the evaluation metric for your use case (see the factory sketch after the snippet below).
import os

from parea import Parea, trace
from parea.evals.general import levenshtein

p = Parea(api_key=os.getenv("PAREA_API_KEY"))

# annotate function with the trace decorator and pass the evaluation function(s)
@trace(eval_funcs=[levenshtein])
def greeting(name: str) -> str:
    return f"Hello {name}"

# define the experiment and run it
p.experiment(...)
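A factory-based metric could be configured like this. This is a hedged sketch: it assumes a factory such as llm_grader_factory is exposed in parea.evals.general and accepts a model argument; check the SDK for the exact factory names and signatures.
from parea import trace
from parea.evals.general import llm_grader_factory  # assumed factory name; verify in the SDK

# configure the metric for your use case; the factory returns the evaluation function
llm_grader_gpt4 = llm_grader_factory(model="gpt-4")

@trace(eval_funcs=[llm_grader_gpt4])
def answer_question(question: str) -> str:
    return "Paris"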

Custom Evaluations

You can also define your own evaluation metrics. They can be as simple as checking if a certain word is in the output as shown below.
from parea import trace
from parea.schemas import Log

def eval_harmfulness(log: Log) -> bool:
    return "bad word" not in log.output

@trace(eval_funcs=[eval_harmfulness])
def experiment_entry_point(msg: str) -> str:
    return "good word"
To ensure that your evaluation metrics are reusable across the entire Parea ecosystem, and with any LLM model or use case, we introduced the log parameter. All evaluation functions accept the log parameter, which provides all the information needed to perform an evaluation. Evaluation functions are expected to return floating-point scores or booleans.
class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"

class Message:
    content: str
    role: Role

class ModelParams:
    temp: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_length: Optional[int] = None
    response_format: Optional[dict] = None

class LLMInputs:
    # the name of the LLM model. e.g. "gpt-4-1106-preview", "claude-2", etc.
    model: Optional[str]
    # the name of the LLM provider.
    # One of: ["openai", "azure", "anthropic",
    # "anyscale", "aws", "vertexai", "openrouter"]
    provider: Optional[str]
    # the model specific parameters for the LLM call
    model_params: Optional[ModelParams]
    # the prompts that make up the LLM call,
    # e.g. [{"role": "user", "content": "What is the capital of France?"}]
    messages: Optional[list[Message]]
    # a list of function call JSON schemas following OpenAI format
    functions: Optional[list[dict[str, str]]]
    # the name of the function the LLM should call or auto.
    # e.g {"name": "current_weather"} or "auto"
    function_call: Optional[Union[str, dict[str, str]]]

class Log:
    # all the parameters sent to the LLM provider
    configuration: Optional[LLMInputs]
    # The key-value pairs representing an input name
    # and the corresponding value,
    # e.g. {"query": "What is the capital of France?"}
    inputs: Optional[dict[str, str]]
    # The output of the LLM call
    output: Optional[str]
    # The target/ground truth value for the LLM call
    target: Optional[str]

Skipping Evaluation Metrics

Sometimes it’s useful to apply evaluation metrics only to a subset of the data. To do that, return None (Python) / null (TypeScript) from the evaluation function to skip the evaluation for that log.
def my_eval_name(log: Log):
    if log.inputs.get("key") == "value-to-skip":
        return None
    return log.output == log.target

Return a Reason

While numerical scores are useful for a quick grasp of how good a response is, it’s also helpful to provide a reason for why a response is bad. Especially if you use an LLM-based evaluation metric, it’s good to pass on the reasons why the LLM assigned a certain score. You can do that by returning an EvaluationResult object with a reason attribute. Updating the example from above:
from parea.schemas import EvaluationResult, Log

def eval_harmfulness(log: Log) -> EvaluationResult:
    if "bad word" in log.output:
        return EvaluationResult(name="harmfulness", score=0.0, reason="bad word found")
    return EvaluationResult(name="harmfulness", score=1.0)

@trace(eval_funcs=[eval_harmfulness])
def experiment_entry_point(msg: str) -> str:
    return "good word"
The full definition of the EvaluationResult object is as follows:
class EvaluationResult:
    name: str
    score: float
    reason: Optional[str] = None

Returning Multiple Scores

Sometimes there is shared work between different evaluation metrics, in which case returning multiple scores from a single evaluation function is better. You can do that by returning a list of EvaluationResult objects, as shown in the sketch below.
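For example, a single evaluation function could do the shared work once (here, splitting the output into words) and derive several scores from it. This is a minimal sketch with illustrative score names:
from parea.schemas import EvaluationResult, Log

def eval_response_quality(log: Log) -> list[EvaluationResult]:
    output = log.output or ""
    words = output.split()  # shared work reused by both scores
    return [
        EvaluationResult(name="non_empty", score=float(len(words) > 0)),
        EvaluationResult(name="concise", score=float(len(words) <= 50), reason=f"{len(words)} words"),
    ]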

Running & Naming of Experiments

Depending on whether you use Python or TypeScript, you can name and run experiments in the ways shown below. You need to name an experiment and can optionally specify the name of a particular run. Note that the run name may only contain alphanumeric characters, dashes, and underscores and needs to be unique within each project. After an experiment is finished, you will see an overview of the average stats and a link to the experiment page with full details.
In Python, you can define an experiment by calling the experiment method of the Parea client p and run it by calling the run method of the experiment. You can optionally specify the name of the run by passing it to the run method.
p.experiment(
    name=experiment_name, # name of the experiment
    data=data,  # test data to run the experiment on (list of dicts)
    func=func,  # function to run (callable)
).run()         # you can optionally specify the run name by passing `run_name` to the run method
Alternatively, you can also use the CLI to run an experiment:
parea experiment path/to/experiment_file.py
In this case the experiment file should contain the experiment definition but not call the run method. You can optionally specify the run name of the experiment using the --run_name flag in the CLI.
p.experiment(
    name=experiment_name, # name of the experiment
    data=data,  # test data to run the experiment on (list of dicts)
    func=func,  # function to run (callable)
)
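For example, to run this file and name the run (the file path is illustrative):
parea experiment path/to/experiment_file.py --run_name my-first-run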
Within projects, you can organize your experiments by experiment name. You can filter the experiments table by the experiment name to see all runs of that experiment; this will also filter the graph of historical experiment eval scores by that name.

Organize Experiments by Projects

You can organize your experiments by projects. By default, all logs, traces & experiments are grouped in the default project. You can specify a project name when you initialize the Parea client; this will automatically create a new project with that name if it didn’t exist before. Note that the project name may only contain alphanumeric characters, dashes, and underscores.
import os

from parea import Parea

p = Parea(api_key=os.getenv("PAREA_API_KEY"), project_name='my-project')
You can toggle between projects on the platform by clicking on the project name in the collapsible sidebar on the left.

Testing Sub-steps

When building RAG applications, it’s useful to test the retrieval and generation steps separately to pinpoint what goes wrong. Similarly, when building agents, you typically want to test every individual step for the expected output to catch cascading failures. You can test sub-steps with Parea by attaching the trace decorator, together with the corresponding eval, to the respective functions. This creates scores for every sub-step when the entry-point function runs (whether you run an experiment or call the function directly). To access different targets/expected outputs for sub-steps, you can pass the target as a dictionary when defining the experiment. Note that this will convert the target to a string, so you need to convert it back to a dictionary in the evaluation function in order to access the sub-step target. See the Python example below.
import json
import os
from typing import Union

from dotenv import load_dotenv

from parea import Parea, trace
from parea.evals.general.levenshtein import levenshtein_distance
from parea.schemas import Log

load_dotenv()

p = Parea(api_key=os.getenv("PAREA_API_KEY"))


# evaluation function for the substep
def eval_choose_greeting(log: Log) -> Union[float, None]:
    if not (target := log.target):
        return None

    target_substep = json.loads(target)["substep"]  # log.target is a string
    output = log.output
    return levenshtein_distance(target_substep, output)


# sub-step
@trace(eval_funcs=[eval_choose_greeting])
def choose_greeting(name: str) -> str:
    return "Hello"


# end-to-end evaluation function
def eval_greet(log: Log) -> Union[float, None]:
    if not (target := log.target):
        return None

    target_overall = json.loads(target)["overall"]
    output = log.output
    return levenshtein_distance(target_overall, output)


@trace(eval_funcs=[eval_greet])
def greet(name: str) -> str:
    greeting = choose_greeting(name)
    return f"{greeting} {name}"


data = [
    {
        "name": "Foo",
        "target": {
            "overall": "Hi Foo",
            "substep": "Hi",
        },
    },
    {
        "name": "Bar",
        "target": {
            "overall": "Hello Bar",
            "substep": "Hello",
        },
    },
]


if __name__ == "__main__":
    p.experiment(
        name="greeting",
        data=data,
        func=greet,
    ).run()

Trials

Given the non-deterministic behavior of LLMs, it can be useful to run your function multiple times per input to check response consistency, i.e., to run multiple trials. You can do that by specifying the n_trials parameter when defining an experiment:
p.experiment(
    ...,
    n_trials=3,
)

Debug Individual Traces

You can debug individual traces to understand how your LLM app produced an output for specific inputs. You do that by clicking on a row in the log table of the experiment. In the trace view, you can step through the trace (left sidebar), see the inputs, outputs, and metadata of the selected step (middle), and view any metrics and scores (right side).

Use Saved Datasets

When running an experiment, you can use your datasets saved on Parea. For the data field, just provide the name of the dataset as defined on the Datasets tab. The dataset should have column names that match the input parameters of the function you are running the experiment on. Note that the dataset name will be automatically stored under the “Dataset” key in the experiment metadata.
p.experiment(
    name='Experiment Name',
    data="Dataset Name",
    func=func,
)

Dataset Level Evaluation

Sometimes it’s useful to aggregate the evaluation scores over the entire dataset, e.g., to compute balanced accuracy. You can do that by specifying the dataset_level_evals (Python) / datasetLevelEvals (TypeScript) parameter when defining an experiment:
p.experiment(
    ...,
    dataset_level_evals=[...],  # list of evaluation functions applied on the entire dataset
)
The evaluation functions will receive a list of EvaluatedLog objects and are expected to return a single floating-point score or a boolean. The EvaluatedLog object is a subclass of the Log object with an additional scores attribute. The resulting scores will be attached to the experiment and can be viewed in the overview table and in the detailed view of the experiment.
class EvaluationResult:
    name: str
    score: float


class EvaluatedLog(Log):
    scores: Optional[list[EvaluationResult]] = None
import os
from collections import defaultdict

from dotenv import load_dotenv

from parea import Parea, trace
from parea.schemas import EvaluatedLog, Log

load_dotenv()

p = Parea(api_key=os.getenv("PAREA_API_KEY"))


def is_correct(log: Log) -> bool:
    return log.target == log.output


def balanced_acc_is_correct(logs: list[EvaluatedLog]) -> float:
    score_name = is_correct.__name__

    correct = defaultdict(int)
    total = defaultdict(int)
    for log in logs:
        if (eval_result := log.get_score(score_name)) is not None:
            correct[log.target] += int(eval_result.score)
            total[log.target] += 1
    recalls = [correct[key] / total[key] for key in correct]

    return sum(recalls) / len(recalls)


@trace(eval_funcs=[is_correct])
def starts_with_f(name: str) -> str:
    if name == "Foo":
        return "1"
    return "0"


data = [
    {
        "name": "Foo",
        "target": "1",
    },
    {
        "name": "Bar",
        "target": "0",
    },
    {
        "name": "Far",
        "target": "1",
    },
]  # test data to run the experiment on (list of dicts)


# You can optionally run the experiment manually by calling `.run()`
if __name__ == "__main__":
    p.experiment(
        name='Dataset Level Evals', data=data, func=starts_with_f, dataset_level_evals=[balanced_acc_is_correct]
    ).run()

To simplify the process of defining dataset level evaluations, Parea provides a set of pre-built dataset level evaluations:
  • balanced_accuracy: The balanced accuracy of a score for the experiment; more here

Investigating Relationship Between Statistics

Sometimes it is useful to understand the relationship between different evaluation metrics and manual annotations, for example to assess how well your eval correlates/agrees with manual annotations. You can view a scatter plot of the relationship between two stats by selecting “Relationship between scores” in the dropdown above the graph in the detailed experiment view. Additionally, you can view the accuracy & correlation between the two selected variables and add that value to the experiment scores. Going back to the example of assessing how much your eval agrees with manual annotations, you can now track the quality of your eval over time and align it better with manual annotations.

Sharing Experiments Publicly

All your experiments are shared within your organization by default and are not publicly accessible. You can share experiments publicly by clicking the Share button at the top right of the experiment page. This generates a link of the format https://app.parea.ai/public-experiments/<org_slug>/<project_name>/<experiment_uuid> which anyone can access. You can compare all public experiments in a project under https://app.parea.ai/public-experiments/<org_slug>/<project_name>.

Experiment Code Management

As you iterate on your LLM app and test the changes, keeping track of which change led to which result is cumbersome. For that, Parea provides an integration with DVC’s experiment tracking. This lets you iterate on your LLM app without polluting your git history with a commit for every experiment while still retaining the ability to revert your code to the state of any experiment. Once integrated, every time you run an experiment, the state of the workspace together with the associated metrics is automatically captured. This enables you to compare experiment metrics and to revert your workspace to the state of any of those experiments.
This is currently only supported for Python. Please, reach out if you want that functionality for TypeScript.
1. Setup DVC

Install DVC and initialize it via
dvc init
2. Integrate Parea with DVC

Run the following command to integrate Parea with DVC and commit any files to git which the command creates:
parea dvc-init
This command will check if DVC is installed as well as create a .parea directory with a dvc.yaml and a metrics.json file if they don’t exist. The dvc.yaml file will point to the metrics.json file and the metrics.json file will contain the metrics of the experiments. Both files are necessary for DVC. You can always re-run the command to check if the integration is set up properly.
After integrating Parea with DVC, every time you run an experiment, Parea will capture the state of the workspace together with the associated metrics. To compare all experiments run since the last commit, run the following command:
dvc exp show
To revert the code to the state of an experiment, run the following using the run name of that experiment:
dvc exp apply <run-name>
You can learn more about dvc exp show here, and dvc exp apply here.

Add Metadata to Experiments

When running an experiment, you can add metadata to it by passing a dictionary. This metadata will be displayed in the experiment overview table and can be used to filter and search for experiments.
p.experiment(
    ...,
    metadata={"Dataset": "Hello World Test"},
).run()

Comparing Experiments

You can select two or more experiments in the experiment overview section and click the Compare button to compare them. This opens a new view showing a high-level comparison of the experiment scores at the top and a side-by-side view of the individual results of the experiments on the same samples at the bottom. The high-level comparison consists of 2-3 cards. The first card is only shown when exactly two experiments are compared; it displays, for every evaluation metric, how the average and standard deviation changed, as well as the number of improvements and regressions. The second card compares the evaluation metric averages for every experiment as a bar plot. The third card shows the distribution of the selected evaluation metric as a histogram.

Controlling Parallelism

You can specify how many samples your experiment is executed on in parallel by setting the n_workers (Python) / nWorkers (TypeScript) parameter when defining an experiment.
p.experiment(
    ...,
    n_workers=3,
)

Integrate into CI/CD

After creating an experiment and evaluating your LLM app, you can integrate the experiment into your CI/CD pipeline as a test.
e = p.experiment(
    name="CI/CD",
    data=data,
    func=func,
)
e.run()
assert all(score > 0.5 for score in e.avg_scores.values()), "Some scores are below 0.5"

Running Experiments in Jupyter Notebooks

In order to execute experiments in a Jupyter notebook, you will need to install nest-asyncio and apply it:
!pip install nest-asyncio
import nest_asyncio
nest_asyncio.apply()
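Putting it together, a complete notebook cell could look like this (a minimal sketch reusing the earlier greeting example; the experiment name notebook-demo is illustrative):
import os

import nest_asyncio
nest_asyncio.apply()

from parea import Parea, trace
from parea.evals.general import levenshtein

p = Parea(api_key=os.getenv("PAREA_API_KEY"))

@trace(eval_funcs=[levenshtein])
def greeting(name: str) -> str:
    return f"Hello {name}"

p.experiment(name="notebook-demo", data=[{"name": "Foo", "target": "Hello Foo"}], func=greeting).run()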