Evaluations in Trace
Attach evaluations to a trace to identify failure cases
You can attach evaluation metrics to a trace to quantify the quality of the respective component of your LLM app, i.e., perform online evaluation. This allows you, for example, to filter the dashboard by low scores.
The scores for any step of a trace are visualized on the right side of the trace (top image).
All scores are aggregated across logs over time in a chart at the top of the dashboard (bottom image).
Note: in Python, the logs of the evaluation are automatically attached to the trace. You can deactivate this behavior by setting the environment variable TURN_OFF_PAREA_EVAL_LOGGING to True.
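For example, in Python you could set the variable before initializing the Parea client (a minimal sketch; it assumes the SDK reads the environment at startup):

```python
import os

# Disable automatic attachment of evaluation logs to the trace.
# Assumption: set this before the Parea SDK reads the environment,
# i.e., before creating the Parea client.
os.environ["TURN_OFF_PAREA_EVAL_LOGGING"] = "True"
```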
There are two ways to attach evaluations to a trace:
- Using evaluation functions from your code base
- Using evaluation functions created on the platform
Using evaluation functions from your code base
You can define evaluation functions locally in your codebase.
The evaluation function must receive a Log object and return a float or boolean value.
The evaluation function will be executed in a non-blocking way in a separate thread, and the results will be logged.
An example implementation is shown below:
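The following is a minimal sketch; it assumes the Log schema is importable from parea.schemas.log and that the trace decorator accepts an eval_funcs list (check the SDK reference for the exact import paths in your version):

```python
import os

from parea import Parea, trace
from parea.schemas.log import Log  # assumed import path for the Log schema

# Initialize the client so traced calls (and eval scores) are logged to Parea.
p = Parea(api_key=os.getenv("PAREA_API_KEY"))


def is_concise(log: Log) -> float:
    """Toy eval: returns 1.0 if the output stays under 100 characters, else 0.0."""
    return float(len(log.output or "") < 100)


# The eval runs non-blocking in a separate thread after the traced call returns,
# and its score is attached to this step of the trace.
@trace(eval_funcs=[is_concise])
def summarize(text: str) -> str:
    # ... call your LLM here; a static placeholder keeps the sketch runnable ...
    return text[:80]
```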
For a full working example, check out the Python cookbook.
Using Pre-built SOTA evaluation functions
Parea provides a set of state-of-the-art evaluation metrics that you can plug into your evaluation process. Their motivation and the underlying research are discussed in the blog post on reference-free and reference-based evaluation metrics.
You can reuse these evals in Python by importing the respective evaluation function from the parea.evals module and attaching it to the trace decorator.
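A hedged sketch, assuming a pre-built metric such as levenshtein lives under parea.evals.general (the exact module and function names may differ; see the parea.evals reference). Reference-based metrics like this one compare log.output against log.target, so a target must be supplied when logging:

```python
from parea import trace
from parea.evals.general import levenshtein  # assumed location of the pre-built eval


# Pre-built evals attach to the trace decorator exactly like local ones.
@trace(eval_funcs=[levenshtein])
def answer_question(question: str) -> str:
    # ... call your LLM here ...
    return "Paris"
```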
Using evaluation functions created on the platform
After creating an evaluation function on the platform, you can use it to automatically track the performance of the components of your LLM app. For that, simply wrap the function you want to track with the trace decorator, and the evaluation function will be executed in the backend in a non-blocking way:
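A minimal sketch, assuming platform-created evals are referenced by name through an eval_funcs_names argument on the trace decorator (the parameter name and the "Answer Relevancy" eval are assumptions; consult the SDK reference and use the name you gave your eval on the platform):

```python
from parea import trace


# "Answer Relevancy" stands in for the name of an eval you created on the platform.
# The eval executes in Parea's backend, so it never blocks this function.
@trace(eval_funcs_names=["Answer Relevancy"])
def generate_answer(question: str) -> str:
    # ... call your LLM here ...
    return "The capital of France is Paris."
```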
For a full example, you can view our Python cookbook or TypeScript cookbook.
Run evaluations on a sample of logs
You can also limit how many logs evaluations run on by setting a sampling rate via the apply_eval_frac (Python) or applyEvalFrac (TypeScript) argument. This is useful if you want to reduce the cost of evaluating.
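For instance, in Python you could evaluate roughly 25% of logs (a sketch reusing the toy is_concise eval from the earlier example):

```python
from parea import trace
from parea.schemas.log import Log  # assumed import path for the Log schema


def is_concise(log: Log) -> float:
    """Toy eval: 1.0 if the output stays under 100 characters."""
    return float(len(log.output or "") < 100)


# Every call is still traced; only ~25% of logs have the eval executed.
@trace(eval_funcs=[is_concise], apply_eval_frac=0.25)
def summarize(text: str) -> str:
    # ... call your LLM here ...
    return text[:80]
```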