Automatically create LLM evals aligned with manual annotations
To create the eval, you map the columns of your annotated data to the input, the output, and the annotation. Note, there must be exactly one column mapped to output and one to annotation. Additionally, the annotation values need to match the annotation criterion: for a categorical criterion, the annotation column may only contain the defined categories, and for a continuous criterion, the values must be numbers between the min. and max. score (inclusive).
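As a concrete illustration, the sketch below writes a small dataset that satisfies these constraints for a categorical criterion. The criterion, its categories ("coherent" / "incoherent"), the column names, and the CSV format are assumptions made for the example, not requirements of Parea.

```python
# Hypothetical annotated dataset: one column each for input, output, and annotation.
# The annotation values are limited to the categories defined for the criterion.
import csv

rows = [
    {"input": "Summarize: The cat sat on the mat.",
     "output": "A cat was sitting on a mat.",
     "annotation": "coherent"},
    {"input": "Summarize: Quarterly revenue grew by 12%.",
     "output": "Revenue mat cat twelve.",
     "annotation": "incoherent"},
]

with open("annotated_logs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "annotation"])
    writer.writeheader()
    writer.writerows(rows)
```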
The LLM eval is named after the annotation criterion, suffixed with "(LLM)", and you reference it by that name in the evaluation_metric_names (Python & Curl) / evalFuncNames (TypeScript) parameter of the trace decorator / REST API call.
So, if the annotation criterion is named “coherence”, the LLM eval would be named “coherence (LLM)”.
When you add the eval to the trace decorator or REST API call, Parea will automatically execute the eval on that log.
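As a minimal sketch of what this could look like in Python, assuming the parameter name given above (check your SDK version for the exact keyword) and a placeholder API key and function body:

```python
# Sketch only: the eval name follows the "<criterion> (LLM)" convention above,
# and the parameter name mirrors the docs; exact signatures may vary by SDK version.
from parea import Parea, trace

p = Parea(api_key="YOUR_PAREA_API_KEY")  # placeholder key

@trace(evaluation_metric_names=["coherence (LLM)"])
def summarize(article: str) -> str:
    # Call your LLM here; a constant string keeps the sketch self-contained.
    return "A short, coherent summary of the article."

summarize("Quarterly revenue grew by 12% compared to last year.")
```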
Below you can see an in-app screenshot of how to run the LLM eval.
You can inspect how well the LLM eval aligns with your manual annotations in the Analyze Eval tab.
In that tab, you will see summary statistics such as (balanced) accuracy, Pearson correlation, and average absolute error.
Additionally, you can see either a confusion matrix or a scatter plot of the manual vs. LLM judge annotation.
A confusion matrix is shown if the annotation criterion is categorical, or continuous with fewer than 11 unique values; otherwise, a scatter plot is shown.
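Conceptually, these alignment statistics correspond to standard metrics comparing the manual labels with the LLM judge's scores; the snippet below reproduces them with common libraries on made-up data for a continuous 1-5 criterion (the numbers are purely illustrative):

```python
# Illustrative computation of the alignment statistics for a continuous 1-5 criterion.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import balanced_accuracy_score

manual = np.array([5, 3, 4, 2, 5, 1])  # manual annotations
llm = np.array([5, 3, 3, 2, 4, 1])     # LLM judge scores for the same logs

r, _ = pearsonr(manual, llm)
print("balanced accuracy:   ", balanced_accuracy_score(manual, llm))
print("Pearson correlation: ", r)
print("avg absolute error:  ", np.mean(np.abs(manual - llm)))
```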
Finally, you can run the LLM eval inside the app from the Run Eval tab.