Joschka Braun on Jul 11, 2024

Evaluating generated text is very hard. Once assertion-based evaluation no longer suffices, one is left to either have subject-matter experts manually review responses or to use LLMs to evaluate them. The former is very expensive and slow, which limits the number of experiments one can run and thus the insights one can gain. The latter, while fast and comparatively cheap, is not necessarily aligned with domain-expert judgement (cf. JudgeBench) and can thus lead to unreliable results.

To alleviate these limitations, we introduce aligned, self-improving LLM evals. Using manually annotated responses, an LLM eval is created that imitates the human review. In addition, this feature reduces the annotation workload for subject-matter experts and increases the iteration speed of LLM teams. This work is greatly inspired by Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences by Shreya Shankar, J.D. Zamfirescu-Pereira, Ian Arawjo, and others, and it leverages DSPy to automatically create the LLM evals.
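To make this concrete, below is a minimal sketch (not our exact implementation) of how an LLM eval can be compiled with DSPy against human-annotated responses. The signature fields, the agreement metric, and the optimizer settings are illustrative assumptions; the idea is simply that DSPy searches for a prompt (and few-shot demos) whose verdicts agree with the human annotations.

```python
# Minimal sketch: compile an LLM eval with DSPy so that its judgements
# imitate human annotations. Field names, metric, and optimizer settings
# are illustrative assumptions, not the exact production setup.
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# gpt-3.5-turbo-0125 as the evaluator model (reads OPENAI_API_KEY from the env)
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo-0125"))


class JudgeResponse(dspy.Signature):
    """Judge whether the response satisfies the annotation criterion."""

    question = dspy.InputField()
    response = dspy.InputField()
    verdict = dspy.OutputField(desc="'good' or 'bad'")


judge = dspy.ChainOfThought(JudgeResponse)


def agreement_with_human(example, prediction, trace=None):
    # 1 if the compiled judge matches the human annotation, else 0
    return int(example.verdict.strip().lower() == prediction.verdict.strip().lower())


# Human-annotated responses; in practice ~25 annotated rows per dataset were used.
raw = [
    {"question": "Is this sentence grammatical? 'The cat sat mat.'", "response": "No", "verdict": "good"},
    {"question": "Is this sentence grammatical? 'The cat sat on the mat.'", "response": "No", "verdict": "bad"},
    # ... more annotated rows ...
]
trainset = [dspy.Example(**row).with_inputs("question", "response") for row in raw]

# Search for a prompt + demos that maximize agreement with the human labels.
optimizer = BootstrapFewShotWithRandomSearch(metric=agreement_with_human, num_threads=4)
aligned_judge = optimizer.compile(judge, trainset=trainset)
```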

[Figure: Evals Quadrants]

Seeing it in action: Judge Bench

Recently, JudgeBench introduced a collection of datasets that measure how well LLMs can evaluate outputs, which is a great opportunity to test our approach. We tested it on two of the datasets. For each dataset and each of its annotation criteria, we split the data into training and testing samples. Then, we applied the new feature to the training samples to find an optimal prompt that mimics the human annotators. For all datasets, we used 25 randomly chosen training samples and report Cohen's kappa coefficient on the remaining samples, with gpt-3.5-turbo-0125 as the evaluator model.
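For reference, the evaluation protocol boils down to the following sketch. It reuses `aligned_judge` and the annotated examples from the DSPy sketch above; the dataset loading is assumed, and Cohen's kappa is computed with scikit-learn.

```python
# Sketch of the evaluation protocol: 25 random training samples per dataset,
# Cohen's kappa between the aligned eval's verdicts and the human annotations
# on the remaining samples. `examples` (annotated dspy.Example objects) and
# `aligned_judge` are assumed from the sketch above.
import random

from sklearn.metrics import cohen_kappa_score

random.seed(42)
random.shuffle(examples)
trainset, testset = examples[:25], examples[25:]

# ... compile `aligned_judge` on `trainset` as shown above ...

human_labels = [ex.verdict for ex in testset]
judge_labels = [
    aligned_judge(question=ex.question, response=ex.response).verdict
    for ex in testset
]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's kappa on {len(testset)} held-out samples: {kappa:.2f}")
```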

| Dataset | gpt-4o [JudgeBench] | Our Approach |
| --- | --- | --- |
| cola | 0.34 | 0.57 |
| Toxic Chat - Toxicity | 0.73 | 0.63 |
| Average | 0.54 | 0.60 |

To reproduce these results, follow the instructions in this fork of JudgeBench. Stay tuned for the full Judge Bench evaluation results!

How to get started?

You can get started with automatically created LLM evals by logging responses to Parea and annotating them in the UI, or by uploading a CSV file of annotations. Then trigger the creation of an LLM eval via the UI. Once it is done, you can either use the LLM eval via the API or copy it into your code to evaluate your generated text. Check out our docs to see the full workflow.
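As an illustration, once you have copied the compiled eval prompt into your code, using it can be as simple as the following sketch. The prompt text, the model choice, and the verdict parsing are placeholders; the actual prompt is whatever Parea generated from your annotations.

```python
# Illustrative sketch of running a copied LLM eval prompt in your own code.
# COMPILED_EVAL_PROMPT stands in for the prompt generated from your annotations;
# the verdict parsing below is a placeholder.
from openai import OpenAI

COMPILED_EVAL_PROMPT = """<the compiled eval prompt copied from Parea>

Question: {question}
Response: {response}
Verdict (good/bad):"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_eval(question: str, response: str) -> bool:
    """Return True if the aligned LLM eval judges the response as good."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": COMPILED_EVAL_PROMPT.format(question=question, response=response),
            }
        ],
    )
    return completion.choices[0].message.content.strip().lower().startswith("good")


print(llm_eval("Is this sentence grammatical? 'The cat sat mat.'", "No"))
```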