LLM Evaluation Metrics for Labeled Data
How to measure the performance of LLM applications with ground truth data.
Joschka Braun on Feb 13, 2024
The following is an overview of general-purpose evaluation metrics based on foundation models and fine-tuned LLMs, as well as RAG-specific evaluation metrics.
These metrics rely on ground-truth annotations (reference answers) to assess the correctness of the model response.
They were collected from the research literature and from discussions with other LLM app builders.
Python implementations or links to the models are provided where available.
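
To make "relying on a reference answer" concrete, here is a minimal sketch of two classic reference-based scores, exact match and token-level F1. They are not among the LLM-based metrics this overview covers; they only illustrate comparing a model response against ground truth. The function names and the whitespace-based normalization are assumptions chosen for illustration.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def exact_match(response: str, reference: str) -> bool:
    """True if the response equals the reference after normalization."""
    return normalize(response) == normalize(reference)


def token_f1(response: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    resp_tokens = normalize(response)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not resp_tokens or not ref_tokens:
        return float(resp_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the response and the reference.
    overlap = sum((Counter(resp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Surface-overlap scores like these penalize correct answers that are phrased differently from the reference, which is one motivation for the model-based metrics.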

