Evaluation metrics are great for quantitatively measuring the performance of your LLM application. Usually, you want to understand the “accuracy” or “quality” of your application. With Generative AI, however, it can be hard to know which evaluation metrics to use or where to start.
Parea AI makes it easy to start by providing pre-built use-case-specific evaluation metrics. These aren’t general “high-level” evals, like “toxicity” or “tone,” but metrics based on the latest academic research for use cases like RAG, Chat, Factuality, and Summarization.
However, once you have the basics down, you’ll likely want to tailor your evaluation metrics to your specific use case. With Parea, you can define your metrics in Python directly in our editor and then use those metrics on the platform or in your code.
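As a rough illustration, a custom metric can be as simple as a Python function that scores a model response. The function name and signature below are illustrative assumptions, not Parea’s actual API; they show the general shape of an input/output-based scorer that returns a value between 0.0 and 1.0.

```python
# A minimal sketch of a custom evaluation metric. The signature
# (inputs dict, output string, expected string) is an assumption
# for illustration, not Parea's real interface.
def contains_expected_answer(inputs: dict, output: str, expected: str) -> float:
    """Score 1.0 if the expected answer appears in the output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0

# Usage: score a single LLM response against a known answer.
score = contains_expected_answer(
    inputs={"question": "What is the capital of France?"},
    output="The capital of France is Paris.",
    expected="Paris",
)
print(score)  # 1.0
```

In practice you would register a function like this with the platform so it runs automatically against logs or test cases, rather than calling it by hand.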
Common evaluation use cases:
- Test Driven Development of Prompts in the Playground
- Regression testing of your entire LLM app before deploying a change
- Evaluating traces and viewing their scores in the logs
- Regression testing of prompts before deploying a change
To get started, you’ll first need a Parea API key. See Authentication for details.