Evaluation metrics are great for quantitatively measuring the performance of your LLM application. Usually, you want to understand the “accuracy” or “quality” of your application. With Generative AI, however, it can be hard to know which evaluation metrics to use or where to get started.

Parea AI makes it easy to start by providing pre-built use-case-specific evaluation metrics. These aren’t general “high-level” evals, like “toxicity” or “tone,” but metrics based on the latest academic research for use cases like RAG, Chat, Factuality, and Summarization.

Once you have the basics down, however, you’ll likely want to tailor your evaluation metrics to your specific use case. With Parea, you can define your metrics in Python directly in our editor and then use them on the platform or in your code.
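A custom metric is simply a Python function that receives the log of an LLM call and returns a score. The sketch below assumes the Parea Python SDK’s `Log` schema and `trace` decorator; the metric name and the stubbed LLM call are hypothetical, so adapt them to your application.

```python
# Minimal sketch of a custom evaluation metric, assuming the Parea Python SDK
# (`pip install parea-ai`). The metric name and stubbed LLM call are hypothetical.
from parea import trace
from parea.schemas.log import Log


def contains_target_answer(log: Log) -> float:
    """Score 1.0 if the model output contains the expected (target) answer, else 0.0."""
    output = (log.output or "").lower()
    target = (log.target or "").lower()
    return 1.0 if target and target in output else 0.0


# Attach the metric so every traced call to this function gets scored.
@trace(eval_funcs=[contains_target_answer])
def answer_question(question: str) -> str:
    # Call your LLM of choice here; stubbed for illustration.
    return "Paris is the capital of France."
```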

Common evaluation use cases include RAG, Chat, Factuality, and Summarization.

Reach out for a custom evaluation metrics consultation. We can work with you to define evaluation metrics best suited to your use case and grounded in SOTA research and best practices.

Prerequisites

  1. You’ll need a Parea API key. See Authentication to get started.
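Once you have a key, you typically initialize the SDK client with it. A minimal sketch, assuming the Parea Python SDK and a key stored in an environment variable (the variable name `PAREA_API_KEY` is a convention here):

```python
# Sketch of initializing the Parea client with your API key, assuming the Python SDK.
import os

from parea import Parea

p = Parea(api_key=os.environ["PAREA_API_KEY"])
```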

Getting Started