Evaluation metrics let you quantitatively measure the performance of your LLM application, typically its “accuracy” or “quality”.

Parea AI makes it easy to start by providing pre-built use-case-specific evaluation metrics. You can use these metrics as-is, customize them for your use case, or build your own from scratch.

Reach out for custom evaluation metrics consultation. We can work with you to define evaluation metrics best suited for your use case and grounded in SOTA research and best practices.

You can create evaluation functions on the platform and use them to benchmark prompts, run them in the playground, or attach them to traces. Alternatively, you can define them in your code and log their scores, as sketched below.
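For example, with the Parea Python SDK you can attach an evaluation function to a traced function so that its score is logged automatically. The snippet below is a minimal sketch: the Log import path and the eval_funcs argument of the trace decorator may differ between SDK versions, and exact_match and answer_question are hypothetical names.

import os

from parea import Parea, trace
from parea.schemas.log import Log  # import path may vary by SDK version

# 1.0 if the output exactly matches the target, else 0.0
def exact_match(log: Log) -> float:
    return float(log.output == log.target)

p = Parea(api_key=os.environ["PAREA_API_KEY"])

# evals attached here are scored and logged together with the trace
@trace(eval_funcs=[exact_match])
def answer_question(query: str) -> str:
    ...  # call your LLM here and return its answer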


Structure of an Evaluation Function

Example eval:
def my_eval_name(log: Log) -> float:
    return float(log.output == log.target)

To ensure that your evaluation metrics are reusable across the entire Parea ecosystem, with any LLM model or use case, we introduced the log parameter. Every evaluation function accepts a log argument, which provides all the information needed to perform an evaluation.

Note: All fields are optional; you only need to provide the fields that are relevant to your evaluation function.

Evaluation functions are expected to return floating point scores or booleans.
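For example, both of the following evaluation functions are valid; the names are hypothetical, and they rely only on the output and target fields of the Log schema documented below.

# boolean return values are also accepted
def contains_target(log: Log) -> bool:
    if log.output is None or log.target is None:
        return False
    return log.target.lower() in log.output.lower()

# graded score: fraction of target tokens that appear in the output
def target_token_overlap(log: Log) -> float:
    if not log.output or not log.target:
        return 0.0
    target_tokens = set(log.target.lower().split())
    if not target_tokens:
        return 0.0
    output_tokens = set(log.output.lower().split())
    return len(target_tokens & output_tokens) / len(target_tokens)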

Log Schema

from enum import Enum
from typing import Optional, Union

class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"
    example_user = "example_user"
    example_assistant = "example_assistant"

class Message:
    content: str
    role: Role

class ModelParams:
    temp: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_length: Optional[int] = None
    response_format: Optional[dict] = None

class LLMInputs:
    # the name of the LLM model. e.g. "gpt-4-1106-preview", "claude-2", etc.
    model: Optional[str]
    # the name of the LLM provider.
    # One of: ["openai", "azure", "anthropic",
    # "anyscale", "aws", "vertexai", "openrouter"]
    provider: Optional[str]
    # the model specific parameters for the LLM call
    model_params: Optional[ModelParams]
    # the prompts that make up the LLM call,
    # e.g. [{"role": "user", "content": "What is the capital of France?"}]
    messages: Optional[list[Message]]
    # a list of function call JSON schemas following OpenAI format
    functions: Optional[list[dict[str, str]]]
    # the name of the function the LLM should call, or "auto".
    # e.g. {"name": "current_weather"} or "auto"
    function_call: Optional[Union[str, dict[str, str]]]

class Log:
    # all the parameters sent to the LLM provider
    configuration: Optional[LLMInputs]
    # The key-value pairs representing an input name
    # and the corresponding value,
    # e.g. {"query": "What is the capital of France?"}
    inputs: Optional[dict[str, str]]
    # The output of the LLM call
    output: Optional[str]
    # The target/ground truth value for the LLM call
    target: Optional[str]
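Because log also carries the call configuration and inputs, evaluation functions are not limited to comparing output against target. The following sketch is purely illustrative (both function names are hypothetical); it uses only fields defined in the schema above and guards against missing values.

# fraction of input values that are echoed somewhere in the output
def mentions_all_inputs(log: Log) -> float:
    if not log.inputs or not log.output:
        return 0.0
    output_lower = log.output.lower()
    mentioned = sum(1 for value in log.inputs.values() if value.lower() in output_lower)
    return mentioned / len(log.inputs)

# 1.0 if the output stays within the configured max_length
# (word count is used as a rough proxy for tokens)
def within_max_length(log: Log) -> float:
    params = log.configuration.model_params if log.configuration else None
    if params is None or params.max_length is None or log.output is None:
        return 1.0
    return float(len(log.output.split()) <= params.max_length)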