Joschka Braun on Jul 11, 2024

Evaluating generated text is very hard. Once assertion-based evaluation no longer suffices, one is left to either have subject-matter experts manually review responses or to use LLMs to evaluate them. The former is very expensive and slow, which limits the number of experiments one can run and thus the insights one can gain. The latter, while fast and comparatively cheap, is not necessarily aligned with domain-expert judgement (cf. JudgeBench) and can thus lead to unreliable results.

To alleviate these limitations, we introduce aligned, self-improving LLM evals. Using manually annotated responses, an LLM eval is created that imitates the human review. In addition, this feature reduces the annotation workload for subject-matter experts and increases the iteration speed of LLM teams. This work is greatly inspired by Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences by Shreya Shankar, J.D. Zamfirescu-Pereira, Ian Arawjo, and others, and it leverages DSPy to automatically create the LLM evals.
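To make this concrete, below is a minimal sketch (not our exact implementation) of how an LLM eval can be compiled with DSPy against human-annotated responses. The signature fields, the agreement metric, and the optimizer settings are illustrative assumptions; the idea is simply that DSPy searches for a prompt (and few-shot demos) whose verdicts agree with the human annotations.

```python
# Minimal sketch: compile an LLM eval with DSPy so that its judgements
# imitate human annotations. Field names, metric, and optimizer settings
# are illustrative assumptions, not the exact production setup.
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# gpt-3.5-turbo-0125 as the evaluator model (reads OPENAI_API_KEY from the env)
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo-0125"))


class JudgeResponse(dspy.Signature):
    """Judge whether the response satisfies the annotation criterion."""

    question = dspy.InputField()
    response = dspy.InputField()
    verdict = dspy.OutputField(desc="'good' or 'bad'")


judge = dspy.ChainOfThought(JudgeResponse)


def agreement_with_human(example, prediction, trace=None):
    # 1 if the compiled judge matches the human annotation, else 0
    return int(example.verdict.strip().lower() == prediction.verdict.strip().lower())


# Human-annotated responses; in practice ~25 annotated rows per dataset were used.
raw = [
    {"question": "Is this sentence grammatical? 'The cat sat mat.'", "response": "No", "verdict": "good"},
    {"question": "Is this sentence grammatical? 'The cat sat on the mat.'", "response": "No", "verdict": "bad"},
    # ... more annotated rows ...
]
trainset = [dspy.Example(**row).with_inputs("question", "response") for row in raw]

# Search for a prompt + demos that maximize agreement with the human labels.
optimizer = BootstrapFewShotWithRandomSearch(metric=agreement_with_human, num_threads=4)
aligned_judge = optimizer.compile(judge, trainset=trainset)
```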

[Figure: Evals Quadrants]

Seeing it in action: Judge Bench

Recently, JudgeBench introduced a collection of datasets that measure how well LLMs can evaluate outputs, which is a great opportunity to test our approach. We tested it on two of the datasets. For each dataset and each of its annotation criteria, we split the data into training and testing samples. Then, we applied the new feature to the training samples to find an optimal prompt that mimics the human annotators. For all datasets, we used 25 randomly chosen training samples and report Cohen's kappa coefficient on the remaining samples, with gpt-3.5-turbo-0125 as the evaluator model.
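For reference, the evaluation protocol boils down to the following sketch. It reuses `aligned_judge` and the annotated examples from the DSPy sketch above; the dataset loading is assumed, and Cohen's kappa is computed with scikit-learn.

```python
# Sketch of the evaluation protocol: 25 random training samples per dataset,
# Cohen's kappa between the aligned eval's verdicts and the human annotations
# on the remaining samples. `examples` (annotated dspy.Example objects) and
# `aligned_judge` are assumed from the sketch above.
import random

from sklearn.metrics import cohen_kappa_score

random.seed(42)
random.shuffle(examples)
trainset, testset = examples[:25], examples[25:]

# ... compile `aligned_judge` on `trainset` as shown above ...

human_labels = [ex.verdict for ex in testset]
judge_labels = [
    aligned_judge(question=ex.question, response=ex.response).verdict
    for ex in testset
]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's kappa on {len(testset)} held-out samples: {kappa:.2f}")
```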

| Dataset | gpt-4o [JudgeBench] | Our Approach |
| --- | --- | --- |
| cola | 0.34 | 0.57 |
| Toxic Chat - Toxicity | 0.73 | 0.63 |
| Average | 0.54 | 0.60 |

To reproduce these results, follow the instructions in this fork of JudgeBench. Stay tuned for the full Judge Bench evaluation results!

How to get started?

You can get started with automatically created LLM evals by logging responses to Parea and annotating them in the UI, or by uploading a CSV file of annotations. Then trigger the creation of an LLM eval via the UI. Once it is done, you can either use the LLM eval via the API or copy it into your code to evaluate your generated text. Check out our docs to see the full workflow.
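As an illustration, once you have copied the compiled eval prompt into your code, using it can be as simple as the following sketch. The prompt text, the model choice, and the verdict parsing are placeholders; the actual prompt is whatever Parea generated from your annotations.

```python
# Illustrative sketch of running a copied LLM eval prompt in your own code.
# COMPILED_EVAL_PROMPT stands in for the prompt generated from your annotations;
# the verdict parsing below is a placeholder.
from openai import OpenAI

COMPILED_EVAL_PROMPT = """<the compiled eval prompt copied from Parea>

Question: {question}
Response: {response}
Verdict (good/bad):"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_eval(question: str, response: str) -> bool:
    """Return True if the aligned LLM eval judges the response as good."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": COMPILED_EVAL_PROMPT.format(question=question, response=response),
            }
        ],
    )
    return completion.choices[0].message.content.strip().lower().startswith("good")


print(llm_eval("Is this sentence grammatical? 'The cat sat mat.'", "No"))
```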