After testing changes to your LLM app offline, it can be beneficial to run A/B tests to further validate the updates. In this cookbook, we will use an email generation example to demonstrate how to run A/B tests with Parea’s SDK. Running an A/B test involves three steps: routing requests randomly to the variants, capturing the associated feedback, and analyzing the results.

Sample app: email generation

In the current version of our email generator, we instruct the LLM to generate long emails. With the A/B test, we will assess the effect of changing the prompt to generate short emails instead.

Route requests randomly to the variants

To execute the A/B test, called long-vs-short-emails, we will randomly choose to generate either a long email (variant_0, the control group) or a short email (variant_1, the treatment group). Then, we tag the trace with the A/B test name and the chosen variant via trace_insert. Finally, we return the email, the trace_id, and the chosen variant; we need the latter two to associate any feedback with the corresponding variant.

import os
import random
from typing import Tuple

from openai import OpenAI
from parea import Parea, get_current_trace_id, trace, trace_insert

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
p = Parea(api_key=os.getenv("PAREA_API_KEY"))
p.wrap_openai_client(client)  # wrap the OpenAI client so LLM calls are captured in the trace

ab_test_name = 'long-vs-short-emails'

@trace  # decorator to trace functions with Parea
def generate_email(user: str) -> Tuple[str, str, str]:
    # randomly choose to generate a long or short email
    if random.random() < 0.5:
        variant = 'variant_0'
        prompt = f"Generate a long email for {user}"
    else:
        variant = 'variant_1'
        prompt = f"Generate a short email for {user}"
    # tag the requests with the A/B test name & chosen variant
    trace_insert(
        {
            "metadata": {
                "ab_test_name": ab_test_name,
                f"ab_test_{ab_test_name}": variant,
            }
        }
    )

    email = (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
        )
        .choices[0].message.content
    )
    # return the trace_id and the chosen variant in addition to the email
    return email, get_current_trace_id(), variant
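
For illustration, here is a minimal sketch of how the application layer could use generate_email and keep the returned trace_id and variant around until feedback arrives. The send_to_user helper and the pending_feedback mapping are hypothetical placeholders, not part of Parea’s SDK:

# hypothetical wiring: remember trace_id & variant per sent email so that
# feedback arriving later can be attributed to the right variant
pending_feedback = {}  # email_id -> (trace_id, variant)

email, trace_id, variant = generate_email("Jane Doe")
email_id = send_to_user(email)  # hypothetical delivery helper returning an id
pending_feedback[email_id] = (trace_id, variant)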

Capture the feedback

Now that different requests are routed to different variants, we need to capture the associated feedback. Such feedback could be whether the email got a reply or led to booking a meeting (e.g., in the case of sales automation), or whether the user gave a thumbs up or thumbs down (e.g., in the case of an email assistant). To do that, we will use the low-level update_log function of parea_logger to update the trace with the collected feedback as a score.

from parea import parea_logger, UpdateLog, EvaluationResult

def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str) -> None:
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map={
                "scores": [
                    EvaluationResult(
                        name=f"ab_test_{ab_test_variant}",
                        score=feedback,
                        reason="any additional user feedback on why it's good/bad"
                    )
                ],
            }
        )
    )
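
For example, when the user gives a thumbs up or thumbs down (or a reply/meeting event fires), you can look up the stored trace_id and variant and forward a numeric score. Below is a minimal sketch that reuses the hypothetical pending_feedback mapping from above:

def on_user_reaction(email_id: str, thumbs_up: bool) -> None:
    # map the binary reaction to a score and attach it to the original trace
    trace_id, variant = pending_feedback.pop(email_id)
    capture_feedback(feedback=1.0 if thumbs_up else 0.0, trace_id=trace_id, ab_test_variant=variant)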

Analyzing the results

Once the A/B test is live, we can check the results in the dashboard by filtering the logs for the metadata key ab_test_name with the value long-vs-short-emails.

Great, we can see that variant_1 (short emails) performs a lot better than variant_0 (long emails)! Check out the full code below to see why this variant performs better. Note: despite the clearly higher score, never forget to LOOK AT YOUR LOGS to understand what’s happening!
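
If you also want a quick sanity check outside the dashboard, a small helper like the one below (not part of Parea’s SDK) can aggregate the scores you collected per variant and compare their means:

from collections import defaultdict
from typing import Dict, List, Tuple

def summarize_ab_test(results: List[Tuple[str, float]]) -> Dict[str, float]:
    # results: (variant, score) pairs collected alongside capture_feedback
    scores: Dict[str, List[float]] = defaultdict(list)
    for variant, score in results:
        scores[variant].append(score)
    return {variant: sum(vals) / len(vals) for variant, vals in scores.items()}

# e.g. summarize_ab_test([("variant_0", 0.0), ("variant_1", 1.0), ("variant_1", 1.0)])
# -> {'variant_0': 0.0, 'variant_1': 1.0}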

Bonus: capture user corrections

If your application is interactive and enables the user to correct the generated email, you should capture the correction and add it to a dataset. This dataset of user-corrected emails will be very useful for future evals of your LLM app and also opens the door to fine-tuning your own models. You can capture the correction in Parea by sending it as target in the update_log function:

from typing import Optional

def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str, user_corrected_email: Optional[str] = None) -> None:
    field_name_to_value_map = {
        "scores": [
            EvaluationResult(
                name=f"ab_test_{ab_test_variant}",
                score=feedback,
                reason="any additional user feedback on why it's good/bad"
            )
        ],
    }
    if user_corrected_email:
        field_name_to_value_map["target"] = user_corrected_email

    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map=field_name_to_value_map,
        )
    )
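
As a usage example, if the user edits the draft before sending it, you can forward the edited version along with a score. The email_id, edited_email, and pending_feedback names below are hypothetical placeholders from the earlier sketch:

trace_id, variant = pending_feedback.pop(email_id)
capture_feedback(
    feedback=0.0,  # the draft needed editing, so score it low
    trace_id=trace_id,
    ab_test_variant=variant,
    user_corrected_email=edited_email,  # the user’s edited version becomes the target
)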

When you open the trace, you will see the user-corrected email in the target field. After reviewing it, you can add it to a dataset by clicking the Add to dataset button or pressing Cmd + D.

Conclusion

In this cookbook, we demonstrated how to run A/B tests to optimize your LLM app based on user feedback. To recap, you need to route requests to the variants you want to test, capture the associated feedback, and analyze the results in the dashboard. If you are able to capture corrections from your users, it is strongly recommended to add them to a dataset for future evaluation.