After testing changes to your LLM app offline, it can be beneficial to run A/B tests to further validate the updates. In this cookbook, we will use an email generation example to demonstrate how to run A/B tests with Parea’s SDK. Running an A/B test involves three steps: routing requests randomly to the variants, capturing the associated feedback, and analyzing the results.

Sample app: email generation

In the current version of our email generator, we instruct the LLM to generate long emails. With the A/B test, we will assess the effect of changing the prompt to generate short emails instead.

Route requests randomly to the variants

To execute the A/B test, called long-vs-short-emails, we will randomly choose to generate either a long email (variant_0, the control group) or a short email (variant_1, the treatment group). Then, we tag the trace with the A/B test name and the chosen variant via trace_insert. Finally, we return the email, the trace_id, and the chosen variant; we need the latter two to associate any feedback with the corresponding variant.

import os
import random
from typing import Tuple

from openai import OpenAI
from parea import Parea, get_current_trace_id, trace, trace_insert

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
p = Parea(api_key=os.getenv("PAREA_API_KEY"))
p.wrap_openai_client(client)  # wrap the OpenAI client so LLM calls are captured in the trace

ab_test_name = 'long-vs-short-emails'

@trace  # decorator to trace functions with Parea
def generate_email(user: str) -> Tuple[str, str, str]:
    # randomly choose to generate a long or short email
    if random.random() < 0.5:
        variant = 'variant_0'
        prompt = f"Generate a long email for {user}"
    else:
        variant = 'variant_1'
        prompt = f"Generate a short email for {user}"
    # tag the requests with the A/B test name & chosen variant
    trace_insert(
        {
            "metadata": {
                "ab_test_name": ab_test_name,
                f"ab_test_{ab_test_name}": variant,
            }
        }
    )

    email = (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
        )
        .choices[0].message.content
    )
    # return the trace_id and the chosen variant in addition to the email
    return email, get_current_trace_id(), variant
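
For illustration, here is a minimal sketch of how the application layer could use generate_email and keep the returned trace_id and variant around until feedback arrives. The send_to_user helper and the pending_feedback mapping are hypothetical placeholders, not part of Parea’s SDK:

# hypothetical wiring: remember trace_id & variant per sent email so that
# feedback arriving later can be attributed to the right variant
pending_feedback = {}  # email_id -> (trace_id, variant)

email, trace_id, variant = generate_email("Jane Doe")
email_id = send_to_user(email)  # hypothetical delivery helper returning an id
pending_feedback[email_id] = (trace_id, variant)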

Capture the feedback

Now that different requests are routed to different variants, we need to capture the associated feedback. Such feedback could be whether the email got a reply or led to booking a meeting (e.g., in the case of sales automation), or whether the user gave a thumbs up or thumbs down (e.g., in the case of an email assistant). To do that, we will use the low-level update_log function of parea_logger to update the trace with the collected feedback as a score.

from parea import parea_logger, UpdateLog, EvaluationResult

def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str) -> None:
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map={
                "scores": [
                    EvaluationResult(
                        name=f"ab_test_{ab_test_variant}",
                        score=feedback,
                        reason="any additional user feedback on why it's good/bad"
                    )
                ],
            }
        )
    )
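
For example, when the user gives a thumbs up or thumbs down (or a reply/meeting event fires), you can look up the stored trace_id and variant and forward a numeric score. Below is a minimal sketch that reuses the hypothetical pending_feedback mapping from above:

def on_user_reaction(email_id: str, thumbs_up: bool) -> None:
    # map the binary reaction to a score and attach it to the original trace
    trace_id, variant = pending_feedback.pop(email_id)
    capture_feedback(feedback=1.0 if thumbs_up else 0.0, trace_id=trace_id, ab_test_variant=variant)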

Analyzing the results

Once the A/B test is live, we can check the results in the dashboard by filtering the logs for the metadata key ab_test_name with the value long-vs-short-emails.

Great, we can see that variant_1 (short emails) performs a lot better than variant_0 (long emails)! Check out the full code below to see why this variant performs better. Note: despite the clearly higher score, never forget to LOOK AT YOUR LOGS to understand what’s happening!
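
If you also want a quick sanity check outside the dashboard, a small helper like the one below (not part of Parea’s SDK) can aggregate the scores you collected per variant and compare their means:

from collections import defaultdict
from typing import Dict, List, Tuple

def summarize_ab_test(results: List[Tuple[str, float]]) -> Dict[str, float]:
    # results: (variant, score) pairs collected alongside capture_feedback
    scores: Dict[str, List[float]] = defaultdict(list)
    for variant, score in results:
        scores[variant].append(score)
    return {variant: sum(vals) / len(vals) for variant, vals in scores.items()}

# e.g. summarize_ab_test([("variant_0", 0.0), ("variant_1", 1.0), ("variant_1", 1.0)])
# -> {'variant_0': 0.0, 'variant_1': 1.0}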

Bonus: capture user corrections

If your application is interactive and enables the user to correct the generated email, you should capture the correction and add it to a dataset. This dataset of user-corrected emails will be very useful for future evals of your LLM app and also opens the door to fine-tuning your own models. You can capture the correction in Parea by sending it as target in the update_log function:

from typing import Optional

def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str, user_corrected_email: Optional[str] = None) -> None:
    field_name_to_value_map = {
        "scores": [
            EvaluationResult(
                name=f"ab_test_{ab_test_variant}",
                score=feedback,
                reason="any additional user feedback on why it's good/bad"
            )
        ],
    }
    if user_corrected_email:
        field_name_to_value_map["target"] = user_corrected_email

    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map=field_name_to_value_map,
        )
    )
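
As a usage example, if the user edits the draft before sending it, you can forward the edited version along with a score. The email_id, edited_email, and pending_feedback names below are hypothetical placeholders from the earlier sketch:

trace_id, variant = pending_feedback.pop(email_id)
capture_feedback(
    feedback=0.0,  # the draft needed editing, so score it low
    trace_id=trace_id,
    ab_test_variant=variant,
    user_corrected_email=edited_email,  # the user’s edited version becomes the target
)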

When you open the trace, you will see the user-corrected email in the target field. After reviewing it, you can add it to a dataset by clicking the Add to dataset button or pressing Cmd + D.

Conclusion

In this cookbook, we demonstrated how to run A/B tests to optimize your LLM app based on user feedback. To recap, you need to route requests to the variants you want to test, capture the associated feedback, and analyze the results in the dashboard. If you are able to capture corrections from your users, it is strongly recommended to add them to a dataset for future evaluation.