After testing changes to your LLM app offline, it can be beneficial to run A/B tests to further validate the updates. In this cookbook, we will use an email generation example to demonstrate how to run A/B tests with Parea’s SDK. Running the A/B test involves three steps: routing requests randomly to the variants, capturing the associated feedback, and analyzing the results.

Sample app: email generation

In the current version of our email generator, we instruct the LLM to generate long emails. With our A/B test, we will assess the effect of changing the prompt to generate short emails instead.
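For reference, a minimal sketch of such a generator is shown below. It assumes the OpenAI Python SDK; the model name, prompts, and function signature are illustrative only.

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# The current (control) behavior: a system prompt that asks for long emails.
LONG_EMAIL_PROMPT = "You are an expert salesperson. Write detailed, long outreach emails."


def generate_email(recipient: str, context: str, system_prompt: str = LONG_EMAIL_PROMPT) -> str:
    # A single chat completion call; the system prompt controls the email length.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Write an outreach email to {recipient}. Context: {context}"},
        ],
    )
    return response.choices[0].message.content
```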

Route requests randomly to the variants

To execute the A/B test, called long-vs-short-emails, we will randomly choose to generate either a long email (variant_0, the control group) or a short email (variant_1, the treatment group). We will then tag the trace with the A/B test name and the chosen variant via trace_insert. Finally, we will return the email, the trace_id, and the chosen variant; we need the latter two to associate any feedback with the corresponding variant.
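A sketch of this routing logic, building on the generate_email function above, could look as follows. The Parea import paths and the trace_insert / get_current_trace_id helpers reflect the SDK at the time of writing and may differ in your version.

```python
import os
import random

from parea import Parea, trace
from parea.utils.trace_utils import get_current_trace_id, trace_insert

# Initialize Parea so that traced calls are logged to your project.
p = Parea(api_key=os.getenv("PAREA_API_KEY"))

AB_TEST_NAME = "long-vs-short-emails"
# The treatment behavior: a system prompt that asks for short emails.
SHORT_EMAIL_PROMPT = "You are an expert salesperson. Write concise, short outreach emails."


@trace  # creates a trace per call so we can attach metadata now and feedback later
def generate_email_ab_test(recipient: str, context: str) -> dict:
    # Randomly assign the request to control (long emails) or treatment (short emails).
    variant = "variant_1" if random.random() < 0.5 else "variant_0"

    # Tag the trace so logs can be filtered by A/B test name and variant in the dashboard.
    trace_insert({"metadata": {"ab_test_name": AB_TEST_NAME, "ab_test_variant": variant}})

    system_prompt = SHORT_EMAIL_PROMPT if variant == "variant_1" else LONG_EMAIL_PROMPT
    email = generate_email(recipient, context, system_prompt=system_prompt)

    # Return trace_id and variant so that later feedback can be attributed to the right variant.
    return {"email": email, "trace_id": get_current_trace_id(), "variant": variant}
```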

Capture the feedback

Now that different requests are routed to different variants, we need to capture the associated feedback. Such feedback could be whether the email got a reply or led to a booked meeting (e.g., in the case of sales automation), or whether the user gave a thumbs up or thumbs down (e.g., in the case of an email assistant). To do that, we will use the low-level update_log function of parea_logger to update the trace with the collected feedback as a score.
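A sketch of such a feedback handler is shown below. The UpdateLog and EvaluationResult schema names and the field_name_to_value_map argument are assumptions based on the Parea SDK at the time of writing; check your SDK version for the exact signatures.

```python
from parea.parea_logger import parea_logger
from parea.schemas import EvaluationResult, UpdateLog  # import paths may differ per SDK version


def capture_feedback(score: float, trace_id: str) -> None:
    """Attach user feedback (e.g. 1.0 = reply / thumbs up, 0.0 = no reply / thumbs down) to a trace."""
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map={
                "scores": [EvaluationResult(name="user_feedback", score=score)],
            },
        )
    )
```

In the email assistant case, for example, you would call capture_feedback with the trace_id returned by generate_email_ab_test as soon as the user clicks thumbs up or thumbs down.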

Analyzing the results

Once the A/B test is live, we can check the results in the dashboard by filtering the logs for the metadata key ab_test_name being long-vs-short-emails.

Great, we can see that variant_1 (short emails) performs a lot better than variant_0 (long emails)! Check out the full code below to see why this variant performs better. Note: despite the clearly higher score, never forget to LOOK AT YOUR LOGS to understand what’s happening!

Bonus: capture user corrections

If your application is interactive and lets the user correct the generated email, you should capture the correction and add it to a dataset. This dataset of user-corrected emails will be very useful for any future evals of your LLM app and opens the door to fine-tuning your own models. You can capture the correction in Parea by sending it as target in the update_log function:
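Below is a minimal sketch, with the same caveat that the UpdateLog schema and import paths are assumptions based on the SDK at the time of writing:

```python
from parea.parea_logger import parea_logger
from parea.schemas import UpdateLog  # import path may differ per SDK version


def capture_correction(corrected_email: str, trace_id: str) -> None:
    # Store the user-corrected email as the target (ground truth) of the logged trace.
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map={"target": corrected_email},
        )
    )
```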

When you open the trace, you will see the user-corrected email in the target field. After reviewing it, you can add it to a dataset by clicking the Add to dataset button or pressing Cmd + D.

Conclusion

In this cookbook, we demonstrated how to run A/B tests to optimize your LLM app based on user feedback. To recap, you need to route requests to the variants you want to test, capture the associated feedback, and analyze the results in the dashboard. If you are able to capture corrections from your users, it is strongly recommended to add them to a dataset for future evaluation.