Leverage user feedback to run A/B tests of prompts, models & other approaches
After testing changes to your LLM app offline, it can be beneficial to run A/B tests to further validate the updates.
In this cookbook, we will use an email generation example to demonstrate how to run A/B tests with Parea’s SDK.
Running the A/B test involves three steps: randomly routing requests between the two variants, capturing the associated user feedback, and analyzing the results.
In the current version of our email generator, we instruct the LLM to generate long emails.
With our A/B test, we will assess the effect of changing the prompt to generate short emails.
To instrument our application, we use wrap_openai_client (patchOpenai in TypeScript) to automatically trace any LLM calls made by the OpenAI client, and the trace decorator to capture the inputs and outputs of the generate_email function.
from openai import OpenAI

from parea import Parea, trace

client = OpenAI()
p = Parea(api_key="<<PAREA_API_KEY>>")
# wrap OpenAI client to trace LLM calls
p.wrap_openai_client(client)


# use @trace to capture inputs & outputs of your function
# and create nested traces
@trace
def generate_email(user: str) -> str:
    prompt = f"Generate a long email for {user}"
    return (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        .choices[0]
        .message.content
    )
In our A/B test, we will measure the effect of changing the prompt to generate short emails instead of long emails.
To execute the A/B test, called long-vs-short-emails, we will randomly choose to generate a long email (variant_0, control group) or a short email (variant_1, treatment group).
Then, we will tag the trace with the A/B test name and the chosen variant via trace_insert.
Finally, we will return the email, the trace_id, and the chosen variant.
We need to return the latter two in order to associate any feedback with the corresponding variant.
import random
from typing import Tuple

from parea import get_current_trace_id, trace_insert

ab_test_name = 'long-vs-short-emails'


@trace  # decorator to trace functions with Parea
def generate_email(user: str) -> Tuple[str, str, str]:
    # randomly choose to generate a long or short email
    if random.random() < 0.5:
        variant = 'variant_0'
        prompt = f"Generate a long email for {user}"
    else:
        variant = 'variant_1'
        prompt = f"Generate a short email for {user}"
    # tag the request with the A/B test name & chosen variant
    trace_insert(
        {
            "metadata": {
                "ab_test_name": ab_test_name,
                f"ab_test_{ab_test_name}": variant,
            }
        }
    )
    email = (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        .choices[0]
        .message.content
    )
    # in addition to the email, return the trace_id and the chosen variant
    return email, get_current_trace_id(), variant
Now that different requests are routed to different variants, we need to capture the associated feedback.
Such feedback could be whether the email got a reply or led to a booked meeting (e.g., for sales automation), or whether the user gave a thumbs up or thumbs down (e.g., for an email assistant).
To do that, we will use the low-level update_log function of parea_logger to update the trace with the collected feedback as a score.
from parea import parea_logger
from parea.schemas import UpdateLog, EvaluationResult


def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str) -> None:
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map={
                "scores": [
                    EvaluationResult(
                        name=f"ab_test_{ab_test_variant}",
                        score=feedback,
                        reason="any additional user feedback on why it's good/bad",
                    )
                ],
            },
        )
    )
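For illustration only, here is a minimal sketch of how an application might wire the two functions together; the user name and the thumbs_up flag are hypothetical stand-ins for whatever your UI or downstream system provides:

# hypothetical glue code: generate an email, then report the user's reaction
email, trace_id, ab_test_variant = generate_email("Jane Doe")  # placeholder user

# assume the UI returns a boolean thumbs_up; map it to a 1.0 / 0.0 score
thumbs_up = True
capture_feedback(
    feedback=1.0 if thumbs_up else 0.0,
    trace_id=trace_id,
    ab_test_variant=ab_test_variant,
)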
Once the A/B test is live, we can check the results in the dashboard by filtering the logs for the metadata key ab_test_name with the value long-vs-short-emails.
Great, we can see that variant_1 (short emails) performs a lot better than variant_0 (long emails)!
Check out the full code below to see why this variant performs better.
Note: despite the clearly higher score, never forget to LOOK AT YOUR LOGS to understand what’s happening!
If your application is interactive and enables the user to correct the generated email, you should capture the correction and add it to a dataset.
This dataset of user-corrected emails will be very useful for future evals of your LLM app and opens the door to fine-tuning your own models.
You can capture the correction in Parea by sending it as target in the update_log function, as sketched below.
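A minimal sketch, based on the full working code at the end of this cookbook: capture_feedback takes an optional user_corrected_email argument and, when present, stores it as the target of the trace.

def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str, user_corrected_email: str = None) -> None:
    field_name_to_value_map = {
        "scores": [
            EvaluationResult(
                name=f"ab_test_{ab_test_variant}",
                score=feedback,
                reason="any additional user feedback on why it's good/bad",
            )
        ],
    }
    # if the user corrected the email, store it as the target of the trace
    if user_corrected_email:
        field_name_to_value_map["target"] = user_corrected_email
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map=field_name_to_value_map,
        )
    )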
When you open the trace, you will see the user-corrected email in the target field.
After reviewing it, you can add it to a dataset by clicking on the Add to dataset button or pressing Cmd + D.
Below you can see the full working code for the A/B test.
You can also find it in our Python and TypeScript SDK cookbooks.
from typing import Tuple
import os
import random

from openai import OpenAI

from parea import Parea, get_current_trace_id, trace, trace_insert, parea_logger
from parea.schemas import UpdateLog, EvaluationResult

client = OpenAI()
# instantiate Parea client
p = Parea(api_key=os.getenv("PAREA_API_KEY"))
# wrap OpenAI client to trace calls
p.wrap_openai_client(client)

ab_test_name = 'long-vs-short-emails'


@trace  # decorator to trace functions with Parea
def generate_email(user: str) -> Tuple[str, str, str]:
    # randomly choose to generate a long or short email
    if random.random() < 0.5:
        variant = 'variant_0'
        prompt = f"Generate a long email for {user}"
    else:
        variant = 'variant_1'
        prompt = f"Generate a short email for {user}"
    # tag the request with the A/B test name & chosen variant
    trace_insert(
        {
            "metadata": {
                "ab_test_name": ab_test_name,
                f"ab_test_{ab_test_name}": variant,
            }
        }
    )
    email = (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        .choices[0]
        .message.content
    )
    # in addition to the email, return the trace_id and the chosen variant
    return email, get_current_trace_id(), variant


def capture_feedback(feedback: float, trace_id: str, ab_test_variant: str, user_corrected_email: str = None) -> None:
    field_name_to_value_map = {
        "scores": [
            EvaluationResult(
                name=f"ab_test_{ab_test_variant}",
                score=feedback,
                reason="any additional user feedback on why it's good/bad",
            )
        ],
    }
    # if the user corrected the email, store it as the target of the trace
    if user_corrected_email:
        field_name_to_value_map["target"] = user_corrected_email
    parea_logger.update_log(
        UpdateLog(
            trace_id=trace_id,
            field_name_to_value_map=field_name_to_value_map,
        )
    )


def main():
    # generate email and get trace ID
    email, trace_id, ab_test_variant = generate_email("Max Mustermann")
    # simulate user feedback that is biased in favor of shorter emails
    if ab_test_variant == 'variant_1':
        user_feedback = 1.0 if random.random() < 0.7 else 0.0
    else:
        user_feedback = 1.0 if random.random() < 0.3 else 0.0
    capture_feedback(user_feedback, trace_id, ab_test_variant, "Hi Max")


if __name__ == "__main__":
    main()