How to detect unreliable behavior of LLM apps
LLM logs are useful, but it's hard to prioritize which production logs to review. High-entropy responses are a good starting point.
Joschka Braun on Aug 6, 2024
Sixfold builds a risk assessment AI solution for insurance underwriters. Recently, we talked to their team, who came up with a simple yet powerful way to assess the reliability of LLM apps using Parea. This blog post describes that approach in more detail.
Idea: Identify High Entropy Outputs
The idea is to automatically flag any logs with high-entropy responses. To do that, you can simply rerun your LLM app on that particular sample and measure the difference between the two outputs. The larger their distance, the more likely it is an input on which your LLM app is unreliable. The sections below discuss which distance metrics are useful for measuring the difference between two responses.
The easiest way to build intuition for this is to think about prompt engineering. How often have you rerun a prompt and noticed that it only works half of the time? When that happens, you are dealing with high-entropy (uncertain) responses, because the prompt isn't reliable for this input.
Implementation Using Parea
To see how easily this workflow can be implemented with Parea, let's look at the following example. We have a function llm_app which we want to test for high-entropy responses. In our evaluation function is_unreliable_input, we rerun llm_app and compare the new output with the original one. The benefit of doing this via trace is that the eval is executed in the background (so it adds no latency) and we can choose to apply it to only a fraction of the cases (here 10%).
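Below is a minimal sketch of what this can look like. It assumes the Parea Python SDK's trace decorator accepts eval_funcs and apply_eval_frac parameters and passes a Log object (with inputs and output fields) to the eval function; the import paths, the output_distance helper, and the body of llm_app are placeholders for illustration.

```python
from parea import Parea, trace
from parea.schemas.log import Log  # import path may differ by SDK version

p = Parea(api_key="YOUR_PAREA_API_KEY")  # placeholder API key

def output_distance(original: str, rerun: str) -> float:
    # Placeholder: swap in one of the metrics from the sections below.
    return 0.0 if original == rerun else 1.0

def is_unreliable_input(log: Log) -> float:
    # Rerun the app on the logged inputs and compare the fresh output
    # with the original one; a larger distance means a less reliable input.
    new_output = llm_app(**log.inputs)
    return output_distance(log.output, new_output)

@trace(eval_funcs=[is_unreliable_input], apply_eval_frac=0.1)  # eval ~10% of calls in the background
def llm_app(question: str) -> str:
    # Placeholder for your actual LLM call.
    return f"Answer to: {question}"
```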
Metric 1: Levenshtein Distance
A simple way to measure the difference between two strings is to calculate the Levenshtein distance. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. We can use it in our eval function in the following way:
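For illustration, here is a sketch of such an eval using a pure-Python Levenshtein implementation, normalized by the length of the longer output so the score lands between 0 and 1; llm_app and the Log object follow the setup above.

```python
from parea.schemas.log import Log  # import path may differ by SDK version

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_unreliable_input(log: Log) -> float:
    new_output = llm_app(**log.inputs)  # llm_app as defined above
    longest = max(len(log.output), len(new_output)) or 1
    # Normalize to [0, 1]: 0 = identical outputs, 1 = completely different.
    return levenshtein(log.output, new_output) / longest
```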
Metric 2: Semantic Similarity
The drawback of traditional NLP metrics such as the Levenshtein distance is that they cannot capture the semantic similarity between two strings. Instead, we can embed both outputs as vectors, calculate their cosine similarity, and use one minus that similarity as our distance metric.
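Here is a sketch of what that could look like with the OpenAI embeddings API; the model choice is an assumption, and llm_app is reused from the setup above. The eval returns one minus the cosine similarity, so higher values again mean a less reliable input.

```python
import numpy as np
from openai import OpenAI
from parea.schemas.log import Log  # import path may differ by SDK version

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_unreliable_input(log: Log) -> float:
    new_output = llm_app(**log.inputs)  # llm_app as defined above
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=[log.output, new_output],
    )
    original_emb = np.array(resp.data[0].embedding)
    rerun_emb = np.array(resp.data[1].embedding)
    # 1 - cosine similarity: 0 = semantically identical, higher = more different.
    return 1.0 - cosine_similarity(original_emb, rerun_emb)
```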
Metric 3: LLM Judge
Embedding models give a general sense of how closely related the two outputs are. Yet oftentimes you care about specific aspects of the outputs. In that scenario, you should use an LLM judge and instruct it to compare the two outputs along those dimensions.
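As an illustration, the judge below asks a chat model to rate how closely the rerun matches the original along two example dimensions and converts that rating into a distance; the prompt, the dimensions, and the judge model are assumptions, not a prescribed setup.

```python
from openai import OpenAI
from parea.schemas.log import Log  # import path may differ by SDK version

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You compare two responses from the same AI application.
Rate on a scale from 0 to 10 how closely Response B matches Response A
in terms of factual claims and recommended actions. Reply with only the number.

Response A:
{original}

Response B:
{rerun}"""

def is_unreliable_input(log: Log) -> float:
    new_output = llm_app(**log.inputs)  # llm_app as defined above
    judgment = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(original=log.output, rerun=new_output),
        }],
    )
    score = float(judgment.choices[0].message.content.strip())
    # Convert the 0-10 similarity rating into a 0-1 distance.
    return 1.0 - score / 10.0
```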
Note that the SelfCheckGPT paper explores an adjacent approach for measuring the factuality of LLM responses.
Conclusion
A key aspect of improving your LLM app is ensuring it reliably handles the long tail of use cases. One general way to identify unreliable samples is to rerun your LLM app on these inputs and see how similar the outputs are. You can assess their similarity using metrics such as the Levenshtein distance, semantic similarity, or an LLM judge.