# How to detect unreliable behavior of LLM apps

> LLM logs are useful, but it's hard to decide which production logs to review first. High-entropy responses are a good starting point.

[Joschka Braun](https://joschkabraun.com) on Aug 6, 2024

<Info>
  We help companies build & improve their AI products with our hands-on
  services. Request a consultation
  [here](https://calendly.com/parea-ai/consulting).
</Info>

[Sixfold](https://www.sixfold.ai) builds a risk assessment AI solution for insurance underwriters.
Recently, we talked to their team, who came up with a simple yet powerful way to assess the reliability of LLM apps using Parea.
This blog post goes into more detail about that approach.

## Idea: Identify High Entropy Outputs

The idea is to automatically flag any logs with high-entropy responses.
To do that, rerun your LLM app on the logged input and measure the difference between the original output and the new one.
The larger the distance between them, the more likely it is that your LLM app is unreliable for that input.
The sections below discuss distance metrics that are useful for comparing two responses.

The easiest way to build intuition for this is to think about prompt engineering.
How often have you rerun a prompt and noticed that it only works half of the time?
If that is the case, you are dealing with high-entropy (uncertain) responses: the prompt isn't reliable for this input.
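The rerun-and-compare idea can be sketched independently of any tooling. The snippet below is a minimal illustration under stated assumptions, not Parea's API: `flag_unreliable`, `text_distance`, `toy_app`, and the `threshold` value are hypothetical names chosen for this example, and the distance is a simple stand-in built on Python's standard `difflib` module.

```python
import difflib
from typing import Callable


def text_distance(a: str, b: str) -> float:
    """Return 0.0 for identical strings and values up to 1.0 for dissimilar ones."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def flag_unreliable(
    app: Callable[..., str],
    inputs: dict,
    original_output: str,
    threshold: float = 0.3,  # hypothetical cut-off; tune it on your own data
) -> bool:
    """Rerun the app on the logged inputs and flag the log if the outputs diverge."""
    rerun_output = app(**inputs)
    return text_distance(original_output, rerun_output) > threshold


# Hypothetical deterministic app, used only to make the sketch runnable.
def toy_app(question: str) -> str:
    return f"Answer to: {question}"


inputs = {"question": "What is covered?"}
stable = flag_unreliable(toy_app, inputs, "Answer to: What is covered?")
drifted = flag_unreliable(toy_app, inputs, "Completely unrelated text")
```

Here `stable` is `False` because the rerun reproduces the logged output, while `drifted` is `True` because the logged output has little in common with the rerun. A real LLM app is not deterministic, which is exactly what this check exploits.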

## Implementation Using Parea

To see how easily this workflow can be implemented with Parea, let's take a look at the following example.
We have some function `llm_app` which we want to test for high-entropy responses.
In our evaluation function `is_unreliable_input`, we rerun the function on the logged inputs and compare the new output with the original one.

```python
from parea import trace
from parea.schemas import Log

# function which does the work
def llm_app(**kwargs) -> str:
    pass

def is_unreliable_input(log: Log) -> bool:
    # the output of the first execution
    output1 = log.output
    # call the LLM app again
    output2 = llm_app(**log.inputs)
    # return whether the two outputs are different
    return output1 != output2

# trace the function and apply the evaluation function in 10% of the cases
@trace(eval_funcs=[is_unreliable_input], apply_eval_frac=0.1)
def tested_llm_app(**kwargs) -> str:
    return llm_app(**kwargs)
```

The benefit of doing this via `trace` is that the eval is executed in the background (so it adds no latency) and that we can choose to apply it to only a fraction of the cases (here 10%).

## Metric 1: Levenshtein Distance

A simple way to measure the difference between two strings is to calculate [the Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
We can use it in our eval function in the following way:

```python
from parea.evals.general.levenshtein import levenshtein_distance
from parea.schemas import Log

def is_unreliable_input(log: Log) -> float:
    output1 = log.output
    output2 = llm_app(**log.inputs)
    return 1 - levenshtein_distance(output1, output2)
```
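For intuition about what such a score measures, here is a self-contained sketch of the Levenshtein distance, plus a version normalized to the range [0, 1]. It does not reproduce Parea's helper: `edit_distance` and `normalized_edit_distance` are hypothetical names, and dividing by the length of the longer string is just one common normalization choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1]; 0.0 means identical strings."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

With a normalized score, outputs of very different lengths become comparable, and a single threshold can be applied across logs.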

## Metric 2: Semantic Similarity

The drawback of traditional NLP metrics such as the Levenshtein distance is that they cannot capture the semantic similarity between two strings.
Instead, we can embed both outputs as vectors, calculate their cosine similarity, and return one minus that similarity as our metric.

```python
from parea.evals.general import semantic_similarity_factory
from parea.schemas import Log

semantic_similarity = semantic_similarity_factory(embd_model="text-embedding-3-small")

def is_unreliable_input(log: Log) -> float:
    output1 = log.output
    output2 = llm_app(**log.inputs)
    return 1 - semantic_similarity(Log(output=output1, target=output2))
```
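To make the "one minus cosine similarity" step concrete, the following sketch computes it on plain Python lists. It makes no call to an embedding model: the vectors are toy values standing in for real embeddings, and `cosine_similarity` and `cosine_distance` are hypothetical helper names, not part of Parea.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Higher values mean the two embedded outputs are less alike."""
    return 1.0 - cosine_similarity(u, v)


# Toy 3-dimensional vectors standing in for real embeddings.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction give a distance close to 0, and orthogonal vectors give a distance of 1, regardless of their lengths.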

## Metric 3: LLM Judge

Embedding models give a general sense of how closely related the two outputs are.
Yet, you often care about specific aspects of the outputs.
For that scenario, you should use an LLM judge and instruct it to compare the two outputs along certain dimensions.

```python
from parea.evals import call_openai
from parea.schemas import Log

def is_unreliable_input(log: Log) -> bool:
    output1 = log.output
    output2 = llm_app(**log.inputs)

    response = call_openai(
        model='gpt-4o',
        messages=[
            {"role": "system",
             "content": "You are CompareGPT, a machine to verify if two responses match. Answer with only yes/no."},
            {
                "role": "user",
                "content": f"""You are given two responses. Compare "Response A" and "Response B" to determine whether they match. All information in "Response A" must be present in "Response B", including numbers and dates. You must answer "no" if there are any specific details in "Response A" that are not mentioned in "Response B".

Response A:
{output1}

Response B:
{output2}

CompareGPT response:""",
            },
        ],
        temperature=0.0,
    )
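    # "yes" means the outputs match, so the input is unreliable only if the judge does not say "yes"
    return "yes" not in response.lower()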
```
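A substring check on the judge's reply can misfire when the model adds extra words, for example a reply that starts with "No" but mentions "yes" later on. One possible way to tighten this, assuming the judge follows the instruction to lead with its verdict, is to look only at the first word. The helper name `judge_says_match` is hypothetical.

```python
import re


def judge_says_match(response: str) -> bool:
    """Return True only if the first word of the judge's reply is 'yes'."""
    words = re.findall(r"[a-z]+", response.lower())
    return bool(words) and words[0] == "yes"
```

The eval function could then return `not judge_says_match(response)`, so that an empty or malformed reply is treated as a mismatch and gets flagged for review.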

Note that the [SelfCheckGPT](https://arxiv.org/abs/2303.08896) paper explores a related approach to measuring the factuality of LLM responses.

## Conclusion

A key aspect of improving your LLM app is ensuring that it reliably handles the long tail of use cases.
One general way to identify unreliable samples is to rerun your LLM app on these inputs and see how similar the outputs are.
You can assess their similarity with metrics such as the Levenshtein distance, semantic similarity, or an LLM judge.
