LLM logs are useful, but it’s hard to prioritize which production logs to review. High-entropy responses, where rerunning the same input produces a different output, are a good starting point.
Suppose we have a function `llm_app` which we want to test for high-entropy responses.
In our evaluation function `is_unreliable_input`, we rerun our function and compare the new output with the original one.
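A minimal sketch of such a rerun-and-compare check is shown here. The signature (passing the app as a callable alongside the input and the original output) and the `normalize` helper are assumptions made for illustration, and the comparison used is a plain exact match after normalization; the text does not specify how the two outputs are compared.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Hypothetical helper: ignore case and surrounding whitespace."""
    return text.strip().lower()


def is_unreliable_input(
    app: Callable[[str], str],
    user_input: str,
    original_output: str,
) -> bool:
    """Return True if rerunning `app` on the same input gives a different output.

    Assumption: outputs are compared by exact match after normalization.
    A similarity score or an LLM-based comparison could be swapped in instead.
    """
    new_output = app(user_input)  # rerun the function on the same input
    return normalize(new_output) != normalize(original_output)
```

An input flagged `True` here is one where the app did not reproduce its earlier answer, which makes the corresponding log a candidate for manual review.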
A benefit of `trace` is that our eval is executed in the background (so it adds no latency), and we can choose to apply it to only a fraction of the cases (here 10%).
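To illustrate the idea of background execution with sampling, here is a standalone sketch. It is not the `trace` implementation: `trace_sketch`, its parameters, the `pending` list, and the eval signature `(fn, user_input, output)` are all hypothetical names chosen for this example, built only on the Python standard library.

```python
import functools
import random
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

# Shared worker pool so evals run off the caller's thread.
_executor = ThreadPoolExecutor(max_workers=2)
# Futures are kept so callers can wait for or inspect eval results.
pending: List[Future] = []


def trace_sketch(eval_func: Callable, apply_eval_frac: float = 1.0):
    """Hypothetical stand-in decorator: run `eval_func` in the background
    for roughly `apply_eval_frac` of the calls."""

    def decorator(fn: Callable[[str], str]):
        @functools.wraps(fn)
        def wrapper(user_input: str) -> str:
            output = fn(user_input)
            # Sample: random.random() is in [0, 1), so 0.0 never fires
            # and 1.0 always fires.
            if random.random() < apply_eval_frac:
                pending.append(
                    _executor.submit(eval_func, fn, user_input, output)
                )
            return output  # returned without waiting for the eval

        return wrapper

    return decorator
```

With `apply_eval_frac=0.1`, about one call in ten would schedule the eval, and the caller gets its result back without waiting for the rerun to finish.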