How to measure the performance of retrieval applications without ground truth data.
The LLM is prompted to respond with a field called thoughts (which gives the model the ability to think before deciding) and a field called final_verdict (which is used to parse the decision of the LLM).
This is encapsulated in Parea’s pre-built LLM evaluation (implementation
and docs), which leverages gpt-3.5-turbo-0125 as the default LLM.
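To make the response format concrete, here is a minimal sketch of such a judge call, assuming the OpenAI Python client; the prompt wording, the judged criterion, and the judge_retrieval helper are illustrative assumptions, not Parea's actual implementation.

```python
# Minimal sketch of an LLM judge that returns a thoughts field
# (chain-of-thought) and a final_verdict field. Prompt wording and the
# judged criterion are illustrative, not Parea's actual implementation.
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Judge whether the retrieved context contains the information needed to answer the question.

Question:
{question}

Retrieved context:
{context}

Respond in JSON with two fields:
- "thoughts": your step-by-step reasoning before deciding
- "final_verdict": "yes" or "no"
"""


def judge_retrieval(question: str, context: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",  # the default LLM mentioned above
        response_format={"type": "json_object"},  # JSON mode for easy parsing
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, context=context)}
        ],
    )
    parsed = json.loads(response.choices[0].message.content)
    # Only final_verdict is parsed into the score; thoughts exists so the
    # model can reason before committing to a decision.
    return parsed["final_verdict"].strip().lower() == "yes"
```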
To improve the accuracy of the LLM-based eval metric, few-shot examples are used. We compare the following configurations (a sketch of how they could be assembled is shown below):
- 0_shot: no few-shot examples
- 1_shot_false_sample_1: 0_shot with few-shot example 1
- 1_shot_false_sample_2: 0_shot with few-shot example 2
- 2_shot_false_1_false_2: 0_shot with first few-shot example 1, then 2
- 2_shot_false_2_false_1: 0_shot with first few-shot example 2, then 1
Additionally, we compare using versus not using the thoughts field in the response.
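To make these configurations concrete, here is a sketch of how the few-shot prompts could be assembled; the FEW_SHOT_EXAMPLE_* contents, the CONFIGURATIONS mapping, and the build_messages/use_thoughts names are illustrative assumptions rather than Parea's actual implementation.

```python
# Illustrative sketch of assembling the configurations listed above.
# FEW_SHOT_EXAMPLE_1 / FEW_SHOT_EXAMPLE_2 stand in for hand-picked
# demonstrations whose verdict is "no" (the "false samples" in the names);
# their content here is toy data.
import json

FEW_SHOT_EXAMPLE_1 = {
    "input": "Question: Who wrote Hamlet?\nRetrieved context: Hamlet is a tragedy set in Denmark.",
    "output": {"thoughts": "The context never names the author.", "final_verdict": "no"},
}
FEW_SHOT_EXAMPLE_2 = {
    "input": "Question: When was the company founded?\nRetrieved context: The company is headquartered in Berlin.",
    "output": {"thoughts": "The context gives a location but no founding date.", "final_verdict": "no"},
}

CONFIGURATIONS = {
    "0_shot": [],
    "1_shot_false_sample_1": [FEW_SHOT_EXAMPLE_1],
    "1_shot_false_sample_2": [FEW_SHOT_EXAMPLE_2],
    "2_shot_false_1_false_2": [FEW_SHOT_EXAMPLE_1, FEW_SHOT_EXAMPLE_2],
    "2_shot_false_2_false_1": [FEW_SHOT_EXAMPLE_2, FEW_SHOT_EXAMPLE_1],
}


def build_messages(config_name: str, judge_prompt: str, use_thoughts: bool = True) -> list[dict]:
    """Prepend the chosen few-shot examples as prior user/assistant turns.

    When use_thoughts is False, the thoughts field is dropped from the
    example responses, which is one way the thoughts-field ablation could
    be implemented.
    """
    messages = []
    for example in CONFIGURATIONS[config_name]:
        output = dict(example["output"])
        if not use_thoughts:
            output.pop("thoughts")
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": json.dumps(output)})
    messages.append({"role": "user", "content": judge_prompt})
    return messages
```

Placing the examples as prior user/assistant turns keeps the judge prompt itself unchanged across configurations, so only the demonstrations vary between runs.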
In the bar plots below, you can see the effect on the accuracy of the eval metric when not using the thoughts field (blue) and when using the thoughts field (orange).
While there is a positive effect on the Q&A subset (1st plot below), the effect is less pronounced than on the Paraphrasing subset (2nd plot below), where improvements are up to 17% in absolute accuracy (4th bar).
In particular, it’s interesting that the effectiveness of chain-of-thought increases as few-shot examples are added (bars 2 to 4 in the lower plot).