How to measure the performance of LLM applications without ground truth data.
One approach uses gpt-3.5-turbo to evaluate biography generations.
Code: here.
Another study uses gpt-3.5-turbo-0301 to assess the factuality of a summary by measuring how consistent the summary is with the original text, posed as both a binary classification task and a grading task. They find that gpt-3.5-turbo-0301 outperforms baseline methods such as SummaC and QuestEval at identifying factually inconsistent summaries. They also find that using gpt-3.5-turbo-0301 leads to a higher correlation with human expert judgment when grading the factuality of summaries on a scale from 1 to 10.
Code: binary classification and 1-10 grading.
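The two setups above can be sketched as follows. This is a minimal illustration, not the study's exact prompts: the prompt wording is an assumption, and `complete` stands in for any chat-completion call (e.g. to gpt-3.5-turbo-0301), so the sketch can be tested without API access.

```python
"""LLM-as-judge factual consistency, posed two ways:
(a) binary classification and (b) grading on a 1-10 scale."""
import re


def binary_prompt(article: str, summary: str) -> str:
    # Assumed prompt wording; the study's actual prompt may differ.
    return (
        "Decide whether the summary is factually consistent with the article.\n"
        f"Article: {article}\n"
        f"Summary: {summary}\n"
        "Answer with exactly one word: Yes or No."
    )


def grading_prompt(article: str, summary: str) -> str:
    return (
        "Rate the factual consistency of the summary with the article on a "
        "scale from 1 (completely inconsistent) to 10 (fully consistent).\n"
        f"Article: {article}\n"
        f"Summary: {summary}\n"
        "Answer with a single integer."
    )


def parse_binary(response: str) -> bool:
    return response.strip().lower().startswith("yes")


def parse_grade(response: str) -> int:
    match = re.search(r"\d+", response)
    if match is None:
        raise ValueError(f"no grade found in {response!r}")
    return int(match.group())


def judge(article: str, summary: str, complete, mode: str = "binary"):
    """`complete` is any prompt -> response callable (e.g. an LLM call)."""
    if mode == "binary":
        return parse_binary(complete(binary_prompt(article, summary)))
    return parse_grade(complete(grading_prompt(article, summary)))
```

In practice `complete` would wrap a chat-completion request; separating prompt building and answer parsing from the API call keeps both halves unit-testable.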
A third study uses gpt-3.5-turbo-0301 to evaluate summaries on a 1-5 Likert scale along the dimensions of relevance, consistency, fluency, and coherence. They find that this method outperforms other methods in most cases in terms of correlation with human expert annotation, although BARTScore remains competitive with gpt-3.5-turbo-0301.
Code: Likert scale grading.
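The per-dimension Likert grading can be sketched like this. The prompt text is an assumption (not the study's exact prompt), and `complete` again stands in for a chat-completion call:

```python
"""Grade a summary on a 1-5 Likert scale along four dimensions,
querying the judge model once per dimension."""
DIMENSIONS = ("relevance", "consistency", "fluency", "coherence")


def likert_prompt(article: str, summary: str, dimension: str) -> str:
    # Assumed prompt wording for illustration.
    return (
        f"Rate the {dimension} of the summary with respect to the article "
        "on a Likert scale from 1 (worst) to 5 (best).\n"
        f"Article: {article}\n"
        f"Summary: {summary}\n"
        "Answer with a single integer from 1 to 5."
    )


def grade_summary(article: str, summary: str, complete) -> dict:
    """Return {dimension: score}; `complete` maps a prompt to a response."""
    scores = {}
    for dim in DIMENSIONS:
        score = int(complete(likert_prompt(article, summary, dim)).strip())
        if not 1 <= score <= 5:
            raise ValueError(f"{dim} score out of range: {score}")
        scores[dim] = score
    return scores
```

One call per dimension keeps each rating independent; the scores could also be requested in a single structured response to save tokens, at the cost of more brittle parsing.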