Generally applicable evaluation functions:
Judge LLM: Uses another LLM to judge the output.
Relevance of Generated Response to Query: Generates queries that match the LLM response and measures the similarity between the original query and the generated queries.
SelfCheck: Samples additional responses from the LLM to estimate the uncertainty of the response.
LLM vs LLM: Measures the factuality of a claim using an examining LLM, which asks the other LLM follow-up questions until it reaches a conclusion.
Answer Matches Target Recall: Percent of tokens in the target/reference answer that also appear in the model generation.
Answer Matches Target LLM Grader: Quantifies how closely the generated answer matches the ground truth / target via an LLM.
Semantic Similarity: Measures the semantic similarity between generated answer and target/reference answer.
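As a concrete illustration of a target-based metric, here is a minimal sketch of Answer Matches Target Recall. The function name and whitespace-based, case-insensitive tokenization are assumptions for illustration, not the library's actual implementation:

```python
def answer_matches_target_recall(generation: str, target: str) -> float:
    """Fraction of target/reference tokens that also appear in the generation.

    Assumes simple lowercase whitespace tokenization (an illustrative choice).
    """
    gen_tokens = set(generation.lower().split())
    target_tokens = target.lower().split()
    if not target_tokens:
        return 0.0
    return sum(t in gen_tokens for t in target_tokens) / len(target_tokens)
```

A score of 1.0 means every token of the reference answer shows up somewhere in the generation; note that this says nothing about extra, unsupported tokens in the generation, which is what precision-style metrics capture.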
To evaluate chatbots, we provide the following evaluation functions:
Goal Success Ratio: The average number of queries a user needs to send to reach their goal.
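A minimal sketch of how this could be computed, assuming the conversation has already been segmented into goals (the function name and the pre-segmented input are assumptions for illustration):

```python
def goal_success_ratio(queries_per_goal: list[int]) -> float:
    """Average number of user queries sent per goal reached.

    Assumes upstream logic has already segmented the conversation,
    so queries_per_goal[i] counts the user queries spent on goal i.
    """
    if not queries_per_goal:
        return 0.0
    return sum(queries_per_goal) / len(queries_per_goal)
```

Lower is better here: a chatbot that resolves goals in fewer user turns scores lower.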
To evaluate retrieval augmented generation (RAG) apps, we provide the following evaluation functions:
Relevance of Context to Query: Quantifies how much the retrieved context relates to the query.
Context Ranking Pointwise: Quantifies whether the retrieved contexts are ranked by their relevance, scoring each context individually.
Context Ranking Listwise: Quantifies whether the retrieved contexts are ranked by their relevance, judging the ordering of the full list at once.
Answer Context Faithfulness Binary: Assesses if the generated answer can be inferred from the retrieved context.
Answer Context Faithfulness Precision: Percent of tokens in the model generation that are also present in the retrieved context.
Answer Context Faithfulness Statement Level: Quantifies how many statements of the generated answer can be inferred from the retrieved context.
% Target Supported by Context: Percent of sentences in the target/ground truth that are supported by the retrieved context.
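To make the precision-style faithfulness metric concrete, here is a minimal sketch of Answer Context Faithfulness Precision. As above, the function name and the lowercase whitespace tokenization are illustrative assumptions, not the library's exact implementation:

```python
def answer_context_faithfulness_precision(generation: str, context: str) -> float:
    """Fraction of generation tokens that also appear in the retrieved context.

    Assumes simple lowercase whitespace tokenization (an illustrative choice).
    """
    context_tokens = set(context.lower().split())
    gen_tokens = generation.lower().split()
    if not gen_tokens:
        return 0.0
    return sum(t in context_tokens for t in gen_tokens) / len(gen_tokens)
```

This is the mirror image of the recall metric above: it penalizes tokens the model generated that have no support in the retrieved context, which is a cheap proxy for hallucination.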
To evaluate summarization apps, we provide the following evaluation functions:
Factual Inconsistency (Binary): Judges whether a summary is factually consistent with its source, returning a yes/no verdict.
Factual Inconsistency (Scale 1-10): Rates the factual consistency of a summary given its source on a 1-10 scale.
Summary Quality (Likert Scale): Measures the quality of a summary given a source on a Likert scale (dimensions: relevance, consistency, fluency, and coherence).
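The summarization metrics above are judge-LLM metrics. A minimal sketch of the binary variant, split into building the judge prompt and parsing the verdict; the prompt wording, function names, and parsing convention are all assumptions for illustration (the actual LLM call is omitted):

```python
def factual_inconsistency_binary_prompt(source: str, summary: str) -> str:
    """Build a judge prompt whose Yes/No answer becomes the binary score."""
    return (
        "Is the following summary factually consistent with the source?\n\n"
        f"Source:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer with exactly one word: Yes or No."
    )

def parse_binary_verdict(llm_answer: str) -> bool:
    """Map the judge LLM's free-text reply to a boolean consistency verdict."""
    return llm_answer.strip().lower().startswith("yes")
```

The 1-10 variant would follow the same shape, asking the judge for an integer rating instead of a one-word verdict.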
Read our blog post on the research behind these metrics: evaluation metrics which don’t require a target answer.