General purpose

Generally applicable evaluation functions:

  • Judge LLM: Uses another LLM to judge the generated output (see the sketch below).
  • Relevance of Generated Response to Query: Generates queries that match the LLM response and calculates the similarity between the original query and the generated queries.
  • SelfCheck: Samples additional responses from the LLM to estimate the uncertainty of the original response.
  • LLM vs LLM: Measures the factuality of a claim using an examining LLM, which asks the other LLM follow-up questions until it reaches a conclusion.
  • Answer Matches Target Recall: Percentage of tokens in the target/reference answer that also appear in the model generation (see the sketch below).
  • Answer Matches Target LLM Grader: Quantifies how closely the generated answer matches the ground truth/target via an LLM.
  • Semantic Similarity: Measures the semantic similarity between the generated answer and the target/reference answer (see the sketch below).
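
A minimal sketch of the Judge LLM pattern, assuming the OpenAI Python SDK; the prompt wording, model name, and 1-5 rating scale are illustrative assumptions, not the exact implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_llm(query: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge LLM to rate a response on an illustrative 1-5 scale."""
    prompt = (
        "Rate how well the response answers the query on a scale from 1 (poor) "
        "to 5 (excellent). Reply with a single integer.\n\n"
        f"Query: {query}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```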
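
A minimal sketch of Answer Matches Target Recall, assuming simple whitespace tokenization and lowercasing (the actual tokenization may differ):

```python
def answer_matches_target_recall(generation: str, target: str) -> float:
    """Fraction of target tokens that also appear in the model generation."""
    target_tokens = target.lower().split()
    generation_tokens = set(generation.lower().split())
    if not target_tokens:
        return 0.0
    matched = sum(1 for token in target_tokens if token in generation_tokens)
    return matched / len(target_tokens)

# Example: 3 of the 4 target tokens appear in the generation -> 0.75
print(answer_matches_target_recall("paris is the capital", "paris is a capital"))
```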
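
A minimal sketch of Semantic Similarity using cosine similarity over sentence embeddings; the sentence-transformers model is an illustrative choice:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def semantic_similarity(generated: str, target: str) -> float:
    """Cosine similarity between embeddings of the generated and target answers."""
    embeddings = model.encode([generated, target], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```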

Chatbot apps

To evaluate chatbots, we provide the following evaluation functions:

  • Goal Success Ratio: The average number of queries a user needs to send to reach their goal (see the sketch below).
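
A minimal sketch of Goal Success Ratio, assuming each conversation has been annotated with the number of user queries sent before the goal was reached (this data format is illustrative):

```python
from statistics import mean

def goal_success_ratio(queries_until_goal: list[int]) -> float:
    """Average number of user queries needed to reach the goal."""
    return mean(queries_until_goal)

# Example: goals reached after 2, 4, and 3 queries -> 3.0
print(goal_success_ratio([2, 4, 3]))
```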

RAG apps

To evaluate retrieval augmented generation (RAG) apps, we provide the following evaluation functions:

  • Relevance of Context to Query: Quantifies how relevant the retrieved context is to the query.

  • Context Ranking Pointwise: Quantifies how well the retrieved contexts are ranked by their relevance, scoring each context individually.

  • Context Ranking Listwise: Quantifies how well the retrieved contexts are ranked by their relevance, scoring the ordering of the full list.

  • Answer Context Faithfulness Binary: Assesses whether the generated answer can be inferred from the retrieved context.

  • Answer Context Faithfulness Precision: Percentage of tokens in the model generation that are also present in the retrieved context (see the sketch below).

  • Answer Context Faithfulness Statement Level: Quantifies, statement by statement, how much of the generated answer can be inferred from the retrieved context.

  • % Target Supported by Context: Percentage of sentences in the target/ground truth that are supported by the retrieved context.
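
A minimal sketch of Answer Context Faithfulness Precision, assuming whitespace tokenization and lowercasing (the actual tokenization may differ):

```python
def answer_context_faithfulness_precision(generation: str, context: str) -> float:
    """Fraction of generated tokens that are also present in the retrieved context."""
    generation_tokens = generation.lower().split()
    context_tokens = set(context.lower().split())
    if not generation_tokens:
        return 0.0
    matched = sum(1 for token in generation_tokens if token in context_tokens)
    return matched / len(generation_tokens)
```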

Summarization apps

To evaluate summarization apps, we provide the following evaluation functions:

  • Factual Inconsistency (Binary): Measures the factual consistency of a summary given a source, returning a binary judgment (see the sketch below).
  • Factual Inconsistency (Scale 1-10): Measures the factual consistency of a summary given a source on a scale from 1 to 10.
  • Summary Quality (Likert Scale): Measures the quality of a summary given a source on a Likert scale (dimensions: relevance, consistency, fluency, and coherence).
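
A minimal sketch of the binary factual-consistency check behind Factual Inconsistency (Binary), assuming an LLM judge via the OpenAI Python SDK; the prompt wording and model name are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summary_is_consistent(source: str, summary: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge LLM deems the summary factually consistent with the source."""
    prompt = (
        "Is the following summary factually consistent with the source? "
        "Answer with exactly 'yes' or 'no'.\n\n"
        f"Source: {source}\n\nSummary: {summary}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```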

Read our blog post on the research behind these metrics: evaluation metrics which don’t require a target answer.