Python
Setup
Set PAREA_API_KEY as an environment variable, or in a .env file.
Install the Parea package (parea).
Parea
The Parea object is used to initialize automatic tracing of any OpenAI call as well as to interact with the Parea API. You should initialize it at the beginning of your LLM application with your API key (api_key). You can organize your logs and experiments by specifying a project name (project_name); otherwise, they will all appear under the default project. You can also specify a cache to automatically cache any OpenAI calls.
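A minimal initialization sketch, assuming the SDK is installed and PAREA_API_KEY is set as described above; the project name and the optional .env loading via python-dotenv are illustrative.

```python
import os

from dotenv import load_dotenv  # optional: reads PAREA_API_KEY from a .env file
from parea import Parea

load_dotenv()

# "my-first-project" is an illustrative project name; omit project_name to log
# everything under the default project.
p = Parea(api_key=os.environ["PAREA_API_KEY"], project_name="my-first-project")
```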
wrap_openai_client
This method patches the OpenAI client to automatically trace any OpenAI call made through the client. You only need to call this method if your OpenAI package version is >= 1.0.0, and you are not using the module-level client. Only call this method once after initializing the OpenAI client.
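A sketch of wrapping an explicitly instantiated client (OpenAI Python SDK >= 1.0.0); the model name and prompt are placeholders.

```python
import os

from openai import OpenAI
from parea import Parea

p = Parea(api_key=os.environ["PAREA_API_KEY"])

client = OpenAI()
p.wrap_openai_client(client)  # patch the client once; subsequent calls are traced

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```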
trace
The trace decorator is used to trace a function, capture its inputs and outputs, and apply evaluation functions to its output.
It automatically attaches the current trace to the parent trace, if one exists, or sets it as the current trace.
This creates a nested trace structure, which can be viewed in the logs.
Parameters
- name: The name of the trace. If not provided, the function’s name will be used.
- tags: A list of tags to attach to the trace.
- metadata: A dictionary of metadata to attach to the trace.
- target: An optional ground truth/expected output for the inputs; it can be used by evaluation functions.
- end_user_identifier: An optional identifier for the end user that is using your application.
- eval_funcs_names: A list of names of evaluation functions, created in the Datasets tab, to evaluate the output of the traced function. They will be applied non-blocking and asynchronously in the backend.
- eval_funcs: A list of evaluation functions, defined in your code, to evaluate the output of the traced function.
- access_output_of_func: An optional function that takes the output of the traced function and returns the value which should be used as output of the function for evaluation functions.
- apply_eval_frac: The fraction of times the evaluation functions should be applied. Defaults to 1.0.
- deployment_id: The deployment id of a prompt configuration.
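A sketch of a nested trace with an in-code evaluation function; call_llm is a hypothetical helper, and the import path for Log is an assumption based on the parea.schemas section below.

```python
from parea import Parea, trace
from parea.schemas.log import Log  # assumed import path; see parea.schemas below

p = Parea(api_key="...")  # or read the key from PAREA_API_KEY

def is_short_enough(log: Log) -> float:
    # Illustrative eval: score 1.0 if the output stays under 200 characters.
    return float(len(log.output or "") < 200)

@trace(eval_funcs=[is_short_enough], apply_eval_frac=0.5)
def answer(question: str) -> str:
    return call_llm(question)  # hypothetical helper that queries an LLM

@trace  # parent trace; the call to answer() is attached as a child trace
def pipeline(question: str) -> str:
    return answer(question)
```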
Experiment
The Experiment class is used to define an experiment of your LLM application. It is initialized with the data to run the experiment on (data) and the entry point/function (func). You can read more about running experiments here.
run
This method runs the experiment and saves the stats to the experiment_stats attribute. You can optionally specify the name of the experiment as an argument.
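A sketch of defining and running an experiment; the dataset, the entry function, and the import path for Experiment are assumptions and may differ between SDK versions.

```python
from parea import Parea, trace
from parea.experiment.experiment import Experiment  # assumed import path; adjust for your SDK version

p = Parea(api_key="...")

@trace
def answer(question: str) -> str:
    return call_llm(question)  # hypothetical helper that queries an LLM

# Each dict of the dataset is passed as keyword arguments to `func`.
data = [
    {"question": "What is the capital of France?"},
    {"question": "Who wrote Hamlet?"},
]

experiment = Experiment(data=data, func=answer)
experiment.run(name="quickstart-experiment")
print(experiment.experiment_stats)
```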
parea.evals
We provide off-the-shelf evaluation metrics for general scenarios and for special scenarios: RAG, summarization & chat. Oftentimes, they come in the form of a factory which, for example, requires the field names/keys used to identify the contexts provided for RAG or the question asked by the user.
The general setup of an evaluation is to receive a Log data structure and return a float or a boolean. If it is a factory, the factory returns an evaluation function.
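For example, a custom evaluation is just a callable that takes a Log and returns a score; a minimal sketch (the exact-match logic is illustrative, and the import path is an assumption):

```python
from parea.schemas.log import Log  # assumed import path for the Log schema

def exact_match(log: Log) -> float:
    """Illustrative eval: 1.0 if the output equals the target, else 0.0."""
    if log.output is None or log.target is None:
        return 0.0
    return float(log.output.strip() == log.target.strip())
```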
parea.evals.general
General Purpose Evaluation Metrics
levenshtein
This evaluation metric measures the Levenshtein distance between the generated output and the target, i.e., the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the generated output into the target. The distance is then normalized by the length of the target or the generated output, whichever is longer.
llm_grader_factory
This factory creates an evaluation function that uses an LLM to grade the response of an LLM to a given question. It is based on the paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, which introduces a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1-10. They find that GPT-4’s ratings agree as much with a human rater as a human annotator agrees with another one (>80%). Further, they observe that the agreement with a human annotator increases as the response rating gets clearer. Additionally, they investigated how much the evaluating LLM overestimated its responses and found that GPT-4 and Claude-1 were the only models that didn’t overestimate themselves.
Parameters
- model: The model which should be used for grading. Currently, only supports OpenAI chat models.
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
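A sketch of instantiating the factory and attaching the resulting eval to a traced function; call_llm is a hypothetical helper.

```python
from parea import trace
from parea.evals.general import llm_grader_factory

# The factory returns an evaluation function bound to the chosen grading model.
llm_grader = llm_grader_factory(model="gpt-4", question_field="question")

@trace(eval_funcs=[llm_grader])
def answer(question: str) -> str:
    return call_llm(question)  # hypothetical helper that queries an LLM
```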
answer_relevancy_factory
This factory creates an evaluation function that measures how relevant the generated response is to the given question. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to generate multiple questions that fit the generated answer and measure the cosine similarity of the generated questions with the original one.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- n_generations: The number of questions which should be generated. Defaults to 3.
self_check
Given that many API-based LLMs don’t reliably give access to the log probabilities of the generated tokens, assessing the certainty of LLM predictions via perplexity isn’t possible. The SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models paper suggests measuring the average factuality of every sentence in a generated response. They generate additional responses from the LLM at a high temperature and check how much every sentence in the original answer is supported by the other generations. The intuition behind this is that if the LLM knows a fact, it’s more likely to sample it. The authors find that this works well in detecting non-factual and factual sentences and ranking passages in terms of factuality. The authors noted that correlation with human judgment doesn’t increase after 4-6 additional generations when using gpt-3.5-turbo to evaluate biography generations.
lm_vs_lm_factuality_factory
This factory creates an evaluation function that measures the factuality of an LLM’s response to a given question. It is based on the paper LM vs LM: Detecting Factual Errors via Cross Examination which proposes using another LLM to assess an LLM response’s factuality. To do this, the examining LLM generates follow-up questions to the original response until it can confidently determine the factuality of the response. This method outperforms prompting techniques such as asking the original model, “Are you sure?” or instructing the model to say, “I don’t know,” if it is uncertain.
Parameters
- examiner_model: The model which will examine the original model. Currently, only supports OpenAI chat models.
semantic_similarity_factory
This factory creates an evaluation function that measures the semantic similarity of the generated response to the given question. It calculates the cosine similarity of the embeddings of the generated output and the target answer/ground truth. It transforms the -1 to 1 scale of the cosine similarity to a 0 to 1 scale by adding 1 and dividing by 2.
Parameters
- embd_model: The model which should be used for embedding. Currently, only supports OpenAI embedding models.
Pre-built instances of this factory are also available.
parea.evals.rag
RAG Specific Evaluation Metrics
context_query_relevancy_factory
This factory creates an evaluation function that measures how relevant the retrieved context is to the given question. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to extract any sentence from the retrieved context relevant to the query. Then, calculate the ratio of relevant sentences to the total number of sentences in the retrieved context.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
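A sketch of wiring this RAG eval to a traced function whose inputs include the retrieved context; the field names mirror the defaults above, and call_llm is a hypothetical helper.

```python
from parea import trace
from parea.evals.rag import context_query_relevancy_factory

context_query_relevancy = context_query_relevancy_factory(
    question_field="question",
    context_fields=["context"],
)

@trace(eval_funcs=[context_query_relevancy])
def answer_with_rag(question: str, context: str) -> str:
    # The eval reads `question` and `context` from the trace inputs.
    return call_llm(f"Context: {context}\n\nQuestion: {question}")  # hypothetical helper
```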
context_ranking_pointwise_factory
This factory creates an evaluation function that measures how well the retrieved contexts are ranked by relevancy to the given query by pointwise estimation of the relevancy of every context to the query. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to check if every extracted context is relevant. Then, they measure how well the contexts are ranked by calculating the mean average precision. Note that this approach considers any two relevant contexts equally important/relevant to the query.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- ranking_measurement: Method to calculate ranking. Currently, only supports "average_precision".
context_ranking_listwise_factory
This factory creates an evaluation function that measures how well the retrieved contexts are ranked by relevancy to the given query by listwise estimation of the relevancy of every context to the query. It is based on the paper Zero-Shot Listwise Document Reranking with a Large Language Model which suggests using an LLM to rerank a list of contexts and use that to evaluate how well the contexts are ranked by relevancy to the given query. The authors used a progressive listwise reordering if the retrieved contexts don’t fit into the context window of the LLM.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- ranking_measurement: The measurement to use for ranking. Currently only supports "ndcg".
- n_contexts_to_rank: The number of contexts to rank listwise. Defaults to 10.
context_has_answer_factory
This factory creates an evaluation metric which assesses whether the given context contains the answer to the given question. It is useful for measuring the performance of a model on a question-answering task via hit rate, without needing to know the correct answer.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-3.5-turbo-0125".
answer_context_faithfulness_binary_factory
This factory creates an evaluation function that classifies if the generated answer was faithful to the given context. It is based on the paper Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering which suggests using an LLM to flag any information in the generated answer that cannot be deduced from the given context. They find that GPT-4 is the best model for this analysis as measured by correlation with human judgment.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
answer_context_faithfulness_precision_factory
This factory creates an evaluation function that calculates how many tokens in the generated answer are also present in the retrieved context. It is based on the paper Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering, which finds that this method only slightly lags behind GPT-4 and outperforms GPT-3.5-turbo (see Table 4 of the paper).
Parameters
- context_field: The key name/field used for the retrieved context. Defaults to "context".
answer_context_faithfulness_statement_level_factory
This factory creates an evaluation function that measures the faithfulness of the generated answer to the given context by measuring how many statements from the generated answer can be inferred from the given context. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to create a list of all statements in the generated answer and assessing whether the given context supports each statement.
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
parea.evals.chat
AI Assistant/Chatbot-Specific Evaluation Metrics
goal_success_ratio_factory
This factory creates an evaluation function that measures the success ratio of a goal-oriented conversation. Typically, a user interacts with a chatbot or AI assistant to achieve specific goals. This motivates measuring the quality of a chatbot by counting how many messages a user has to send before they reach their goal. One can further break this down by successful and unsuccessful goals to analyze user & LLM behavior.
Concretely:
- Delineate the conversation into segments by splitting them by the goals the user wants to achieve.
- Assess if every goal has been reached.
- Calculate the average number of messages sent per segment.
Parameters
- use_output: Boolean indicating whether to use the output of the log to access the messages. Defaults to False.
- message_field: The name of the field in the log that contains the messages. Defaults to None. If None, the messages are taken from the configuration attribute.
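A sketch of applying this eval to a chat function that returns the full message history as its output; the message format and the call_llm helper are illustrative.

```python
import json

from parea import trace
from parea.evals.chat import goal_success_ratio_factory

# Read the conversation from the traced function's output instead of its configuration.
goal_success_ratio = goal_success_ratio_factory(use_output=True)

@trace(eval_funcs=[goal_success_ratio])
def chat_turn(messages: list[dict]) -> str:
    reply = call_llm(messages)  # hypothetical helper returning the assistant's reply
    return json.dumps(messages + [{"role": "assistant", "content": reply}])
```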
parea.evals.summary
Evaluation Metrics for Summarization Tasks
factual_inconsistency_binary_factory
This factory creates an evaluation function that classifies if a summary is factually inconsistent with the original text. It is based on the paper ChatGPT as a Factual Inconsistency Evaluator for Text Summarization, which suggests using an LLM to assess the factuality of a summary by measuring how consistent the summary is with the original text, posed as a binary classification. They find that gpt-3.5-turbo-0301 outperforms baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries.
Parameters
- article_field: The key name/field used for the content which should be summarized. Defaults to "article".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
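A sketch of attaching this eval to a summarization function; the article field name matches the default, and call_llm is a hypothetical helper.

```python
from parea import trace
from parea.evals.summary import factual_inconsistency_binary_factory

factual_inconsistency_binary = factual_inconsistency_binary_factory(
    article_field="article",
    model="gpt-4",
)

@trace(eval_funcs=[factual_inconsistency_binary])
def summarize(article: str) -> str:
    return call_llm(f"Summarize the following text:\n\n{article}")  # hypothetical helper
```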
factual_inconsistency_scale_factory
This factory creates an evaluation function that grades the factual consistency of a summary with the article on a scale from 1 to 10. It is based on the paper ChatGPT as a Factual Inconsistency Evaluator for Text Summarization, which finds that using gpt-3.5-turbo-0301 leads to a higher correlation with human expert judgment when grading the factuality of summaries on a scale from 1 to 10 than baseline methods such as SummaC and QuestEval.
Parameters
- article_field: The key name/field used for the content which should be summarized. Defaults to "article".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
likert_scale_factory
This factory creates an evaluation function that grades the quality of a summary on a Likert scale from 1-5 along the dimensions of relevance, consistency, fluency, and coherence. It is based on the paper Human-like Summarization Evaluation with ChatGPT, which finds that using gpt-3.5-0301 leads to a higher correlation with human expert judgment when grading summaries on a Likert scale from 1-5 than baseline methods. Notably, BARTScore was very competitive with gpt-3.5-0301.
parea.evals.dataset_level
This module contains pre-built dataset-level evaluation metrics.
balanced_acc_factory
This factory creates an evaluation function that calculates the balanced accuracy of the score score_name across all the classes in the dataset.
Parameters
- score_name: The name of the score to calculate the balanced accuracy for.
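A sketch of instantiating the factory; the score name is illustrative, and how the resulting function is registered with an experiment depends on your SDK version.

```python
from parea.evals.dataset_level import balanced_acc_factory

# Compute balanced accuracy over the per-log scores produced by an eval named
# "exact_match" (illustrative score name).
balanced_acc = balanced_acc_factory(score_name="exact_match")
```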
parea.schemas
Data Structures / Schemas
log
Consists of classes which make up the Log class.
Log
This class encapsulates the logs from a traced function or LLM call. It consists of the following attributes:
- configuration: The configuration of the LLM call, if it was an LLM call.
- inputs: The key-value pairs of inputs fed into the traced function or, if it was a templated LLM call, the inputs to the prompt template.
- output: The output of the traced function or LLM call.
- target: The (optional) target/ground truth output of the traced function or LLM call.
- latency: The latency of the traced function or LLM call in seconds.
- input_tokens: The number of tokens in the inputs, if it was an LLM call.
- output_tokens: The number of tokens in the output, if it was an LLM call.
- total_tokens: The total number of tokens in the inputs and output, if it was an LLM call.
- cost: The cost, if it was an LLM call.
LLMInputs
All the input variables which were fed into the LLM call.
ModelParams
The parameters used for the LLM call.
Message
Role