Setup
Set PAREA_API_KEY as an environment variable, or in a .env file.
parea
Parea
The Parea object is used to initialize automatic tracing of any OpenAI call as well as to interact with the Parea API.
You should initialize it at the beginning of your LLM application with your API key (api_key). You can organize your
logs and experiments by specifying a project name (project_name); otherwise they will all appear under the
default project. You can also specify a cache to automatically cache any OpenAI calls.
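For example, a minimal initialization sketch (the project name here is just illustrative):

```python
import os
from parea import Parea

p = Parea(
    api_key=os.getenv("PAREA_API_KEY"),  # read from the environment or a .env file
    project_name="my-llm-app",           # optional; otherwise logs go to the "default" project
)
```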
wrap_openai_client
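Continuing the sketch above, wrapping an OpenAI client enables automatic tracing of its calls (assumes the openai v1 client; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()             # assumes OPENAI_API_KEY is set
p.wrap_openai_client(client)  # `p` is the Parea instance from the snippet above

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
```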
trace
The trace decorator is used to trace a function, capturing its inputs and outputs, as well as applying evaluation functions to its output.
It automatically attaches the current trace to the parent trace, if one exists, or sets it as the current trace.
This creates a nested trace structure, which can be viewed in the logs.
Parameters
- name: The name of the trace. If not provided, the function’s name will be used.
- tags: A list of tags to attach to the trace.
- metadata: A dictionary of metadata to attach to the trace.
- target: An optional ground truth/expected output for the inputs; it can be used by evaluation functions.
- end_user_identifier: An optional identifier for the end user that is using your application.
- eval_funcs_names: A list of names of evaluation functions, created in the Datasets tab, to evaluate the output of the traced function. They are applied non-blocking and asynchronously in the backend.
- eval_funcs: A list of evaluation functions, defined in your code, to evaluate the output of the traced function.
- access_output_of_func: An optional function that takes the output of the traced function and returns the value which should be used as the output for evaluation functions.
- apply_eval_frac: The fraction of times the evaluation functions should be applied. Defaults to 1.0.
- deployment_id: The deployment id of a prompt configuration.
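A sketch of how several of these parameters fit together; the eval function, tag, and metadata values are illustrative:

```python
from parea import trace
from parea.schemas.log import Log

def matches_target(log: Log) -> bool:
    # Illustrative in-code eval: compare the output against the target, if any.
    return log.target is not None and log.output == log.target

@trace(
    name="answer-question",            # defaults to the function name if omitted
    tags=["example"],
    metadata={"version": "v1"},
    end_user_identifier="user-123",
    eval_funcs=[matches_target],
    apply_eval_frac=0.5,               # evaluate only half of the traces
)
def answer_question(question: str) -> str:
    return "42"  # your LLM call(s) here; nested traced calls become child traces
```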
Experiment
The Experiment class is used to define an experiment of your LLM application. It is initialized with the data to run the
experiment on (data) and the entry point/function (func). You can read more about running experiments here.
run
This method runs the experiment and saves the stats to the experiment_stats attribute. You can optionally specify the
name of the experiment as an argument.
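A sketch, assuming Experiment can be imported from the parea package and that func receives each dataset sample’s fields as keyword arguments:

```python
from parea import Experiment  # assumed import path

def answer_question(question: str) -> str:
    return "42"  # your application's entry point

experiment = Experiment(
    data=[{"question": "What is the meaning of life?"}],  # samples fed to func
    func=answer_question,
)
experiment.run(name="baseline")     # the experiment name is optional
print(experiment.experiment_stats)  # populated by run()
```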
parea.evals
We provide off-the-shelf evaluation metrics for general scenarios as well as special scenarios: RAG, summarization & chat.
Oftentimes, they come in the form of a factory which, for example, requires the field names/keys that identify the contexts
provided for RAG or the question asked by the user.
The general setup of an evaluation is to receive a Log
data structure and return a float or a boolean.
If it is a factory, the factory will return an evaluation function.
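For illustration, a direct eval and a factory-style eval might look like this (both function names are made up):

```python
from parea.schemas.log import Log

def exact_match(log: Log) -> float:
    # Direct eval: takes a Log, returns a float (or a boolean).
    return float(log.output == log.target)

def keyword_present_factory(field: str = "question"):
    # Factory: fix configuration up front, return the actual eval function.
    def keyword_present(log: Log) -> bool:
        return field in (log.inputs or {})
    return keyword_present
```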
parea.evals.general
General Purpose Evaluation Metrics
levenshtein
This evaluation metric measures the Levenshtein distance between the generated output and the target.
It measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the generated output to the target.
It then normalizes the distance by the length of the target or the generated output, whichever is longer.
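A sketch of the described computation; whether the packaged metric returns this normalized distance or a similarity derived from it is not specified here:

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalized_levenshtein(output: str, target: str) -> float:
    # Normalize by the longer of the two strings, as described above.
    longer = max(len(output), len(target)) or 1
    return levenshtein_distance(output, target) / longer
```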
llm_grader_factory
Parameters
- model: The model which should be used for grading. Currently, only supports OpenAI chat models.
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
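A usage sketch; the resulting function can be attached to a traced function via eval_funcs:

```python
from parea import trace
from parea.evals.general import llm_grader_factory

# Build a grader bound to gpt-4 and the "question" input field.
llm_grader_gpt4 = llm_grader_factory(model="gpt-4", question_field="question")

@trace(eval_funcs=[llm_grader_gpt4])
def answer_question(question: str) -> str:
    return "..."  # your LLM call here
```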
answer_relevancy_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- n_generations: The number of questions which should be generated. Defaults to 3.
self_check
Given that many API-based LLMs don’t reliably give access to the log probabilities of the generated tokens, assessing
the certainty of LLM predictions via perplexity isn’t possible.
The SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models paper
suggests measuring the average factuality of every sentence in a generated response. They generate additional responses
from the LLM at a high temperature and check how much every sentence in the original answer is supported by the other generations.
The intuition behind this is that if the LLM knows a fact, it’s more likely to sample it. The authors find that this
works well in detecting non-factual and factual sentences and ranking passages in terms of factuality.
The authors noted that correlation with human judgment doesn’t increase after 4-6 additional
generations when using gpt-3.5-turbo
to evaluate biography generations.
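A rough sketch of that idea, not the packaged self_check eval; the LLM calls are abstracted as callables you would supply:

```python
from typing import Callable

def self_check_score(
    answer: str,
    question: str,
    sample_llm: Callable[[str], str],          # draws one high-temperature response
    is_supported: Callable[[str, str], bool],  # is this sentence backed by that sample?
    n_samples: int = 4,
) -> float:
    # Average, over the sentences of the original answer, the fraction of
    # additional samples that support each sentence.
    samples = [sample_llm(question) for _ in range(n_samples)]
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    per_sentence = [
        sum(is_supported(sentence, sample) for sample in samples) / n_samples
        for sentence in sentences
    ]
    return sum(per_sentence) / len(per_sentence)
```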
lm_vs_lm_factuality_factory
Parameters
- examiner_model: The model which will examine the original model. Currently, only supports OpenAI chat models.
semantic_similarity_factory
Parameters
- embd_model: The model which should be used for embedding. Currently, only supports OpenAI embedding models.
Instances of factory
parea.evals.rag
RAG Specific Evaluation Metrics
context_query_relevancy_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
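A sketch of how the configured field names line up with a traced RAG function’s inputs; the retriever is stubbed out:

```python
from parea import trace
from parea.evals.rag import context_query_relevancy_factory

# The eval reads the question and contexts from the trace inputs, so the
# traced function's keyword names must match the configured fields.
relevancy = context_query_relevancy_factory(
    question_field="question",
    context_fields=["context"],
)

@trace(eval_funcs=[relevancy])
def rag_answer(question: str, context: str) -> str:
    return f"Answer based on: {context[:50]}"  # context comes from your retriever
```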
context_ranking_pointwise_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- ranking_measurement: Method to calculate ranking. Currently, only supports "average_precision".
context_ranking_listwise_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- ranking_measurement: The measurement to use for ranking. Currently, only supports "ndcg".
- n_contexts_to_rank: The number of contexts to rank listwise. Defaults to 10.
context_has_answer_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-3.5-turbo-0125".
answer_context_faithfulness_binary_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
answer_context_faithfulness_precision_factory
Parameters
- context_field: The key name/field used for the retrieved context. Defaults to "context".
answer_context_faithfulness_statement_level_factory
Parameters
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
parea.evals.chat
AI Assistant/Chatbot-Specific Evaluation Metrics
goal_success_ratio_factory
The goal success ratio is computed in three steps:
- Delineate the conversation into segments by splitting them by the goals the user wants to achieve.
- Assess if every goal has been reached.
- Calculate the average number of messages sent per segment.
Parameters
- use_output: Boolean indicating whether to use the output of the log to access the messages. Defaults to False.
- message_field: The name of the field in the log that contains the messages. Defaults to None. If None, the messages are taken from the configuration attribute.
parea.evals.summary
Evaluation Metrics for Summarization Tasks
factual_inconsistency_binary_factory
This metric is based on the finding that gpt-3.5-turbo-0301 outperforms
baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries.
Parameters
- article_field: The key name/field used for the content which should be summarized. Defaults to "article".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
factual_inconsistency_scale_factory
This metric is based on the finding that gpt-3.5-turbo-0301 leads to a higher correlation with human expert judgment when grading
the factuality of summaries on a scale from 1 to 10 than baseline methods such as SummaC and QuestEval.
Parameters
- article_field: The key name/field used for the content which should be summarized. Defaults to "article".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
likert_scale_factory
This metric is based on the finding that gpt-3.5-turbo-0301 leads to a higher correlation with human expert judgment when grading summaries on a Likert scale from 1-5 than baseline
methods. Notably, BARTScore was very competitive with gpt-3.5-turbo-0301.
parea.evals.dataset_level
This module contains pre-built dataset-level evaluation metrics.
balanced_acc_factory
Calculates the balanced accuracy of the score score_name across all the classes in the dataset.
Parameters
- score_name: The name of the score to calculate the balanced accuracy for.
parea.schemas
Data Structures / Schemas
log
This module consists of the classes which make up the Log class.
Log
- configuration: The configuration of the LLM call, if it was an LLM call.
- inputs: The key-value pairs of inputs fed into the traced function or, if it was a templated LLM call, the inputs to the prompt template.
- output: The output of the traced function or LLM call.
- target: The (optional) target/ground truth output of the traced function or LLM call.
- latency: The latency of the traced function or LLM call in seconds.
- input_tokens: The number of tokens in the inputs, if it was an LLM call.
- output_tokens: The number of tokens in the output, if it was an LLM call.
- total_tokens: The total number of tokens in the inputs and output, if it was an LLM call.
- cost: The cost, if it was an LLM call.
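These fields can also drive custom guardrail-style evals; a minimal sketch with arbitrary thresholds:

```python
from parea.schemas.log import Log

def within_budget(log: Log) -> bool:
    # Flag slow or expensive LLM calls; thresholds are arbitrary examples.
    latency_ok = (log.latency or 0.0) <= 5.0   # seconds
    cost_ok = (log.cost or 0.0) <= 0.01        # cost per call
    return latency_ok and cost_ok
```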