You can set your Parea API key via the PAREA_API_KEY environment variable, or in a .env file.
parea

Parea

The Parea object is used to initialize automatic tracing of any OpenAI call as well as to interact with the Parea API.
You should initialize it at the beginning of your LLM application with your API key (api_key). You can organize your logs and experiments by specifying a project name (project_name); otherwise they will all appear under the default project. You can also specify a cache to automatically cache any OpenAI calls.
wrap_openai_client

Wraps an instantiated OpenAI client so that its calls are automatically traced.
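A minimal setup sketch; the project name is illustrative, and the wrap_openai_client call assumes you work with an instantiated OpenAI client:

```python
import os

from openai import OpenAI
from parea import Parea

# Pass the key explicitly here; alternatively rely on PAREA_API_KEY or a .env file.
p = Parea(api_key=os.environ["PAREA_API_KEY"], project_name="my-project")

client = OpenAI()
p.wrap_openai_client(client)  # calls made with this client are now traced automatically
```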
trace

The trace decorator is used to trace a function, capture its inputs and outputs, and apply evaluation functions to its output.
It automatically attaches the current trace to the parent trace, if one exists, or sets it as the current trace.
This creates a nested trace structure, which can be viewed in the logs.
The trace decorator accepts the following parameters:

- name: The name of the trace. If not provided, the function's name will be used.
- tags: A list of tags to attach to the trace.
- metadata: A dictionary of metadata to attach to the trace.
- target: An optional ground truth/expected output for the inputs; it can be used by evaluation functions.
- end_user_identifier: An optional identifier for the end user that is using your application.
- eval_funcs_names: A list of names of evaluation functions, created in the Datasets tab, to evaluate the output of the traced function. They are applied non-blocking and asynchronously in the backend.
- eval_funcs: A list of evaluation functions, defined in your code, to evaluate the output of the traced function.
- access_output_of_func: An optional function that takes the output of the traced function and returns the value which should be used as the output for evaluation functions.
- apply_eval_frac: The fraction of times the evaluation functions should be applied. Defaults to 1.0.
- deployment_id: The deployment id of a prompt configuration.
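A short usage sketch; the function body, tag, and in-code eval are illustrative:

```python
from parea import trace

def is_concise(log) -> float:
    # toy in-code eval: rewards outputs under 280 characters
    return float(len(log.output or "") < 280)

@trace(name="generate-answer", tags=["demo"], eval_funcs=[is_concise])
def generate_answer(question: str) -> str:
    # ... call your LLM here ...
    return "42"
```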
Experiment

The Experiment class is used to define an experiment of your LLM application. It is initialized with the data to run the experiment on (data), and the entry point/function (func). You can read more about running experiments here.
Calling run executes the experiment and stores the results in the experiment_stats attribute. You can optionally specify the name of the experiment as an argument.
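A sketch of defining and running an experiment; the import path and keyword names follow this description but may differ by SDK version:

```python
from parea import Experiment  # adjust the import to your SDK version if needed

experiment = Experiment(
    data=[{"question": "What is 6 * 7?"}, {"question": "Name a prime number."}],
    func=generate_answer,  # the traced entry point from above
)
experiment.run(name="baseline")     # the experiment name is optional
print(experiment.experiment_stats)  # populated after the run finishes
```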
parea.evals

The parea.evals module provides pre-built evaluation functions and evaluation-function factories. An evaluation function receives the Log data structure and returns a float or a boolean. If it is a factory, the factory will return an evaluation function.
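For illustration, a minimal custom evaluation function with this signature (an exact-match check against the target; the Log class is described under parea.schemas below):

```python
from parea.schemas.log import Log

def exact_match(log: Log) -> float:
    """Return 1.0 when the output equals the target/ground truth, else 0.0."""
    return float(log.output == log.target)
```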
parea.evals.general

levenshtein

Compares the produced output against the target using Levenshtein distance.

llm_grader_factory

- model: The model which should be used for grading. Currently, only supports OpenAI chat models.
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
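An illustrative use of the factory pattern: create the eval from the factory, then attach it to a traced function:

```python
from parea import trace
from parea.evals.general import llm_grader_factory

llm_grader = llm_grader_factory(model="gpt-4", question_field="question")

@trace(eval_funcs=[llm_grader])
def answer(question: str) -> str:
    ...
```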
answer_relevancy_factory

- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- n_generations: The number of questions which should be generated. Defaults to 3.

self_check
It was tested with gpt-3.5-turbo to evaluate biography generations.
lm_vs_lm_factuality_factory

- examiner_model: The model which will examine the original model. Currently, only supports OpenAI chat models.

semantic_similarity_factory

- embd_model: The model which should be used for embedding. Currently, only supports OpenAI embedding models.

parea.evals.rag
context_query_relevancy_factory

- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
context_ranking_pointwise_factory

- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- ranking_measurement: Method to calculate ranking. Currently, only supports "average_precision".
context_ranking_listwise_factory

- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- ranking_measurement: The measurement to use for ranking. Currently only supports "ndcg".
- n_contexts_to_rank: The number of contexts to rank listwise. Defaults to 10.
context_has_answer_factory

- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-3.5-turbo-0125".

answer_context_faithfulness_binary_factory
- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
answer_context_faithfulness_precision_factory

- context_field: The key name/field used for the retrieved context. Defaults to "context".
answer_context_faithfulness_statement_level_factory

- question_field: The key name/field used for the question/query of the user. Defaults to "question".
- context_fields: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
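An illustrative sketch for the RAG evals: the factory defaults expect the traced function's inputs to contain the question and the retrieved context under matching field names:

```python
from parea import trace
from parea.evals.rag import context_query_relevancy_factory

context_relevancy = context_query_relevancy_factory(
    question_field="question", context_fields=["context"]
)

@trace(eval_funcs=[context_relevancy])
def rag_answer(question: str, context: str) -> str:
    # the eval reads "question" and "context" from the traced inputs
    ...
```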
parea.evals.chat
goal_success_ratio_factory

- use_output: Boolean indicating whether to use the output of the log to access the messages. Defaults to False.
- message_field: The name of the field in the log that contains the messages. Defaults to None. If None, the messages are taken from the configuration attribute.
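A sketch for the chat eval, assuming the traced function returns the conversation messages as its output so that use_output=True can read them:

```python
from parea import trace
from parea.evals.chat import goal_success_ratio_factory

goal_success = goal_success_ratio_factory(use_output=True)

@trace(eval_funcs=[goal_success])
def chat_turn(messages: list[dict]) -> list[dict]:
    # append the assistant's reply and return the full message list
    ...
```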
parea.evals.summary
factual_inconsistency_binary_factory

This eval is based on the finding that gpt-3.5-turbo-0301 outperforms baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries.
- article_field: The key name/field used for the content which should be summarized. Defaults to "article".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
factual_inconsistency_scale_factory

This eval is based on the finding that gpt-3.5-turbo-0301 leads to a higher correlation with human expert judgment when grading the factuality of summaries on a scale from 1 to 10 than baseline methods such as SummaC and QuestEval.
- article_field: The key name/field used for the content which should be summarized. Defaults to "article".
- model: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".
likert_scale_factory

This eval is based on the finding that gpt-3.5-0301 leads to the highest correlation with human expert judgment when grading summaries on a Likert scale from 1-5, compared to baseline methods. Notably, BARTScore was very competitive with gpt-3.5-0301.
parea.evals.dataset_level

balanced_acc_factory

Calculates the balanced accuracy of the score score_name across all the classes in the dataset.

- score_name: The name of the score to calculate the balanced accuracy for.
parea.schemas

log

This module contains the Log class, which is passed to evaluation functions.

Log
- configuration: The configuration of the LLM call if it was an LLM call.
- inputs: The key-value pairs of inputs fed into the traced function or, if it was a templated LLM call, the inputs to the prompt template.
- output: The output of the traced function or LLM call.
- target: The (optional) target/ground truth output of the traced function or LLM call.
- latency: The latency of the traced function or LLM call in seconds.
- input_tokens: The number of tokens in the inputs if it was an LLM call.
- output_tokens: The number of tokens in the output if it was an LLM call.
- total_tokens: The total number of tokens in the inputs and output if it was an LLM call.
- cost: The cost if it was an LLM call.
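For illustration, a hypothetical evaluation that uses these fields to flag slow or expensive calls (the thresholds are arbitrary):

```python
from parea.schemas.log import Log

def within_budget(log: Log) -> bool:
    # treat a trace as passing if it stayed under 5 s latency and $0.01 cost
    return (log.latency or 0.0) < 5.0 and (log.cost or 0.0) < 0.01
```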
Other classes defined in parea.schemas include LLMInputs, ModelParams, Message, and Role.