## Setup

Set `PAREA_API_KEY` as an environment variable, or in a `.env` file:

```bash
export PAREA_API_KEY=<your API key>
```

Install the Parea package:

```bash
pip install parea-ai
```

## `parea`

### `Parea`

```python
@define
class Parea:
    api_key: str = field(init=True, default=os.getenv("PAREA_API_KEY"))
    project_name: str = field(init=True, default="default")
    cache: Cache = field(init=True, default=None)
```

The `Parea` object is used to initialize automatic tracing of any OpenAI call as well as to interact with the Parea API. You should initialize it at the beginning of your LLM application with your API key (`api_key`). You can organize your logs and experiments by specifying a project name (`project_name`); otherwise, they will all appear under the default project. You can also specify a cache (`cache`) to automatically cache any OpenAI calls.
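
A minimal initialization sketch (the project name `"my-project"` is just a placeholder):

```python
import os

from parea import Parea

# Initialize Parea once at application startup; "my-project" is a placeholder name.
p = Parea(api_key=os.getenv("PAREA_API_KEY"), project_name="my-project")
```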

### `wrap_openai_client`

```python
def wrap_openai_client(self, client: OpenAI) -> None:
```

This method patches the OpenAI client to automatically trace any OpenAI call made through the client. You only need to call this method if your OpenAI package version is >= 1.0.0, and you are not using the module-level client. Only call this method once after initializing the OpenAI client.
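
For example (a sketch, assuming the `openai` v1 client):

```python
import os

from openai import OpenAI
from parea import Parea

p = Parea(api_key=os.getenv("PAREA_API_KEY"))

client = OpenAI()
p.wrap_openai_client(client)  # call once; calls made through `client` are now traced
```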

### `trace`

```python
def trace(
    name: Optional[str] = None,
    tags: Optional[list[str]] = None,
    metadata: Optional[dict[str, Any]] = None,
    target: Optional[str] = None,
    end_user_identifier: Optional[str] = None,
    eval_funcs_names: Optional[list[str]] = None,
    eval_funcs: Optional[list[Callable]] = None,
    access_output_of_func: Optional[Callable] = None,
    apply_eval_frac: Optional[float] = 1.0,
    deployment_id: Optional[str] = None,
):
```


The `trace` decorator is used to trace a function, capturing its inputs and outputs, as well as applying evaluation functions to its output.
It automatically attaches the current trace to the parent trace, if one exists, or sets it as the current trace.
This creates a nested trace structure, which can be viewed in the logs.

#### Parameters

- `name`: The name of the trace. If not provided, the function's name will be used.
- `tags`: A list of tags to attach to the trace.
- `metadata`: A dictionary of metadata to attach to the trace.
- `target`: An optional ground truth/expected output for the inputs, which can be used by evaluation functions.
- `end_user_identifier`: An optional identifier for the end user that is using your application.
- `eval_funcs_names`: A list of names of evaluation functions, created in the Datasets tab, to evaluate on the output of the traced function. They are applied asynchronously and in a non-blocking manner in the backend.
- `eval_funcs`: A list of evaluation functions, in your code, to evaluate on the output of the traced function.
- `access_output_of_func`: An optional function that takes the output of the traced function and returns the value which should be used as `output` of the function for evaluation functions.
- `apply_eval_frac`: The fraction of times the evaluation functions should be applied. Defaults to 1.0.
- `deployment_id`: The deployment ID of a prompt configuration.
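
A usage sketch, assuming `trace` is importable from the top-level `parea` package; the function and evaluation names below are placeholders:

```python
import os

from parea import Parea, trace

p = Parea(api_key=os.getenv("PAREA_API_KEY"))  # initialize tracing first (see above)

def is_non_empty(log) -> float:
    # Illustrative evaluation function: 1.0 if the traced output is non-empty.
    return 1.0 if (log.output or "").strip() else 0.0

@trace(name="generate-answer", tags=["demo"], eval_funcs=[is_non_empty], apply_eval_frac=0.5)
def generate_answer(question: str) -> str:
    ...  # call your LLM here; nested traced calls attach to this trace as children
```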


### `Experiment`

```python
@define
class Experiment:
    name: str = field(init=False)
    data: Iterator[Dict] = field(init=True)
    func: Callable = field(init=True)
    experiment_stats: ExperimentStatsSchema = field(init=False, default=None)
```

The `Experiment` class is used to define an experiment for your LLM application. It is initialized with the data to run the experiment on (`data`) and the entry point/function (`func`). You can read more about running experiments here.

### `run`

This method runs the experiment and saves the stats to the `experiment_stats` attribute. You can optionally specify the name of the experiment as an argument.
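
A minimal sketch, assuming `Experiment` can be imported from the top-level package and that each data row's keys are passed to `func` as keyword arguments (adjust the import to your SDK version):

```python
from parea import Experiment  # assumed import path

def answer_question(question: str) -> str:
    # Entry point of the LLM application under test (placeholder implementation).
    return f"Echo: {question}"

# Assumption: each data row's keys are passed to `func` as keyword arguments.
data = [{"question": "What is Parea?"}, {"question": "How do traces nest?"}]

experiment = Experiment(data=data, func=answer_question)
experiment.run("baseline-run")  # the experiment name argument is optional
```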

## `parea.evals`

We provide off-the-shelf evaluation metrics for general scenarios and special scenarios: RAG, summarization & chat. Oftentimes, they come in the form of a factory which, for example, requires the field names/keys identifying the contexts provided for RAG or the question asked by the user.

The general setup of an evaluation is to receive a `Log` data structure and return a float or a boolean. If it is a factory, the factory returns an evaluation function.
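
For example, a custom evaluation function has this shape (a sketch; `exact_match` is just an illustrative name):

```python
from parea.schemas.log import Log  # the Log schema is documented under parea.schemas below

def exact_match(log: Log) -> float:
    # Illustrative metric: 1.0 if the output matches the ground-truth target exactly.
    if log.output is None or log.target is None:
        return 0.0
    return float(log.output.strip() == log.target.strip())
```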

## `parea.evals.general`

General-purpose evaluation metrics.

### `levenshtein`

This evaluation metric measures the Levenshtein distance between the generated output and the target, i.e. the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the generated output into the target. The distance is then normalized by the length of the target or the generated output, whichever is longer.
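
A sketch of the computation described above (a plain dynamic-programming edit distance; not necessarily the library's exact implementation):

```python
def normalized_levenshtein_distance(output: str, target: str) -> float:
    """Normalized edit distance: 0.0 means identical, 1.0 means completely different."""
    m, n = len(output), len(target)
    dp = list(range(n + 1))  # distances for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                                # deletion
                dp[j - 1] + 1,                            # insertion
                prev + (output[i - 1] != target[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / max(m, n, 1)  # normalize by the longer string
```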

### `llm_grader_factory`

```python
def llm_grader_factory(
    model: str,
    question_field: str = "question"
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that uses an LLM to grade the response of an LLM to a given question. It is based on the paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena which introduces a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1-10. They find that GPT-4’s ratings agree as much with a human rater as a human annotator agrees with another one (>80%). Further, they observe that the agreement with a human annotator increases as the response rating gets clearer. Additionally, they investigated how much the evaluating LLM overestimated its responses and found that GPT-4 and Claude-1 were the only models that didn’t overestimate themselves.

#### Parameters

- `model`: The model which should be used for grading. Currently, only supports OpenAI chat models.
- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
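
A usage sketch, wiring the factory's output into `trace` (the model choice and function name are illustrative):

```python
from parea import trace
from parea.evals.general import llm_grader_factory

llm_grader_gpt4 = llm_grader_factory(model="gpt-4")

@trace(eval_funcs=[llm_grader_gpt4])
def answer(question: str) -> str:
    ...  # your LLM call; the grader rates the returned answer against the question
```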

### `answer_relevancy_factory`

```python
def answer_relevancy_factory(
    question_field: str = "question",
    n_generations: int = 3
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how relevant the generated response is to the given question. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to generate multiple questions that fit the generated answer and measure the cosine similarity of the generated questions with the original one.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `n_generations`: The number of questions which should be generated. Defaults to 3.

### `self_check`

Given that many API-based LLMs don’t reliably give access to the log probabilities of the generated tokens, assessing the certainty of LLM predictions via perplexity isn’t possible. The SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models paper suggests measuring the average factuality of every sentence in a generated response. They generate additional responses from the LLM at a high temperature and check how much every sentence in the original answer is supported by the other generations. The intuition behind this is that if the LLM knows a fact, it’s more likely to sample it. The authors find that this works well in detecting non-factual and factual sentences and ranking passages in terms of factuality. The authors noted that correlation with human judgment doesn’t increase after 4-6 additional generations when using gpt-3.5-turbo to evaluate biography generations.

### `lm_vs_lm_factuality_factory`

```python
def lm_vs_lm_factuality_factory(examiner_model: str = "gpt-3.5-turbo") -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the factuality of an LLM’s response to a given question. It is based on the paper LM vs LM: Detecting Factual Errors via Cross Examination which proposes using another LLM to assess an LLM response’s factuality. To do this, the examining LLM generates follow-up questions to the original response until it can confidently determine the factuality of the response. This method outperforms prompting techniques such as asking the original model, “Are you sure?” or instructing the model to say, “I don’t know,” if it is uncertain.

#### Parameters

- `examiner_model`: The model which will examine the original model. Currently, only supports OpenAI chat models.

### `semantic_similarity_factory`

```python
def semantic_similarity_factory(embd_model: str = "text-embedding-3-small") -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the semantic similarity of the generated response to the target answer/ground truth. It calculates the cosine similarity of the embeddings of the generated output and the target, and transforms the -1 to 1 range of the cosine similarity to a 0 to 1 scale by adding 1 and dividing by 2.
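
A small sketch of that rescaling, given two embedding vectors (the helper names are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_similarity_score(output_emb: list[float], target_emb: list[float]) -> float:
    # Shift the [-1, 1] cosine similarity into the [0, 1] range.
    return (cosine_similarity(output_emb, target_emb) + 1) / 2
```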

#### Parameters

- `embd_model`: The model which should be used for embedding. Currently, only supports OpenAI embedding models.

Pre-built instances of this factory:

```python
semantic_similarity_oai_3_small = semantic_similarity_factory(embd_model="text-embedding-3-small")
semantic_similarity_oai_3_large = semantic_similarity_factory(embd_model="text-embedding-3-large")
semantic_similarity_oai_ada_002 = semantic_similarity_factory(embd_model="text-embedding-ada-002")
```

## `parea.evals.rag`

RAG-specific evaluation metrics.

### `context_query_relevancy_factory`

```python
def context_query_relevancy_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"]
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how relevant the retrieved context is to the given question. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to extract any sentence from the retrieved context relevant to the query. The ratio of relevant sentences to the total number of sentences in the retrieved context is then calculated.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].

### `context_ranking_pointwise_factory`

```python
def context_ranking_pointwise_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"],
    ranking_measurement="average_precision"
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how well the retrieved contexts are ranked by relevancy to the given query by pointwise estimation of the relevancy of every context to the query. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to check if every extracted context is relevant. Then, they measure how well the contexts are ranked by calculating the mean average precision. Note that this approach considers any two relevant contexts equally important/relevant to the query.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- `ranking_measurement`: Method to calculate ranking. Currently, only supports "average_precision".

### `context_ranking_listwise_factory`

```python
def context_ranking_listwise_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"],
    ranking_measurement="ndcg",
    n_contexts_to_rank=10,
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how well the retrieved contexts are ranked by relevancy to the given query by listwise estimation of the relevancy of every context to the query. It is based on the paper Zero-Shot Listwise Document Reranking with a Large Language Model which suggests using an LLM to rerank a list of contexts and use that to evaluate how well the contexts are ranked by relevancy to the given query. The authors used a progressive listwise reordering if the retrieved contexts don’t fit into the context window of the LLM.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].
- `ranking_measurement`: The measurement to use for ranking. Currently only supports "ndcg".
- `n_contexts_to_rank`: The number of contexts to rank listwise. Defaults to 10.

### `context_has_answer_factory`

```python
def context_has_answer_factory(
    question_field: Optional[str] = "question",
    model: Optional[str] = "gpt-3.5-turbo-0125"
) -> Callable[[Log], bool]:
```

This factory creates an evaluation function that assesses whether the given context contains the answer to the given question. It is useful for measuring the performance of a model in a question-answering task via hit rate, without needing to know the correct answer.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-3.5-turbo-0125".

### `answer_context_faithfulness_binary_factory`

```python
def answer_context_faithfulness_binary_factory(
    question_field: Optional[str] = "question",
    context_field: Optional[str] = "context",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that classifies if the generated answer was faithful to the given context. It is based on the paper Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering which suggests using an LLM to flag any information in the generated answer that cannot be deduced from the given context. They find that GPT-4 is the best model for this analysis as measured by correlation with human judgment.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `context_field`: The key name/field used for the retrieved context. Defaults to "context".
- `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".

### `answer_context_faithfulness_precision_factory`

```python
def answer_context_faithfulness_precision_factory(
    context_field: Optional[str] = "context"
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that calculates how many tokens in the generated answer are also present in the retrieved context. It is based on the paper Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering which finds that this method only slightly lags behind GPT-4 and outperforms GPT-3.5-turbo (see Table 4 from the above paper).

#### Parameters

- `context_field`: The key name/field used for the retrieved context. Defaults to "context".
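
A sketch of the underlying token-precision idea (the whitespace tokenization here is a simplification; the library may tokenize differently):

```python
def answer_context_precision(answer: str, context: str) -> float:
    # Fraction of answer tokens that also appear in the retrieved context.
    answer_tokens = answer.lower().split()
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    overlap = sum(1 for tok in answer_tokens if tok in context_tokens)
    return overlap / len(answer_tokens)
```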

### `answer_context_faithfulness_statement_level_factory`

```python
def answer_context_faithfulness_statement_level_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"]
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the faithfulness of the generated answer to the given context by measuring how many statements from the generated answer can be inferred from the given context. It is based on the paper RAGAS: Automated Evaluation of Retrieval Augmented Generation which suggests using an LLM to create a list of all statements in the generated answer and assessing whether the given context supports each statement.

#### Parameters

- `question_field`: The key name/field used for the question/query of the user. Defaults to "question".
- `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to ["context"].

## `parea.evals.chat`

AI assistant/chatbot-specific evaluation metrics.

### `goal_success_ratio_factory`

```python
def goal_success_ratio_factory(
    use_output: Optional[bool] = False,
    message_field: Optional[str] = None
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the success ratio of a goal-oriented conversation. Typically, a user interacts with a chatbot or AI assistant to achieve specific goals. This motivates measuring the quality of a chatbot by counting how many messages a user has to send before they reach their goal. One can further break this down by successful and unsuccessful goals to analyze user & LLM behavior.

Concretely:

  1. Delineate the conversation into segments by splitting them by the goals the user wants to achieve.
  2. Assess if every goal has been reached.
  3. Calculate the average number of messages sent per segment.

#### Parameters

- `use_output`: Boolean indicating whether to use the output of the log to access the messages. Defaults to False.
- `message_field`: The name of the field in the log that contains the messages. Defaults to None. If None, the messages are taken from the `configuration` attribute.
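
A sketch of the aggregation behind steps 2-3 above, assuming the conversation has already been segmented into goals by an LLM (the `GoalSegment` structure and helper names are hypothetical, not the library's API):

```python
from dataclasses import dataclass

@dataclass
class GoalSegment:
    # One conversation segment corresponding to a single user goal (hypothetical structure).
    messages: list[str]
    goal_reached: bool

def goal_success_ratio(segments: list[GoalSegment]) -> float:
    # Step 2, aggregated: fraction of goal segments that were achieved.
    if not segments:
        return 0.0
    return sum(s.goal_reached for s in segments) / len(segments)

def avg_messages_per_goal(segments: list[GoalSegment]) -> float:
    # Step 3: average number of messages the user needed per goal segment.
    if not segments:
        return 0.0
    return sum(len(s.messages) for s in segments) / len(segments)
```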

## `parea.evals.summary`

Evaluation metrics for summarization tasks.

### `factual_inconsistency_binary_factory`

```python
def factual_inconsistency_binary_factory(
    article_field: Optional[str] = "article",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that classifies if a summary is factually inconsistent with the original text. It is based on the paper ChatGPT as a Factual Inconsistency Evaluator for Text Summarization which suggests using an LLM to assess the factuality of a summary by measuring how consistent the summary is with the original text, posed as a binary classification. They find that gpt-3.5-turbo-0301 outperforms baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries.

#### Parameters

- `article_field`: The key name/field used for the content which should be summarized. Defaults to "article".
- `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".

### `factual_inconsistency_scale_factory`

```python
def factual_inconsistency_scale_factory(
    article_field: Optional[str] = "article",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that grades the factual consistency of a summary with the article on a scale from 1 to 10. It is based on the paper ChatGPT as a Factual Inconsistency Evaluator for Text Summarization which finds that using gpt-3.5-turbo-0301 leads to a higher correlation with human expert judgment when grading the factuality of summaries on a scale from 1 to 10 than baseline methods such as SummaC and QuestEval.

#### Parameters

- `article_field`: The key name/field used for the content which should be summarized. Defaults to "article".
- `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".

### `likert_scale_factory`

```python
def likert_scale_factory(
    article_field: Optional[str] = "article",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that grades the quality of a summary on a Likert scale from 1-5 along the dimensions of relevance, consistency, fluency, and coherence. It is based on the paper Human-like Summarization Evaluation with ChatGPT which finds that using gpt-3.5-0301 leads to a higher correlation with human expert judgment when grading summaries on a Likert scale from 1-5 than baseline methods. Notably, BARTScore was very competitive with gpt-3.5-0301.

#### Parameters

- `article_field`: The key name/field used for the content which should be summarized. Defaults to "article".
- `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to "gpt-4".

## `parea.evals.dataset_level`

This module contains pre-built dataset-level evaluation metrics.

### `balanced_acc_factory`

```python
def balanced_acc_factory(score_name: str) -> Callable[[EvaluatedLog], float]:
```

This factory creates an evaluation function that calculates the balanced accuracy of the score `score_name` across all the classes in the dataset.

#### Parameters

- `score_name`: The name of the score to calculate the balanced accuracy for.
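
A sketch of the underlying balanced-accuracy computation, assuming the scores are 0/1 correctness indicators and the class labels come from, e.g., the logs' targets (this is an illustration, not necessarily the library's exact implementation):

```python
from collections import defaultdict

def balanced_accuracy(class_labels: list[str], scores: list[float]) -> float:
    # Average the per-class mean score, so each class contributes equally
    # regardless of how many examples it has.
    per_class: dict[str, list[float]] = defaultdict(list)
    for label, score in zip(class_labels, scores):
        per_class[label].append(score)
    if not per_class:
        return 0.0
    return sum(sum(v) / len(v) for v in per_class.values()) / len(per_class)
```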


## `parea.schemas`

Data structures / schemas.

### `log`

Consists of the classes which make up the `Log` class.

### `Log`

```python
@define
class Log:
    configuration: LLMInputs = LLMInputs()
    inputs: Optional[dict[str, str]] = None
    output: Optional[str] = None
    target: Optional[str] = None
    latency: Optional[float] = 0.0
    input_tokens: Optional[int] = 0
    output_tokens: Optional[int] = 0
    total_tokens: Optional[int] = 0
    cost: Optional[float] = 0.0
```

This class encapsulates the logs from a traced function or LLM call. It consists of the following attributes:

- `configuration`: The configuration of the LLM call if it was an LLM call.
- `inputs`: The key-value pairs of inputs fed into the traced function or, if it was a templated LLM call, the inputs to the prompt template.
- `output`: The output of the traced function or LLM call.
- `target`: The (optional) target/ground truth output of the traced function or LLM call.
- `latency`: The latency of the traced function or LLM call in seconds.
- `input_tokens`: The number of tokens in the inputs if it was an LLM call.
- `output_tokens`: The number of tokens in the output if it was an LLM call.
- `total_tokens`: The total number of tokens in the inputs and output if it was an LLM call.
- `cost`: The cost if it was an LLM call.

### `LLMInputs`

```python
@define
class LLMInputs:
    model: Optional[str] = None
    provider: Optional[str] = None
    model_params: Optional[ModelParams] = None
    messages: Optional[list[Message]] = None
    functions: Optional[list[Any]] = None
    function_call: Optional[Union[str, dict[str, str]]] = None
```

All the input variables which were fed into the LLM call.

### `ModelParams`

```python
@define
class ModelParams:
    temp: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_length: Optional[int] = None
    response_format: Optional[dict] = None
```

The parameters used for the LLM call.

### `Message`

```python
@define
class Message:
    content: str
    role: Role = Role.user
```

### `Role`

```python
class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"
    example_user = "example_user"
    example_assistant = "example_assistant"
```