# Python

## Setup

Set `PAREA_API_KEY` as an environment variable, or in a `.env` file:

```bash theme={null}
export PAREA_API_KEY=<your API key>
```

Install the Parea package:

```bash theme={null}
pip install parea-ai
```

## `parea`

### `Parea`

```python theme={null}
@define
class Parea:
    api_key: str = field(init=True, default=os.getenv("PAREA_API_KEY"))
    project_name: str = field(init=True, default="default")
    cache: Cache = field(init=True, default=None)
```

The `Parea` object is used to initialize automatic tracing of any OpenAI call as well as to interact with the Parea API.
You should initialize it at the beginning of your LLM application with your API key (`api_key`). You can organize your
logs and experiments by specifying a project name (`project_name`), otherwise they will all appear under the
`default` project. You can also specify a `cache` to automatically cache any OpenAI calls.

#### `wrap_openai_client`

```python theme={null}
def wrap_openai_client(self, client: OpenAI) -> None:
```

This method patches the OpenAI client to automatically trace any OpenAI call made through the client. You only need to
call this method if your OpenAI package version is >= 1.0.0, and you are not using the module-level client.
Only call this method once after initializing the OpenAI client.

### `trace`

```python theme={null}
def trace(
    name: Optional[str] = None,
    tags: Optional[list[str]] = None,
    metadata: Optional[dict[str, Any]] = None,
    target: Optional[str] = None,
    end_user_identifier: Optional[str] = None,
    eval_funcs_names: Optional[list[str]] = None,
    eval_funcs: Optional[list[Callable]] = None,
    access_output_of_func: Optional[Callable] = None,
    apply_eval_frac: Optional[float] = 1.0,
    deployment_id: Optional[str] = None,
):
```

The `trace` decorator is used to trace a function, capture its inputs and outputs, and apply evaluation functions to its output.
It automatically attaches the current trace to the parent trace, if one exists, or sets it as the current trace.
This creates a nested trace structure, which can be viewed in the logs.

#### Parameters

* `name`: The name of the trace. If not provided, the function's name will be used.
* `tags`: A list of tags to attach to the trace.
* `metadata`: A dictionary of metadata to attach to the trace.
* `target`: An optional ground truth/expected output for the inputs, which can be used by evaluation functions.
* `end_user_identifier`: An optional identifier for the end user that is using your application.
* `eval_funcs_names`: A list of names of evaluation functions, created in the Datasets tab, to evaluate on the output of the traced function. They will be applied non-blocking and asynchronously in the backend.
* `eval_funcs`: A list of evaluation functions, in your code, to evaluate on the output of the traced function.
* `access_output_of_func`: An optional function that takes the output of the traced function and returns the value which should be used as `output` of the function for evaluation functions.
* `apply_eval_frac`: The fraction of times the evaluation functions should be applied. Defaults to 1.0.
* `deployment_id`: The deployment ID of a prompt configuration.
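
To make the nesting behaviour concrete, here is a minimal, standard-library sketch of a tracing decorator. It is not Parea's implementation: `toy_trace`, `ToyTrace`, `root_traces`, and `is_short` are hypothetical names invented for illustration, and only the `name` and `eval_funcs` ideas from the list above are mirrored.

```python
import contextvars
import functools
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


# Hypothetical stand-in for a trace record; not Parea's internal data structure.
@dataclass
class ToyTrace:
    name: str
    inputs: dict
    output: Any = None
    scores: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Holds the trace that is currently being recorded, if any.
_current: contextvars.ContextVar = contextvars.ContextVar("current_trace", default=None)
root_traces: list = []  # top-level traces, in the order they started


def toy_trace(name: Optional[str] = None, eval_funcs: Optional[list] = None):
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = ToyTrace(name=name or func.__name__, inputs={"args": args, "kwargs": kwargs})
            parent = _current.get()
            # Attach to the parent trace if one exists, otherwise start a new root.
            (parent.children if parent is not None else root_traces).append(record)
            token = _current.set(record)
            try:
                record.output = func(*args, **kwargs)
            finally:
                _current.reset(token)  # restore the parent as the current trace
            # Apply every evaluation function to the finished record.
            for eval_func in eval_funcs or []:
                record.scores[eval_func.__name__] = eval_func(record)
            return record.output
        return wrapper
    return decorator


def is_short(record: ToyTrace) -> bool:
    return len(str(record.output)) < 20


@toy_trace()
def retrieve(query: str) -> str:
    return f"docs for {query}"


@toy_trace(name="pipeline", eval_funcs=[is_short])
def answer(query: str) -> str:
    return retrieve(query).upper()


result = answer("cats")
```

After the call, `root_traces[0]` is the `pipeline` record, and its `children` list holds the `retrieve` record, which is the kind of parent/child structure the description above refers to.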

### `Experiment`

```python theme={null}
@define
class Experiment:
    name: str = field(init=False)
    data: Iterator[Dict] = field(init=True)
    func: Callable = field(init=True)
    experiment_stats: ExperimentStatsSchema = field(init=False, default=None)
```

The `Experiment` class is used to define an experiment of your LLM application. It is initialized with the data to run the
experiment on (`data`), and the entry point/function (`func`). You can read more about running experiments
[here](/welcome/getting-started-evaluation).

#### `run`

This method runs the experiment and saves the stats to the `experiment_stats` attribute. You can optionally specify the
`name` of the experiment as an argument.

# `parea.evals`

We provide off-the-shelf evaluation metrics for general scenarios and special scenarios: RAG, summarization & chat.
Many of them come in the form of a factory that takes, for example, the field names/keys identifying the contexts
provided for RAG or the question asked by the user.

The general setup of an evaluation is to receive a `Log` data structure and return a float or a boolean.
If it is a factory, the factory will return an evaluation function.
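
As a rough illustration of the two shapes, the sketch below defines a plain evaluation function and a factory. The `Log` dataclass here is a local stand-in carrying only a subset of the documented fields so the snippet runs on its own, and `exact_match` and `contains_field_factory` are hypothetical names, not part of the package.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Local stand-in with a subset of the documented `Log` fields (assumption:
# real code would use the package's own `Log` class instead).
@dataclass
class Log:
    inputs: Optional[dict] = None
    output: Optional[str] = None
    target: Optional[str] = None


def exact_match(log: Log) -> bool:
    """Plain evaluation function: receives a Log and returns a boolean."""
    return log.output is not None and log.output == log.target


def contains_field_factory(field_name: str = "question") -> Callable[[Log], float]:
    """Hypothetical factory: the returned function scores 1.0 when the output
    contains the input stored under `field_name`, and 0.0 otherwise."""

    def contains_field(log: Log) -> float:
        value = (log.inputs or {}).get(field_name, "")
        return 1.0 if value and value in (log.output or "") else 0.0

    return contains_field


log = Log(inputs={"question": "capital of France"}, output="Paris", target="Paris")
echoes_question = contains_field_factory(field_name="question")
```

Here `exact_match(log)` is `True`, while `echoes_question(log)` is `0.0` because the output does not repeat the question.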

## `parea.evals.general`

General Purpose Evaluation Metrics

### `levenshtein`

This evaluation metric measures [the Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between the generated output and the target.
It counts the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the generated output to the target,
and then normalizes that count by the length of the target or the generated output, whichever is longer.
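
The computation can be sketched with the standard dynamic-programming algorithm. This is an illustrative version, not the library's code; in particular, it returns the normalized distance itself, and whether the packaged metric reports this value or a similarity derived from it is not stated here.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(output: str, target: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(output), len(target))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein_distance(output, target) / longest
```

For example, turning `"kitten"` into `"sitting"` takes 3 edits, and the longer string has 7 characters, so the normalized distance is 3/7.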

### `llm_grader_factory`

```python theme={null}
def llm_grader_factory(
    model: str,
    question_field: str = "question"
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that uses an LLM to grade the response of an LLM to a given question.
It is based on the paper [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
which introduces a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1-10.
They find that GPT-4's ratings agree as much with a human rater as a human annotator agrees with another one (>80%).
Further, they observe that the agreement with a human annotator increases as the response rating gets clearer.
Additionally, they investigated how much the evaluating LLM overestimated its responses and found that GPT-4 and
Claude-1 were the only models that didn't overestimate themselves.

#### Parameters

* `model`: The model which should be used for grading. Currently, only supports OpenAI chat models.
* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.

### `answer_relevancy_factory`

```python theme={null}
def answer_relevancy_factory(
    question_field: str = "question",
    n_generations: int = 3
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how relevant the generated response is to the given question.
It is based on the paper [RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217)
which suggests using an LLM to generate multiple questions that fit the generated answer and measure the cosine
similarity of the generated questions with the original one.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `n_generations`: The number of questions which should be generated. Defaults to 3.

### `self_check`

Given that many API-based LLMs don't reliably give access to the log probabilities of the generated tokens, assessing
the certainty of LLM predictions via perplexity isn't possible.
The [SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models](https://arxiv.org/abs/2303.08896) paper
suggests measuring the average factuality of every sentence in a generated response. They generate additional responses
from the LLM at a high temperature and check how much every sentence in the original answer is supported by the other generations.
The intuition behind this is that if the LLM knows a fact, it's more likely to sample it. The authors find that this
works well in detecting non-factual and factual sentences and ranking passages in terms of factuality.
The authors noted that correlation with human judgment doesn't increase after 4-6 additional
generations when using `gpt-3.5-turbo` to evaluate biography generations.

### `lm_vs_lm_factuality_factory`

```python theme={null}
def lm_vs_lm_factuality_factory(examiner_model: str = "gpt-3.5-turbo") -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the factuality of an LLM's response to a given question.
It is based on the paper [LM vs LM: Detecting Factual Errors via Cross Examination](https://arxiv.org/abs/2305.13281) which proposes using
another LLM to assess an LLM response's factuality. To do this, the examining LLM generates follow-up questions to the
original response until it can confidently determine the factuality of the response.
This method outperforms prompting techniques such as asking the original model, "Are you sure?" or instructing the
model to say, "I don't know," if it is uncertain.

#### Parameters

* `examiner_model`: The model which will examine the original model. Currently, only supports OpenAI chat models.

### `semantic_similarity_factory`

```python theme={null}
def semantic_similarity_factory(embd_model: str = "text-embedding-3-small") -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the semantic similarity of the generated response to the target.
It calculates the cosine similarity of the embeddings of the generated output and the target answer/ground truth. It transforms
the -1 to 1 scale of the cosine similarity to a 0 to 1 scale by adding 1 and dividing by 2.

#### Parameters

* `embd_model`: The model which should be used for embedding. Currently, only supports OpenAI embedding models.
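
The rescaling from the -1 to 1 range to the 0 to 1 range can be sketched as follows. The vectors below are tiny hand-written stand-ins for real embeddings, and the function names are chosen for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def rescaled_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1] by adding 1 and halving."""
    return (cosine_similarity(u, v) + 1) / 2
```

With this mapping, identical directions score 1.0, orthogonal vectors score 0.5, and opposite directions score 0.0.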

#### Instances of factory

```python theme={null}
semantic_similarity_oai_3_small = semantic_similarity_factory(embd_model="text-embedding-3-small")
semantic_similarity_oai_3_large = semantic_similarity_factory(embd_model="text-embedding-3-large")
semantic_similarity_oai_ada_002 = semantic_similarity_factory(embd_model="text-embedding-ada-002")
```

## `parea.evals.rag`

RAG Specific Evaluation Metrics

### `context_query_relevancy_factory`

```python theme={null}
def context_query_relevancy_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"]
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how relevant the retrieved context is to the given question.
It is based on the paper [RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217)
which suggests using an LLM to extract any sentence from the retrieved context relevant to the query. Then, calculate
the ratio of relevant sentences to the total number of sentences in the retrieved context.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to `["context"]`.

### `context_ranking_pointwise_factory`

```python theme={null}
def context_ranking_pointwise_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"],
    ranking_measurement="average_precision"
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how well the retrieved contexts are ranked by relevancy to the given query
by pointwise estimation of the relevancy of every context to the query. It is based on the paper
[RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217) which suggests using an LLM
to check if every extracted context is relevant. Then, they measure how well the contexts are ranked by calculating the
mean average precision. Note that this approach considers any two relevant contexts equally important/relevant to the query.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to `["context"]`.
* `ranking_measurement`: Method to calculate ranking. Currently, only supports `"average_precision"`.
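
For a single query, average precision over binary relevance labels can be sketched as below. In the metric described above, the labels would come from an LLM judging each context; here they are hand-written, and the function name is chosen for illustration.

```python
from typing import Sequence


def average_precision(relevant: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant context."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0
```

For the ranking `[True, False, True]`, the precisions at the relevant ranks are 1/1 and 2/3, so the score is their mean, 5/6. Because labels are binary, swapping two relevant contexts leaves the score unchanged.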

### `context_ranking_listwise_factory`

```python theme={null}
def context_ranking_listwise_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"],
    ranking_measurement="ndcg",
    n_contexts_to_rank=10,
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures how well the retrieved contexts are ranked by relevancy to the given query
by listwise estimation of the relevancy of every context to the query. It is based on the paper
[Zero-Shot Listwise Document Reranking with a Large Language Model](https://arxiv.org/abs/2305.02156) which suggests using an LLM
to rerank a list of contexts and use that to evaluate how well the contexts are ranked by relevancy to the given query.
The authors used a progressive listwise reordering if the retrieved contexts don't fit into the context window of the LLM.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to `["context"]`.
* `ranking_measurement`: The measurement to use for ranking. Currently only supports `"ndcg"`.
* `n_contexts_to_rank`: The number of contexts to rank listwise. Defaults to `10`.
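
NDCG itself can be sketched in a few lines. The graded relevance values below are hand-written; how the evaluation derives grades from the LLM's reranked order is not specified here, so treat the input format as an assumption of this sketch.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given order divided by the DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A list that is already sorted by relevance, such as `[3, 2, 1]`, scores 1.0, while the reversed order `[1, 2, 3]` scores roughly 0.79.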

### `context_has_answer_factory`

```python theme={null}
def context_has_answer_factory(
    question_field: Optional[str] = "question",
    model: Optional[str] = "gpt-3.5-turbo-0125"
) -> Callable[[Log], bool]:
```

This factory creates an evaluation metric that assesses whether the given context has the answer to the given question.
It is useful to measure the performance of a model in a question-answering task by measuring Hit Rate without the need to know the correct answer.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to `"gpt-3.5-turbo-0125"`.

### `answer_context_faithfulness_binary_factory`

```python theme={null}
def answer_context_faithfulness_binary_factory(
    question_field: Optional[str] = "question",
    context_field: Optional[str] = "context",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that classifies if the generated answer was faithful to the given context.
It is based on the paper [Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering](https://arxiv.org/abs/2307.16877)
which suggests using an LLM to flag any information in the generated answer that cannot be deduced from the given context.
They find that GPT-4 is the best model for this analysis as measured by correlation with human judgment.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `context_field`: The key name/field used for the retrieved context. Defaults to `"context"`.
* `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to `"gpt-4"`.

### `answer_context_faithfulness_precision_factory`

```python theme={null}
def answer_context_faithfulness_precision_factory(
    context_field: Optional[str] = "context"
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that calculates how many tokens in the generated answer are also present in the retrieved context.
It is based on the paper [Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering](https://arxiv.org/abs/2307.16877)
which finds that this method only slightly lags behind GPT-4 and outperforms GPT-3.5-turbo (see Table 4 from the above paper).

#### Parameters

* `context_field`: The key name/field used for the retrieved context. Defaults to `"context"`.
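
A token-overlap precision of this kind can be sketched as follows. Lowercasing and whitespace tokenization are assumptions of this sketch; the tokenizer actually used is not described here.

```python
def token_precision(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur in the context."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0  # nothing to check in an empty answer
    context_tokens = set(context.lower().split())
    matches = sum(token in context_tokens for token in answer_tokens)
    return matches / len(answer_tokens)
```

For the answer `"Paris is big"` and the context `"paris is the capital"`, two of the three answer tokens appear in the context, giving 2/3.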

### `answer_context_faithfulness_statement_level_factory`

```python theme={null}
def answer_context_faithfulness_statement_level_factory(
    question_field: str = "question",
    context_fields: List[str] = ["context"]
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the faithfulness of the generated answer to the given context
by measuring how many statements from the generated answer can be inferred from the given context. It is based on the paper
[RAGAS: Automated Evaluation of Retrieval Augmented Generation](https://arxiv.org/abs/2309.15217) which suggests using an LLM
to create a list of all statements in the generated answer and assessing whether the given context supports each statement.

#### Parameters

* `question_field`: The key name/field used for the question/query of the user. Defaults to `"question"`.
* `context_fields`: A list of key names/fields used for the retrieved contexts. Defaults to `["context"]`.

## `parea.evals.chat`

AI Assistant/Chatbot-Specific Evaluation Metrics

### `goal_success_ratio_factory`

```python theme={null}
def goal_success_ratio_factory(
    use_output: Optional[bool] = False,
    message_field: Optional[str] = None
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that measures the success ratio of a goal-oriented conversation.
Typically, a user interacts with a chatbot or AI assistant to achieve specific goals.
This motivates measuring the quality of a chatbot by counting how many messages a user has to send before they reach their goal.
One can further break this down by successful and unsuccessful goals to analyze user & LLM behavior.

Concretely:

1. Delineate the conversation into segments by splitting them by the goals the user wants to achieve.
2. Assess if every goal has been reached.
3. Calculate the average number of messages sent per segment.

#### Parameters

* `use_output`: Boolean indicating whether to use the output of the log to access the messages. Defaults to `False`.
* `message_field`: The name of the field in the log that contains the messages. Defaults to `None`. If `None`, the messages are taken from the `configuration` attribute.
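
The last of the three steps listed above can be sketched once the conversation has been segmented. In the metric, the segmentation and the reached/not-reached judgment would come from an LLM; here they are hand-labelled, and `GoalSegment` and `average_messages_per_segment` are hypothetical names. How the library turns this into its returned score is not specified here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


# Hypothetical container for one goal-oriented stretch of a conversation.
@dataclass
class GoalSegment:
    user_messages: List[str]
    reached: bool


def average_messages_per_segment(
    segments: List[GoalSegment], only_reached: Optional[bool] = None
) -> float:
    """Mean number of user messages per segment, optionally filtered by outcome."""
    selected = [s for s in segments if only_reached is None or s.reached == only_reached]
    return mean(len(s.user_messages) for s in selected) if selected else 0.0


segments = [
    GoalSegment(["book a flight", "to Berlin"], reached=True),
    GoalSegment(["cancel it", "no, the other one", "cancel!", "never mind"], reached=False),
    GoalSegment(["thanks"], reached=True),
]
```

Passing `only_reached=True` or `only_reached=False` gives the breakdown by successful and unsuccessful goals: 1.5 messages per reached goal versus 4 for the unreached one in this toy conversation.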

## `parea.evals.summary`

Evaluation Metrics for Summarization Tasks

### `factual_inconsistency_binary_factory`

```python theme={null}
def factual_inconsistency_binary_factory(
    article_field: Optional[str] = "article",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that classifies if a summary is factually inconsistent with the original text.
It is based on the paper [ChatGPT as a Factual Inconsistency Evaluator for Text Summarization](https://arxiv.org/abs/2303.15621)
which suggests using an LLM to assess the factuality of a summary by measuring how consistent the summary is with
the original text, posed as a binary classification. They find that `gpt-3.5-turbo-0301` outperforms
baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries.

#### Parameters

* `article_field`: The key name/field used for the content which should be summarized. Defaults to `"article"`.
* `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to `"gpt-4"`.

### `factual_inconsistency_scale_factory`

```python theme={null}
def factual_inconsistency_scale_factory(
    article_field: Optional[str] = "article",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that grades the factual consistency of a summary with the article on a scale from 1 to 10.
It is based on the paper [ChatGPT as a Factual Inconsistency Evaluator for Text Summarization](https://arxiv.org/abs/2303.15621)
which finds that using `gpt-3.5-turbo-0301` leads to a higher correlation with human expert judgment when grading
the factuality of summaries on a scale from 1 to 10 than baseline methods such as SummaC and QuestEval.

#### Parameters

* `article_field`: The key name/field used for the content which should be summarized. Defaults to `"article"`.
* `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to `"gpt-4"`.

### `likert_scale_factory`

```python theme={null}
def likert_scale_factory(
    article_field: Optional[str] = "article",
    model: Optional[str] = "gpt-4",
) -> Callable[[Log], float]:
```

This factory creates an evaluation function that grades the quality of a summary on a Likert scale from 1-5 along
the dimensions of relevance, consistency, fluency, and coherence. It is based on the paper
[Human-like Summarization Evaluation with ChatGPT](https://arxiv.org/abs/2304.02554) which finds that using `gpt-3.5-0301`
leads to a higher correlation with human expert judgment when grading summaries on a Likert scale from 1-5 than baseline
methods. Notably, [BARTScore](https://arxiv.org/abs/2106.11520) was very competitive with `gpt-3.5-0301`.

#### Parameters

* `article_field`: The key name/field used for the content which should be summarized. Defaults to `"article"`.
* `model`: The model which should be used for grading. Currently, only supports OpenAI chat models. Defaults to `"gpt-4"`.

## `parea.evals.dataset_level`

This module contains pre-built dataset-level evaluation metrics.

### `balanced_acc_factory`

```python theme={null}
def balanced_acc_factory(score_name: str) -> Callable[[EvaluatedLog], float]:
```

This factory creates an evaluation function that calculates the balanced accuracy of the score `score_name` across all
the classes in the dataset.

#### Parameters

* `score_name`: The name of the score to calculate the balanced accuracy for.
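
Balanced accuracy is conventionally defined as the mean of the per-class recall. The sketch below follows that standard definition on plain lists of labels; how `balanced_acc_factory` reads targets and scores from the evaluated logs is not shown here, so the input format is an assumption.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(targets: Sequence[Hashable], predictions: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, prediction in zip(targets, predictions):
        total[target] += 1
        correct[target] += int(target == prediction)
    if not total:
        return 0.0
    return sum(correct[label] / total[label] for label in total) / len(total)
```

On targets `["a", "a", "a", "b"]` with every prediction equal to `"a"`, plain accuracy is 0.75, but recall is 1.0 for class `a` and 0.0 for class `b`, so balanced accuracy is 0.5.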

# `parea.schemas`

Data Structures / Schemas

## `log`

Consists of classes which make up the `Log` class.

### `Log`

```python theme={null}
@define
class Log:
    configuration: LLMInputs = LLMInputs()
    inputs: Optional[dict[str, str]] = None
    output: Optional[str] = None
    target: Optional[str] = None
    latency: Optional[float] = 0.0
    input_tokens: Optional[int] = 0
    output_tokens: Optional[int] = 0
    total_tokens: Optional[int] = 0
    cost: Optional[float] = 0.0
```

This class encapsulates the logs from a traced function or LLM call. It consists of the following attributes:

* `configuration`: The configuration of the LLM call if it was an LLM call.
* `inputs`: The key-value pairs of inputs fed into the traced function or, if it was a templated LLM call, the inputs to the prompt template.
* `output`: The output of the traced function or LLM call.
* `target`: The (optional) target/ground truth output of the traced function or LLM call.
* `latency`: The latency of the traced function or LLM call in seconds.
* `input_tokens`: The number of tokens in the inputs if it was an LLM call.
* `output_tokens`: The number of tokens in the output if it was an LLM call.
* `total_tokens`: The total number of tokens in the inputs and output if it was an LLM call.
* `cost`: The cost if it was an LLM call.

### `LLMInputs`

```python theme={null}
@define
class LLMInputs:
    model: Optional[str] = None
    provider: Optional[str] = None
    model_params: Optional[ModelParams] = None
    messages: Optional[list[Message]] = None
    functions: Optional[list[Any]] = None
    function_call: Optional[Union[str, dict[str, str]]] = None
```

All the input variables which were fed into the LLM call.

### `ModelParams`

```python theme={null}
@define
class ModelParams:
    temp: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_length: Optional[int] = None
    response_format: Optional[dict] = None
```

The parameters used for the LLM call.

### `Message`

```python theme={null}
@define
class Message:
    content: str
    role: Role = Role.user
```

### `Role`

```python theme={null}
class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"
    example_user = "example_user"
    example_assistant = "example_assistant"
```
