Test & evaluate your LLM application
Start by collecting a small set of test samples, each consisting of the inputs to your application and the expected output (the target).
While 10 samples are sufficient to get started, it’s useful to continue to add more samples whenever you identify new failure cases from your production traffic.
The goal of evaluation is to assess each component of your application and improve them over time.
In practice, it's better to assume that your data is noisy, the LLM is imperfect, and the evaluation methods are a little bit wrong.
You apply an evaluation metric by attaching it via the trace decorator to your function.
The metric is then executed automatically on the function's output, non-blocking, in the background, after the function execution has finished.
You can apply evaluations at any “level” of your application by attaching them to the corresponding function decorated with trace.
This is useful to understand the quality at different levels of granularity (“Was the right context extracted?” vs. “Was the answer correct?”).
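For instance, with the Python SDK you could attach one evaluation to the retrieval step and another to the answer step. This is only a minimal sketch: the eval_funcs parameter of the trace decorator and the log.output attribute are assumptions based on the description above, and the two toy metrics are illustrative placeholders for real evaluation functions.

```python
from parea import trace

# Toy evaluation functions; each receives a log object describing the traced call.
# (Writing custom evaluation functions is covered in more detail further below.)
def context_not_empty(log) -> float:
    return float(bool(log.output))

def answer_is_concise(log) -> float:
    return float(len(log.output or "") < 300)

@trace(eval_funcs=[context_not_empty])   # "Was the right context extracted?"
def retrieve_context(question: str) -> str:
    return "..."  # e.g., a vector-store lookup

@trace(eval_funcs=[answer_is_concise])   # "Was the answer correct?"
def answer_question(question: str) -> str:
    context = retrieve_context(question)
    return f"Answer based on: {context}"  # e.g., an LLM call using the context
```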
Parea provides pre-built evaluation metrics for common use cases:
General Purpose Evaluations
- levenshtein: calculates the number of character edits needed for the generated output to match the target, normalized by the length of the output
- llm_grader: leverages a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1-10
- answer_relevancy: measures how relevant the generated response is to the given question
- self_check: measures how self-consistent the LLM call is when generating multiple responses
- lm_vs_lm_factuality: uses another LLM to examine the original LLM response for factuality
- semantic_similarity: calculates the cosine similarity between the output and the ground truth
RAG Specific Evaluations
- context_query_relevancy: calculates the percentage of sentences in the context that are relevant to the query
- context_ranking_pointwise: measures how well the retrieved contexts are ranked by relevancy to the given query, using pointwise estimation
- context_ranking_listwise: measures how well the retrieved contexts are ranked by relevancy to the given query, using listwise estimation
- context_has_answer: classifies whether the retrieved context contains the answer to the query
- answer_context_faithfulness_binary: classifies whether the answer is faithful to the context
- answer_context_faithfulness_precision: calculates how many tokens in the generated answer are also present in the retrieved context
- answer_context_faithfulness_statement_level: calculates the percentage of statements from the generated answer that can be inferred from the context
Chatbot Specific Evaluations
- goal_success_ratio: measures how many turns a user has to converse with your AI assistant on average to achieve a goal
Summarization Specific Evaluations
- factual_inconsistency_binary: classifies whether a summary is factually inconsistent with the original text
- factual_inconsistency_scale: grades the factual consistency of a summary with the article on a scale from 1 to 10
- likert_scale: grades the quality of a summary on a Likert scale from 1-5 along the dimensions of relevance, consistency, fluency, and coherence
To define your own evaluation, write a function that accepts the log parameter.
All evaluation functions accept the log parameter, which provides all the needed information to perform an evaluation.
Evaluation functions are expected to return floating point scores or booleans.
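For example, a simple exact-match evaluation might look like the following sketch. It assumes the Log object exposes output and target fields (see the schema definition below) and that the import path for Log is as shown.

```python
from parea.schemas import Log  # assumed import path for the Log type

def exact_match(log: Log) -> float:
    """Score 1.0 if the generated output matches the target exactly, else 0.0."""
    return float((log.output or "").strip() == (log.target or "").strip())
```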
Log Schema Definition
You can return None/null from the evaluation function to skip the evaluation for that log.
If you want to provide a reason for the score, you can return an EvaluationResult object with a reason attribute.
Updating the above example:
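A sketch of the updated eval, assuming EvaluationResult takes name, score, and reason fields and is importable as shown:

```python
from parea.schemas import EvaluationResult, Log  # assumed import paths

def exact_match(log: Log) -> EvaluationResult:
    output = (log.output or "").strip()
    target = (log.target or "").strip()
    score = float(output == target)
    reason = "output matches target" if score else f"expected {target!r}, got {output!r}"
    return EvaluationResult(name="exact_match", score=score, reason=reason)
```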
The schema of the EvaluationResult object is as follows:
EvaluationResult Definition
An evaluation function can also return a list of EvaluationResult objects to attach multiple scores to a single log.
You define an experiment by calling the experiment method of the Parea client p and run it by calling the run method of the experiment.
You can optionally specify the name of the run by passing it to the run method.
Alternatively, you can run an experiment from the CLI instead of calling the run method. In that case, you can optionally specify the run name of the experiment using the --run_name flag in the CLI.
By default, experiments are saved in the default project.
You can specify a project name when you initialize the Parea client.
This will automatically create a new project with that name if it didn’t exist before.
Note that the name of the project is only allowed to contain alphanumeric characters, dashes, and underscores.
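Putting it together, a minimal sketch that reuses the exact_match eval from above. The project name, dataset, and function are illustrative, and the exact keyword arguments of Parea, experiment, and run are assumptions based on the description in this section.

```python
import os
from parea import Parea, trace

# project_name kwarg assumed; creates the project if it does not exist yet
p = Parea(api_key=os.environ["PAREA_API_KEY"], project_name="my-llm-app")

@trace(eval_funcs=[exact_match])
def greet(name: str) -> str:
    return f"Hello {name}"  # stand-in for your LLM call

# each dict provides the function's inputs plus the expected output as `target`
data = [
    {"name": "Ada", "target": "Hello Ada"},
    {"name": "Alan", "target": "Hello Alan"},
]

p.experiment(name="greeting-quality", data=data, func=greet).run()  # optionally pass a run name
```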
To evaluate sub-steps of your application, apply the trace decorator to the respective functions with the corresponding eval.
This will create scores for every sub-step when the entrypoint function runs (no matter whether you run an experiment or call the function directly).
To access different targets/expected outputs for sub-steps, you can pass the target as a dictionary when defining the experiment.
Note that this will convert the target to a string, so you will need to convert it back to a dictionary in the evaluation function in order to access the sub-step target.
See Python & TypeScript examples below.
Example: Testing Sub-steps
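Below is a condensed sketch of the idea rather than the full example: the target is defined as a dictionary with one entry per sub-step, and each eval parses the stringified target back into a dictionary before picking out its own key. The key names and the JSON encoding of the target string are assumptions.

```python
import json
from parea import trace
from parea.schemas import Log  # assumed import path

def context_is_correct(log: Log) -> float:
    target = json.loads(log.target)  # dict target arrives serialized as a string (assumed JSON)
    return float(target["context"] in (log.output or ""))

def answer_is_correct(log: Log) -> float:
    target = json.loads(log.target)
    return float((log.output or "").strip() == target["answer"])

@trace(eval_funcs=[context_is_correct])
def retrieve(question: str) -> str:
    return "..."  # retrieval sub-step

@trace(eval_funcs=[answer_is_correct])
def answer(question: str) -> str:
    return f"Answer using {retrieve(question)}"  # entrypoint calling the sub-step

# the experiment target is a dict covering both sub-steps
data = [{
    "question": "What is Parea?",
    "target": {"context": "Parea docs", "answer": "An LLM evaluation platform"},
}]
```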
To run each sample multiple times, specify the n_trials parameter when defining an experiment:
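For example, continuing the hypothetical experiment sketched above, to run each sample three times:

```python
# every input is executed 3 times, making score variance visible
p.experiment(name="greeting-quality", data=data, func=greet, n_trials=3).run()
```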
To evaluate an experiment across the entire dataset, use the dataset_level_evals (Python) or datasetLevelEvals (TypeScript) parameter when defining an experiment:
Dataset-level evaluation functions receive a list of EvaluatedLog objects and are expected to return a single floating point score or a boolean.
The EvaluatedLog object is a subclass of the Log object with the additional scores attribute.
The scores will be attached to the experiment and can be viewed in the overview table and in the detailed view of the experiment.
EvaluatedLog Schema Definition
Dataset Level Evaluation Example (full)
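Here is a condensed sketch rather than the full example: a dataset-level eval that reports the fraction of samples whose exact_match score passed, continuing the hypothetical experiment from above. It assumes EvaluatedLog.scores holds score objects with name and score fields and that the import path is as shown.

```python
from typing import List
from parea.schemas import EvaluatedLog  # assumed import path

def fraction_exact_matches(logs: List[EvaluatedLog]) -> float:
    # count logs where the per-sample `exact_match` eval returned 1.0
    hits = sum(
        1
        for log in logs
        if any(s.name == "exact_match" and s.score >= 1.0 for s in (log.scores or []))
    )
    return hits / len(logs) if logs else 0.0

p.experiment(
    name="greeting-quality",
    data=data,
    func=greet,
    dataset_level_evals=[fraction_exact_matches],
).run()
```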
Pre-built Dataset Level Evaluations
- balanced_accuracy: the balanced accuracy of a score for the experiment
You can share an experiment with others by clicking the Share button on the top right of the experiment page.
This will generate a link following the format https://app.parea.ai/public-experiments/<org_slug>/<project_name>/<experiment_uuid> which anyone can access.
You can compare all public experiments in a project under https://app.parea.ai/public-experiments/<org_slug>/<project_name>.
Integrating Parea with DVC
Setup DVC
Integrate Parea with DVC
The integration will create a .parea directory with a dvc.yaml and a metrics.json file if they don't exist.
The dvc.yaml file points to the metrics.json file, and the metrics.json file contains the metrics of the experiments.
Both files are necessary for DVC. You can always re-run the command to check if the integration is set up properly.
You can read more about dvc exp show and dvc exp apply in the DVC documentation.
You can compare multiple experiments by selecting them and clicking the Compare button.
This will open up a new view which shows a high-level comparison of the experiment scores at the top and a side-by-side view of the individual results of the experiments on the same samples at the bottom.
The high-level comparison consists of 2-3 cards.
The first card is only shown if two experiments are compared; it displays, for every evaluation metric, how the average and standard deviation have changed, as well as the number of improvements and regressions.
The second card compares the evaluation metric averages for every experiment as a bar plot.
The third card shows the distribution plot of the selected evaluation metric as a histogram.
You can control the parallelism of an experiment with the n_workers (Python) / nWorkers (TypeScript) parameter when defining an experiment.
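For example, to limit the hypothetical experiment from above to four parallel workers:

```python
# at most 4 samples are evaluated concurrently
p.experiment(name="greeting-quality", data=data, func=greet, n_workers=4).run()
```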
If you run experiments in an environment with an already running event loop (e.g., a Jupyter notebook), you need to install nest-asyncio and apply it:
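For example, after installing the package with pip install nest-asyncio:

```python
import nest_asyncio

nest_asyncio.apply()  # patch the running event loop so experiments can run inside the notebook
```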