Add evaluation metrics to the playground.
To add a metric, click the Evaluation metrics button in a prompt session.
Here, you will have the option to select an existing metric or create a new one.
To create a new metric, click Register new auto-eval metric. This allows you to create a metric based on your specific inputs.
Next, find the metric you want to use based on your use case. Each metric has its required and optional variables.
Your prompt template must have a variable for any required inputs. For example, the LLM Grader metric expects your prompt to have a {{question}} variable. If your variable is named something else, you can select which variable to associate with the question field from the drop-down menu.
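For instance, a prompt template that satisfies this requirement might look like the sketch below (illustrative only; the wording and the {{context}} variable are assumptions, not a built-in template):

```
Use the context below to answer the question.

Context: {{context}}

Question: {{question}}
```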
Click Register once you are done, and that metric will be enabled.
Example - Auto-eval from New Prompt Session
Click Create New Session.
You will see a RAG example prompt pre-populated. The context has been pre-populated with a snippet from Nike's 10-K filings, and our question to ask the LLM is: Which operating segment contributed least to total Nike brand revenue in fiscal 2023?
Click Compare to see what the LLM's response is. Then click Evaluation metrics and Register new auto-eval metric.
Select RAG as our use case, and let's start with Context Relevance as our metric. Click Setup. This metric requires a question and a context input.
Since our prompt template already has these inputs, we can click Register. Now, this metric will always be available.
To finish, click Set eval metric(s) to enable this metric in our current Playground session.
The Compare button will now say Compare & evaluate; click it. First, a new LLM result will be generated. Then the session will automatically save your results, and finally the evaluation score will be computed.
You will see your score at the top of the Prompt section and the Inference section. My score was 'Context Relevance-b6CK' score: 0.08. What was yours?
Next, let's add a target variable to our prompt to represent the correct answer. In the Input section, click the blue button to Add inputs to test collection.
Enter Rag Example for our new collection. Where it says Define a target, paste the correct answer, then click Create collection.
Now register a second metric: select General as our use case and Answer Matches Target - LLM Judge as our metric. Once again, no changes are needed since our prompt template's input variable names match the metric's required inputs. Click Register, then Set eval metric(s).
We now have two metrics attached to this session: Context Relevance and Answer Matches Target - LLM Judge. Since we do not need to call the LLM provider again, instead of clicking Compare & evaluate, select the down chevron ⌄ icon next to it and select Evaluate.
This will run the evaluation metrics on our existing LLM response. Congrats, that's it! You should now see two scores. My scores are 'Context Relevance-b6CK' score: 0.08 / 'Answer Matches Target - LLM Judge-rFun' score: 0.00.
Did your prompt also fail the Answer Matches Target eval? CHALLENGE: Update your prompt to get it to pass. 🤓 Want to explore more? Try our Rag Tutorial.
After selecting metrics, click Set eval metric(s) to attach them to your current session.
To create a new custom evaluation function, click Create new custom metric.
Your custom metric must follow the eval_fun signature: def eval_fun(log: Log) -> float.
To ensure that your evaluation metrics are reusable across the entire Parea ecosystem, and with any LLM model or use case, we introduced the log parameter. All evaluation functions accept the log parameter, which provides all the information needed to perform an evaluation.
Evaluation functions are expected to return a floating-point score or a boolean. As long as your function follows this signature and returns a float or boolean, your new metric will be valid.
A simple example could be:
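(This sketch assumes log.output holds the LLM's text response; the import path is an assumption, and inside the UI editor Log may already be in scope.)

```python
from parea.schemas.log import Log  # assumed import path; inside the UI editor this may not be needed


def eval_fun(log: Log) -> float:
    """Return 1.0 if the response mentions revenue at all, else 0.0 (a deliberately trivial check)."""
    output = log.output or ""
    return float("revenue" in output.lower())
```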
Log Schema Definition
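The authoritative definition appears in the UI; the sketch below is only a simplified approximation of the fields most eval functions touch, and the exact names and types may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Log:  # simplified approximation; consult the real schema for the authoritative definition
    inputs: Optional[Dict[str, str]] = None   # prompt template variables, e.g. {"question": ..., "context": ...}
    output: Optional[str] = None              # the LLM's response (stringified when it is a function call)
    target: Optional[str] = None              # the expected answer, if one was defined in a test collection
    configuration: Optional[Any] = None       # model and prompt configuration used for the call
```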
Supported Python Packages
numpy
openai
anthropic
marvin
spacy
nltk
parea-ai
tiktoken
If you need a package that is not listed, use the Request new package button in the UI, or reach out.
Eval metric template code
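The UI provides the actual template; the sketch below only shows the general shape such a function takes, using the simplified Log fields sketched above, with placeholder scoring logic you would replace.

```python
def eval_fun(log: Log) -> float:
    """Custom eval metric template: score the LLM response between 0.0 and 1.0 (or return a bool)."""
    output = log.output or ""
    target = log.target or ""
    # Prompt template variables are available via log.inputs, e.g. (log.inputs or {}).get("question").

    # Placeholder scoring: substring match against the target when one is defined.
    if target:
        return float(target.strip().lower() in output.strip().lower())
    return 0.0
```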
Example function call response
When the LLM responds with a function call, the output will contain a function field, and that dictionary will always have a name field and an arguments field.
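An illustrative response of that shape is sketched below; the function name and arguments are made up, and arguments is typically a JSON-encoded string.

```python
# Illustrative only: the exact wrapping of function calls in log.output may differ.
function_call_response = [
    {
        "function": {
            "name": "get_segment_revenue",  # hypothetical function name
            "arguments": '{"segment": "Converse", "fiscal_year": 2023}',
        }
    }
]
```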
Example eval function testing returned function call
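Below is a sketch of an eval function that checks such a response, assuming log.output holds the JSON-encoded function call(s) in the shape shown above; the expected function name is hypothetical.

```python
import json

from parea.schemas.log import Log  # assumed import path; inside the UI editor this may not be needed


def eval_fun(log: Log) -> float:
    """Return 1.0 if the LLM called the expected function with non-empty arguments, else 0.0."""
    try:
        calls = json.loads(log.output or "[]")
    except json.JSONDecodeError:
        return 0.0

    for call in calls if isinstance(calls, list) else [calls]:
        function = call.get("function", {})
        if function.get("name") == "get_segment_revenue":  # hypothetical expected function name
            try:
                arguments = json.loads(function.get("arguments") or "{}")
            except json.JSONDecodeError:
                return 0.0
            return float(bool(arguments))
    return 0.0
```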