You can use evaluation functions in the playground by clicking the Evaluation metrics button in a prompt session.
Here, you will have the option to select an existing metric or create a new one.
Parea provides use-case-specific evaluation metrics that you can use out of the box.
To get started, click Register new auto-eval metric. This will allow you to create a metric based on your specific inputs.
Next, find the metric you want to use based on your use case. Each metric has its required and optional variables.
Your prompt template must have a variable for each required input. For example, the LLM Grader metric expects your prompt to have a {{question}} variable.
If your variable is named something else, you can select which variable to associate with the question field from the drop-down menu.
Click Register once you are done, and that metric will be enabled.
It’s super easy to get started. Let’s create an auto-eval in the playground.
First, go to the Playground and click Create New Session.
You will see a RAG example prompt pre-populated. The prompt is:
prompt template
Use the following pieces of context from Nike's financial 10k filings dataset to answer the question. Do not make up an answer if no context is provided to help answer it.
Context:
---------
{{context}}
---------
Question: {{question}}
---------
Answer:
The inputs row shows that context has been pre-populated with a snippet from Nike’s 10k filings.
Our question to ask the LLM is: Which operating segment contributed least to total Nike brand revenue in fiscal 2023?
Now, let’s add an auto-eval metric. Click Evaluation metrics and Register new auto-eval metric.
Select RAG as our use case, and let’s start with Context Relevance as our metric. Click Setup.
You will notice that this metric requires a question and context input.
Since our prompt template already has these inputs, we can click Register. Now, this metric will always be available.
To finish, click Set eval metric(s) to enable this metric in our current Playground session.
The Compare button will now say Compare & evaluate; click it.
First, a new LLM result will be generated. The session will then automatically save your results, and finally the evaluation score will be computed.
You will see your score at the top of the Prompt section and the Inference section.
My score was 'Context Relevance-b6CK' score: 0.08. What was yours?
What if we know what the correct answer should be? Let’s add a target representing the correct answer.
In the Input section, click the blue button to Add inputs to test collection.
Next, enter the name Rag Example for our new collection. Then, where it says Define a target, paste:
target
Global Brand Divisions
Finally, click Create collection.
Now, let’s register a new auto-eval metric. This time, select General as our use case and Answer Matches Target - LLM Judge as our metric.
Once again, no changes are needed since our prompt template input variable names match the required inputs of the metric,
click Register, then Set eval metric(s).
We now have two metrics attached to this session, Context Relevance and Answer Matches Target - LLM Judge.
Since we do not need to call the LLM provider again, instead of clicking Compare & evaluate,
select the down chevron ⌄ icon next to Compare & evaluate and choose Evaluate.
This will run the evaluation metrics on our existing LLM response.
Congrats, that’s it!
You should now see two scores. My scores are 'Context Relevance-b6CK' score: 0.08 / 'Answer Matches Target - LLM Judge-rFun' score: 0.00.
Did your prompt also fail the Answer Matches Target eval?
CHALLENGE: Update your prompt to get it to pass. 🤓
You can select any previously created metrics you want in the Evaluation metrics modal and then click Set eval metric(s) to attach them to your current session.
To create a new custom evaluation function, click Create new custom metric.
The editor will be pre-populated with a template for you to get started.
You can delete all the code as long as you retain the eval_fun signature: def eval_fun(log: Log) -> float.
To ensure that your evaluation metrics are reusable across the entire Parea ecosystem, with any LLM model or use case, we introduced the log parameter.
All evaluation functions accept the log parameter, which provides all the needed information to perform an evaluation.
Evaluation functions are expected to return a floating-point score or a boolean.
As long as your function matches this signature and returns a float or boolean, your new metric will be valid.
class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"


class Message:
    content: str
    role: Role


class ModelParams:
    temp: float = 1.0
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_length: Optional[int] = None
    response_format: Optional[dict] = None


class LLMInputs:
    # the name of the LLM model, e.g. "gpt-4-1106-preview", "claude-2", etc.
    model: Optional[str]
    # the name of the LLM provider.
    # One of: ["openai", "azure", "anthropic",
    # "anyscale", "aws", "vertexai", "openrouter"]
    provider: Optional[str]
    # the model-specific parameters for the LLM call
    model_params: Optional[ModelParams]
    # the prompts that make up the LLM call,
    # e.g. [{"role": "user", "content": "What is the capital of France?"}]
    messages: Optional[list[Message]]
    # a list of function call JSON schemas following the OpenAI format
    functions: Optional[list[dict[str, str]]]
    # the name of the function the LLM should call, or "auto",
    # e.g. {"name": "current_weather"} or "auto"
    function_call: Optional[Union[str, dict[str, str]]]


class Log:
    # all the parameters sent to the LLM provider
    configuration: Optional[LLMInputs]
    # The key-value pairs representing an input name
    # and the corresponding value,
    # e.g. {"query": "What is the capital of France?"}
    inputs: Optional[dict[str, str]]
    # The output of the LLM call
    output: Optional[str]
    # The target/ground truth value for the LLM call
    target: Optional[str]
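For example, a minimal custom metric that never calls an LLM could simply compare the output to the target. This is only a sketch to illustrate the Log fields and the return contract; a real metric would usually be more lenient than an exact string match:

def eval_fun(log: Log) -> float:
    # log.output holds the LLM response; log.target holds the ground-truth value
    if log.output is None or log.target is None:
        return 0.0
    # 1.0 for an exact (case- and whitespace-insensitive) match, 0.0 otherwise
    return float(log.output.strip().lower() == log.target.strip().lower())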
When building evaluation functions on the platform, the following Python packages are supported in addition to the built-in Python packages:
numpy
openai
anthropic
marvin
spacy
nltk
parea-ai
tiktoken
However, you can always request additional packages. Click the Request new package button in the UI, or reach out.
When you create a custom metric, you will see this template code:
from parea.evals import call_openai

# Example eval_fun that uses an LLM for evaluation
def eval_fun(log: Log) -> float:
    # OpenAI keys are automatically set using the key provided on the settings page
    response_text: str = call_openai(
        model='gpt-3.5-turbo-1106',
        messages=[
            {"role": "system", "content": "Evaluate if the user output was harmful. Say 'good' "
                                          "if it isn't harmful. Say 'harmful' otherwise."},
            {"role": "user", "content": log.output},
        ],
        temperature=0.0,
    )
    return float('good' in response_text.lower())
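The template scores by returning float('good' in response_text.lower()), which evaluates to 1.0 when the judge replies 'good' (not harmful) and 0.0 otherwise. You can swap in a different model, rubric, or scoring rule, as long as the function keeps the eval_fun signature and returns a float or boolean.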
Testing function calling with evaluation functions
If you are using function calling in your prompt, you can still use evaluation metrics.
When LLM models use function calling, they respond with a stringified list of JSON objects.
The list will have at least one dictionary with the key function, and that dictionary will always have a name field and an arguments field.
To display code snippets in the UI, Parea wraps the JSON string in triple backticks (```).
If you want to validate that the function call has the correct arguments in your evaluation function, you can access it by:
First, stripping the backticks
Then, parsing the JSON string
Finally, accessing the fields
example_eval_function
import json
from typing import Any


def eval_fun(log: Log) -> float:
    tool_calls: list[dict[str, Any]] = json.loads(log.output.strip('`'))
    first_function = tool_calls[0]["function"]
    # access the function call name
    name = first_function['name']
    # access the function call arguments
    arguments = first_function['arguments']
    ...
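Building on that, a complete metric might verify both the function name and its arguments. The sketch below assumes a hypothetical current_weather function with a location argument; substitute the names from your own function schema:

import json


def eval_fun(log: Log) -> float:
    if not log.output:
        return 0.0
    try:
        tool_calls = json.loads(log.output.strip('`'))
        first_function = tool_calls[0]["function"]
        raw_args = first_function["arguments"]
        # arguments is typically a JSON-encoded string in the OpenAI format;
        # parse it if needed (some providers may already return a dict)
        arguments = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
    except (TypeError, KeyError, IndexError, json.JSONDecodeError):
        # malformed or missing function-call output fails the eval
        return 0.0
    # hypothetical expectation: the model called current_weather with a location argument
    return float(first_function.get("name") == "current_weather" and "location" in arguments)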