Finding the best prompt for a given input is not always easy. Testing whether a prompt template works well across different inputs is harder. And deciding which of many prompt templates performs best across different inputs is harder still.

For this purpose, we created the Lab. The Lab lets you iterate quickly on your prompts and the models you use for them by visualizing the responses of your prompts across multiple inputs side-by-side. As you change your prompts or models, we automatically version them, so you never lose that one good prompt. Additionally, in the Lab you can run test cases and evaluation metrics to get a more complete view of how your prompts are performing.

Lab

Prompts & models

You can define templated prompts and use them with different models. We currently support OpenAI, Anthropic, Azure OpenAI, and Anyscale (LLama2 & CodeLLama) models. You define variables in your prompt via {{variable}} syntax, e.g. Tell me the sentiment of {{tweet}}.
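As a rough illustration, the sketch below shows how such a template could be rendered against a set of input values. It is not the Lab's internal implementation, just a plain Python substitution of {{variable}} placeholders.

```python
import re

def render_prompt(template: str, inputs: dict) -> str:
    """Substitute {{variable}} placeholders with values from `inputs`.

    Illustrative sketch only, not the Lab's actual templating code.
    """
    def replace(match: re.Match) -> str:
        key = match.group(1).strip()
        if key not in inputs:
            raise KeyError(f"Missing value for template variable '{key}'")
        return inputs[key]

    return re.sub(r"\{\{(.*?)\}\}", replace, template)


template = "Tell me the sentiment of {{tweet}}"
print(render_prompt(template, {"tweet": "I love this new feature!"}))
# -> Tell me the sentiment of I love this new feature!
```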

OpenAI function calling

For OpenAI function calling, we have an editor in which you can define functions and attach them to your prompts. Whenever you edit the functions, we automatically version them. Here, too, you can use {{variable}} syntax to parameterize the function definitions with your input variables.
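For example, a function definition in the OpenAI function-calling schema could embed a {{variable}} placeholder that is filled in per input row before the request is sent. The function name and fields below are hypothetical, chosen only to illustrate the idea.

```python
import json

# Hypothetical function definition; {{unit}} is a template variable that
# gets substituted from the row's inputs before the call is made.
get_weather = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city, e.g. San Francisco",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit, defaulting to {{unit}}",
            },
        },
        "required": ["location"],
    },
}

# Rendering the templated definition against one row's inputs
# (simple string substitution for illustration).
rendered = json.loads(json.dumps(get_weather).replace("{{unit}}", "celsius"))
print(rendered["parameters"]["properties"]["unit"]["description"])
# -> Temperature unit, defaulting to celsius
```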

Inputs to your templates

Every row in the Lab is one input (a set of key-value pairs) to your prompt template. You can define the key-value pairs for the variables in your prompt either manually in the text box or by importing them from a test case collection in the test hub.
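Conceptually, each row supplies one set of values for the template's variables, and the same template is rendered once per row. A minimal sketch of that idea (again, plain Python, not the Lab's code):

```python
template = "Tell me the sentiment of {{tweet}}"

# Each row is one set of key-value inputs for the template's variables.
rows = [
    {"tweet": "Shipping was fast, very happy."},
    {"tweet": "The app keeps crashing on startup."},
    {"tweet": "It's fine, nothing special."},
]

for row in rows:
    prompt = template
    for key, value in row.items():
        prompt = prompt.replace("{{" + key + "}}", value)
    print(prompt)
```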

Evaluating

You can either manually rate responses as ‘good’ or ‘bad’, or use evaluation functions to score them. Manual feedback and evaluation scores are automatically aggregated for every column.
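To make the idea concrete, here is a sketch of what an evaluation function and the per-column aggregation could look like. The function name, scoring rule, and responses below are made up for illustration; they are not functions shipped with the Lab.

```python
from statistics import mean

def sentiment_label_present(response: str) -> float:
    """Hypothetical evaluation function: 1.0 if the response contains an
    explicit sentiment label, else 0.0."""
    labels = ("positive", "negative", "neutral")
    return 1.0 if any(label in response.lower() for label in labels) else 0.0

# Responses produced by one prompt/model column across all input rows.
column_responses = [
    "The sentiment is positive.",
    "Negative: the user reports a crash.",
    "Hard to say.",
]

scores = [sentiment_label_present(r) for r in column_responses]
print(f"column average score: {mean(scores):.2f}")  # -> 0.67
```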