Synthetic Data Generation for Q&A Tasks
We will use Instructor in TypeScript to generate synthetic data for a question-answering task.
Joschka Braun on Jul 2, 2024
Synthetic data are a great way to bootstrap test data for an AI application when real production data are unavailable or cannot be used. In this tutorial, we will use Instructor in TypeScript to generate synthetic data for a question-answering task. Instructor is designed to simplify generating structured responses from LLM APIs. It does so by patching the model provider's API client and relying on features such as tool use, function calling, or JSON mode. A user-specified Zod schema then defines the shape of the response and is used to validate what the model returns.
To install Instructor, Zod, and the OpenAI SDK, run the following command:

npm install @instructor-ai/instructor openai zod
Using Instructor amounts to three steps:
- Patch the OpenAI client with Instructor
- Define a Zod schema for the structured response
- Call the chat.completions.create method with the Zod schema as the response_model
Patching the OpenAI client is easy with Instructor:
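As a minimal sketch of this step, the snippet below wraps an OpenAI client with Instructor. The helper name createInstructorClient is hypothetical, and the choice of "TOOLS" mode is an assumption; other modes such as "FUNCTIONS" or "JSON" may suit your model better.

```typescript
import Instructor from "@instructor-ai/instructor";
import OpenAI from "openai";

// Hypothetical helper (not part of Instructor): builds a patched client
// from an explicitly supplied API key.
function createInstructorClient(apiKey: string) {
  const oai = new OpenAI({ apiKey });
  // `mode` selects how structured output is requested from the model.
  // "TOOLS" (tool calling) is an assumption made for this sketch.
  return Instructor({ client: oai, mode: "TOOLS" });
}
```

In practice you would typically pass process.env.OPENAI_API_KEY rather than hard-coding a key.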
Next, we define a Zod schema for the question-answer pairs and require the answer to be a number, which makes it simple to compare against our application's output when testing.
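A minimal sketch of such a schema is shown here. The field names question and answer are assumptions chosen for illustration; the two safeParse calls only demonstrate that a non-numeric answer is rejected.

```typescript
import { z } from "zod";

// One question-answer pair; the answer must be a number.
// Field names are illustrative assumptions.
const QuestionAnswerPair = z.object({
  question: z.string(),
  answer: z.number(),
});
type QuestionAnswerPair = z.infer<typeof QuestionAnswerPair>;

// A numeric answer passes validation, a string answer does not.
const valid = QuestionAnswerPair.safeParse({ question: "What is 7 times 6?", answer: 42 });
const invalid = QuestionAnswerPair.safeParse({ question: "What is 7 times 6?", answer: "forty-two" });
console.log(valid.success, invalid.success); // true false
```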
Note that we can optionally specify a description for each attribute. These descriptions help steer the synthetic data generation because they are passed to the model as part of the schema. To generate a list of question-answer pairs, we define a second schema that wraps the list.
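The sketch below adds descriptions with Zod's describe method and wraps the pairs in a list schema. The description texts and the field name question_answer_pairs are assumptions. The list is wrapped in an object because tool and function calling expect an object at the top level of the schema.

```typescript
import { z } from "zod";

// Descriptions are attached to the schema and can guide the model.
const QuestionAnswerPair = z.object({
  question: z.string().describe("A question whose answer is a single number"),
  answer: z.number().describe("The numeric answer to the question"),
});

// Wrapper object holding the list of pairs (field name is an assumption).
const QuestionAnswerPairs = z.object({
  question_answer_pairs: z
    .array(QuestionAnswerPair)
    .describe("The generated question-answer pairs"),
});
type QuestionAnswerPairs = z.infer<typeof QuestionAnswerPairs>;
```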
Finally, using Instructor to generate the synthetic data is as simple as calling the chat.completions.create method and specifying the Zod schema as the response_model. We get back a response of type QuestionAnswerPairs and can save it as our Q&A dataset.
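The call described above might look like the following sketch. The function name generateQAPairs, the model name "gpt-4o", the prompt wording, and the "TOOLS" mode are all assumptions; only the response_model shape (a Zod schema plus a name) follows Instructor's documented usage.

```typescript
import Instructor from "@instructor-ai/instructor";
import OpenAI from "openai";
import { z } from "zod";

const QuestionAnswerPairs = z.object({
  question_answer_pairs: z.array(
    z.object({ question: z.string(), answer: z.number() })
  ),
});

// Hypothetical helper: asks the model for `count` pairs about `topic`.
async function generateQAPairs(topic: string, count: number) {
  // Created inside the function so that loading this file needs no API key.
  // `new OpenAI()` reads OPENAI_API_KEY from the environment.
  const client = Instructor({ client: new OpenAI(), mode: "TOOLS" });

  const response = await client.chat.completions.create({
    model: "gpt-4o", // assumption: any model that supports tool calling
    messages: [
      {
        role: "user",
        content: `Generate ${count} question-answer pairs about ${topic}. Every answer must be a number.`,
      },
    ],
    // Instructor validates the model output against this schema.
    response_model: { schema: QuestionAnswerPairs, name: "QuestionAnswerPairs" },
  });

  return response.question_answer_pairs;
}
```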
This will generate the requested number of Q&A pairs (count) on the given topic (topic) and return them as a structured response.
To customize this to your use case, you will want to adapt this prompt to include information on your use case.
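One way to make that adaptation easier is to build the prompt from a few parameters. The sketch below is only an illustration: the buildPrompt function, its options, and the prompt wording are hypothetical and not part of Instructor.

```typescript
// Hypothetical options for assembling a generation prompt.
interface PromptOptions {
  topic: string;
  count: number;
  // Optional background on your use case, e.g. product docs or example questions.
  context?: string;
}

function buildPrompt({ topic, count, context }: PromptOptions): string {
  const lines = [
    `Generate ${count} question-answer pairs about ${topic}.`,
    "Each answer must be a single number.",
  ];
  if (context) {
    lines.push(`Base the questions on the following information:\n${context}`);
  }
  return lines.join("\n");
}
```

The returned string can be used as the content of the user message in the chat.completions.create call.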
Fully Working Code
Below you can see the fully working code:
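The script below is a sketch that assembles the three steps into one file; it is a reconstruction under assumptions rather than the author's original listing. The model name, the "TOOLS" mode, the example topic, the field names, and the output file qa_dataset.json are all hypothetical choices.

```typescript
import { writeFileSync } from "node:fs";
import Instructor from "@instructor-ai/instructor";
import OpenAI from "openai";
import { z } from "zod";

// Step 2: schemas for a single pair and for the list of pairs.
const QuestionAnswerPair = z.object({
  question: z.string().describe("A question whose answer is a single number"),
  answer: z.number().describe("The numeric answer to the question"),
});
const QuestionAnswerPairs = z.object({
  question_answer_pairs: z.array(QuestionAnswerPair),
});
type QuestionAnswerPairs = z.infer<typeof QuestionAnswerPairs>;

async function generateQAPairs(topic: string, count: number): Promise<QuestionAnswerPairs> {
  // Step 1: patch the OpenAI client (reads OPENAI_API_KEY from the environment).
  const client = Instructor({ client: new OpenAI(), mode: "TOOLS" });

  // Step 3: request a response that conforms to the schema.
  const response = await client.chat.completions.create({
    model: "gpt-4o", // assumption
    messages: [
      {
        role: "user",
        content: `Generate ${count} question-answer pairs about ${topic}. Every answer must be a number.`,
      },
    ],
    response_model: { schema: QuestionAnswerPairs, name: "QuestionAnswerPairs" },
  });

  // Parsing again drops any keys that are not part of the schema.
  return QuestionAnswerPairs.parse(response);
}

// Writes the pairs to a JSON file so they can be reused as a test dataset.
function saveDataset(data: QuestionAnswerPairs, path: string): void {
  writeFileSync(path, JSON.stringify(data.question_answer_pairs, null, 2), "utf-8");
}

async function main(): Promise<void> {
  const data = await generateQAPairs("unit conversions", 5); // hypothetical topic
  saveDataset(data, "qa_dataset.json");
  console.log(`Saved ${data.question_answer_pairs.length} pairs to qa_dataset.json`);
}

// Only call the API when a key is configured.
if (process.env.OPENAI_API_KEY) {
  main().catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
} else {
  console.log("Set OPENAI_API_KEY to generate data.");
}
```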