A Systematic Workflow to Build Production-Ready LLM Applications
Practical workflow focusing on continuous improvement and data-driven decision-making.
Joel Alexander on Jul 17, 2024
This post outlines a practical workflow to help teams move from prototypes to production, focusing on continuous improvement and data-driven decision-making.
1. Hypothesis Testing with Prompt Engineering
Start simple. Begin with one of the leading foundation models and test your hypothesis using basic prompt engineering. The goal here isn’t perfection but establishing a baseline and confirming that the LLM can produce reasonable responses for your use case. If you already have a prompt/chain in place, skip to the next step.
Don’t spend too much time engineering your prompt at this stage. After you collect more data, you can make hypothesis-driven tweaks.
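As a sketch, the hypothesis test at this stage can be as small as a single prompt template filled with a question and some context. The template wording and the `build_prompt` helper below are illustrative assumptions, not a prescribed prompt:

```python
# Minimal baseline: one plain prompt template, no prompt-engineering tricks.
PROMPT_TEMPLATE = """You are a support assistant for our product.
Answer the user's question using only the context below.
If the context is insufficient, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, context: str) -> str:
    """Fill the baseline template with a question and retrieved context."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "How do I reset my API key?",
    "API keys can be rotated under Settings > API.",
)
```

Send `prompt` to whichever foundation model you are evaluating; the point is only to confirm the model produces reasonable responses before investing further.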
2. Dataset Creation from Users
Use an observability tool to collect real user questions and responses. It’s okay if you aren’t collecting feedback yet; first, you want a diverse dataset to iterate on.
Tools like Parea can simplify this process. Use our py/ts SDK to quickly instrument your code, and then use your logs to create a dataset. The next step is to establish an evaluation metric.
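A minimal sketch of turning raw logs into a dataset, assuming a simple log shape with `input`/`output` fields (the records and field names are made up for illustration; with Parea you would pull these from your instrumented logs instead):

```python
import json

# Hypothetical logged request/response records collected from real users.
logs = [
    {"input": "What plans do you offer?", "output": "We offer Free and Pro plans."},
    {"input": "What plans do you offer?", "output": "We offer Free and Pro plans."},
    {"input": "How do I cancel?", "output": "Go to Billing and click Cancel."},
]

def logs_to_dataset(records):
    """Deduplicate logged questions into a dataset of question/answer pairs."""
    seen, dataset = set(), []
    for r in records:
        key = r["input"].strip().lower()
        if key in seen:
            continue  # keep one example per unique user question
        seen.add(key)
        dataset.append({"question": r["input"], "reference_answer": r["output"]})
    return dataset

dataset = logs_to_dataset(logs)

# Persist as JSONL so experiments can reference the same dataset later.
with open("dataset.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```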
Developing effective evaluation metrics is crucial but challenging. Start with something directional to help narrow your focus. For RAG (Retrieval-Augmented Generation) applications, consider these common evaluation techniques:
- Relevance: How well does the retrieved information match the query?
- Faithfulness: Does the generated response accurately reflect the retrieved information?
- Coherence: Is the response well-structured and logically consistent?
Adding a custom LLM-as-judge evaluation can also provide quick insights.
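To make "start with something directional" concrete, here is a sketch of crude token-overlap proxies for relevance and faithfulness. These heuristics are assumptions for illustration only; embedding-based or LLM-as-judge evaluators would replace them in practice:

```python
def _tokens(text: str) -> set[str]:
    """Lowercase word tokens with trailing punctuation stripped."""
    return {t.strip(".,?!").lower() for t in text.split()}

def relevance(query: str, retrieved: str) -> float:
    """Directional proxy: fraction of query tokens found in the retrieved text."""
    q = _tokens(query)
    return len(q & _tokens(retrieved)) / len(q) if q else 0.0

def faithfulness(response: str, retrieved: str) -> float:
    """Directional proxy: fraction of response tokens grounded in the retrieved text."""
    r = _tokens(response)
    return len(r & _tokens(retrieved)) / len(r) if r else 0.0
```

Even rough scores like these let you rank outputs and spot regressions while you develop better evaluators.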
3. Experimentation and Iteration
With your dataset and evaluation metrics in place, it’s time to experiment. Identify the variables you can adjust, for example:
- Prompt variations
- Chunking strategies for RAG
- Embedding models
- Re-ranking techniques
- Foundation model selection
Run controlled experiments, changing one variable at a time. Use a tool like Parea to manage your experiments and analyze results.
The Parea SDK lets you reference your datasets, attach metadata, and run experiments programmatically. For example, you might test different foundation models for the generation portion of your RAG pipeline.
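A plain-Python sketch of such a one-variable-at-a-time experiment. The dataset, the stub "models", and the exact-match scorer are all stand-ins for illustration; with Parea you would run this through its experiment API instead:

```python
def eval_exact_match(output: str, target: str) -> float:
    """Toy evaluation metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == target.strip().lower() else 0.0

def run_experiment(model_fn, dataset):
    """Score one model over the whole dataset; everything else stays fixed."""
    scores = [eval_exact_match(model_fn(row["question"]), row["target"]) for row in dataset]
    return sum(scores) / len(scores)

dataset = [
    {"question": "What is the capital of France?", "target": "Paris"},
    {"question": "What is 2 + 2?", "target": "4"},
]

# Stand-in model functions; in practice these would call different foundation models.
models = {
    "model-a": lambda q: "Paris" if "France" in q else "5",
    "model-b": lambda q: "Paris" if "France" in q else "4",
}

# Only the model varies between runs; prompt, dataset, and metric are held constant.
results = {name: run_experiment(fn, dataset) for name, fn in models.items()}
```

Because only the model changed between runs, any score difference can be attributed to that variable.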
Analyze your results to identify patterns and areas for improvement. This iterative process forms the core of your development loop.
Advanced Techniques
As you become more familiar with the initial workflow, consider these advanced techniques to refine your application further. I won’t go into too much detail here, as each option could be its own post!
Human Annotation
When automated evaluations fall short, incorporate human annotation. Focus on specific aspects of performance and try to codify your decision-making process. With Parea, you can take as few as 20-30 manually annotated samples and use our eval bootstrapping feature to create a custom evaluation metric that is aligned with your annotations.
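As a simplified illustration of aligning an automated metric with human labels (not Parea's actual bootstrapping implementation), you can pick the decision threshold on a raw score that maximizes agreement with your annotators. The scores and labels below are made up:

```python
def best_threshold(scores, labels):
    """Find the score cutoff that best agrees with binary human labels."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Automated scores for six responses, plus human "is this a good response?" labels.
scores = [0.2, 0.4, 0.6, 0.9, 0.8, 0.3]
labels = [False, False, True, True, True, False]

threshold, agreement = best_threshold(scores, labels)
```

Even 20-30 labeled samples are often enough to calibrate a directional metric this way.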
Dynamic Few-shot Examples
Leverage your growing dataset to dynamically inject relevant examples into your prompts. Select high-performing examples, based on user feedback or evaluation metrics, to guide the model toward better responses.
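A minimal sketch of this selection step, assuming each logged example carries a quality score from feedback or evals (the example store and scores are hypothetical):

```python
# Hypothetical example store: logged Q&A pairs with a quality score attached.
examples = [
    {"question": "How do I cancel?", "answer": "Go to Billing > Cancel.", "score": 0.95},
    {"question": "How do I reset my password?", "answer": "Use the 'Forgot password' link.", "score": 0.90},
    {"question": "Is there a free plan?", "answer": "Yes, there is a Free tier.", "score": 0.40},
]

def build_few_shot_prompt(question: str, pool, k: int = 2) -> str:
    """Inject the k highest-scoring examples as few-shot demonstrations."""
    top = sorted(pool, key=lambda e: e["score"], reverse=True)[:k]
    shots = "\n\n".join(f"Q: {e['question']}\nA: {e['answer']}" for e in top)
    return f"{shots}\n\nQ: {question}\nA:"

prompt = build_few_shot_prompt("How do I upgrade my plan?", examples)
```

In production you might also filter the pool by similarity to the incoming question rather than score alone; the key idea is that the examples evolve with your dataset.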
Conclusion: Embracing Continuous Improvement
Building production-ready LLM applications is an ongoing process, not a one-time effort. By adopting a systematic workflow that emphasizes real-world data, robust evaluation, and continuous experimentation, you can rapidly iterate on and improve your LLM-powered products.
What strategies have you found effective in getting (and keeping) your LLM applications production-ready?