We help companies build & improve their AI products with our hands-own services. Request a consultation here.
1. Hypothesis Testing with Prompt Engineering
Start simple. Begin with one of the leading foundation models and test your hypothesis using basic prompt engineering. The goal here isn’t perfection but establishing a baseline and confirming that the LLM can produce reasonable responses for your use case. If you already have a prompt/chain in place, skip to the next step. Don’t spend too much time engineering your prompt at this stage. After you collect more data, you can make hypothesis-driven tweaks.2. Dataset Creation from Users
Use an observability tool to collect real user questions and responses. It’s okay if you aren’t collecting feedback yet; first, you want a diverse dataset to iterate on. Tools like Parea can simplify this process. Use our py/ts SDK to quickly instrument your code, and then use your logs to create a dataset. The next step is to establish an evaluation metric.
Developing effective evaluation metrics is crucial but challenging. Start with something directional to help narrow your focus.
For RAG (Retrieval-Augmented Generation) applications, consider these common evaluation techniques:
- Relevance: How well does the retrieved information match the query?
- Faithfulness: Does the generated response accurately reflect the retrieved information?
- Coherence: Is the response well-structured and logically consistent?
python
3. Experimentation and Iteration
With your dataset and evaluation metrics in place, it’s time to experiment. Identify the variables you can adjust, for example:- Prompt variations
- Chunking strategies for RAG
- Embedding models
- Re-ranking techniques
- Foundation model selection
Easily reference your datasets, add metadata, and run experiments via the SDK.
Analyze your results to identify patterns and areas for improvement. This iterative process forms the core of your development loop.

