How are you approaching this? I assume it's some combo of unit tests and integration tests where you're making sure the response is generally consistent across multiple runs of the same prompt, and that when you do need to change the prompt, the results still match what you had before.
From what I've seen using LLMs, your best bet is to have evals for ~100 (if possible) examples with known ground truths. That way you get a statistical read on how accurately your prompt is performing, and more examples give you a more precise estimate when hallucinations creep in.
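A minimal sketch of what that eval loop can look like, assuming you keep ~100 labeled examples in a JSON file and have a `call_llm()` wrapper around whatever provider you use (both hypothetical here):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around your provider's SDK.
    raise NotImplementedError("wrap your provider's API here")

def run_evals(examples: list[dict], prompt_template: str) -> float:
    """examples: [{"input": ..., "expected": ...}, ...] with known ground truths."""
    correct = 0
    for ex in examples:
        response = call_llm(prompt_template.format(input=ex["input"]))
        # Exact-match scoring only works for short/structured answers;
        # swap in a fuzzier comparison for anything qualitative.
        if response.strip() == ex["expected"].strip():
            correct += 1
    return correct / len(examples)

examples = json.load(open("evals.json"))  # ~100 labeled examples (hypothetical file)
print(f"accuracy: {run_evals(examples, 'Classify this ticket: {input}'):.2%}")
```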
Unfortunately, things get a little harder with qualitative responses, where you're expecting certain words or sentences in the output. Your best bet there is also ~100 examples of what you expect the response to look like, plus some form of semantic similarity comparison between the response and your ground truth (see the sketch below).
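One way to sketch that semantic-similarity check is embeddings plus cosine similarity, e.g. via the sentence-transformers package; the model name and the 0.8 threshold here are just placeholders you'd tune for your own data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def semantically_close(response: str, ground_truth: str, threshold: float = 0.8) -> bool:
    """Embed both texts and compare with cosine similarity."""
    emb = model.encode([response, ground_truth], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return score >= threshold
```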
You want to do evals, yeah.
Have the LLM evaluate its own response: user → LLM → LLM (validates its own response) → user.
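A rough sketch of that self-validation pass, again assuming a hypothetical `call_llm()` wrapper; the judge prompt and the single retry are just illustrative:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around your provider's SDK.
    raise NotImplementedError("wrap your provider's API here")

def answer_with_self_check(user_prompt: str) -> str:
    draft = call_llm(user_prompt)
    verdict = call_llm(
        "You are reviewing another model's answer.\n"
        f"Question: {user_prompt}\n"
        f"Answer: {draft}\n"
        "Reply PASS if the answer is correct and complete, otherwise explain the problem."
    )
    if verdict.strip().startswith("PASS"):
        return draft
    # One retry with the critique folded back in; loop or escalate as needed.
    return call_llm(
        f"{user_prompt}\n\nA reviewer flagged this issue: {verdict}\nPlease answer again."
    )
```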