That's great feedback - thank you! I'd assume, then, that most people end up using open-source models, since they tend to be cheaper with faster inference.
We've actually found the opposite: every client project has been based on GPT-4 or Gemini, with one exception for a highly sensitive use case based on Llama 3.1.
The main reason is that the APIs represent an excellent cost / performance / complexity tradeoff.
Every project has relied primarily on the big models because the small models just aren't as capable in a business context.
We have found that GPT-4o is very fast when that's necessary (often it's not), and it's also very cheap (GPT-4o via the Batch API is ~96% cheaper than the original GPT-4). And where cost is a concern and reasoning doesn't need to be as good as possible, GPT-4o mini has been excellent too.
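For context, batch pricing works by submitting many requests as one JSONL file rather than calling the API per request. A minimal sketch of assembling such a file (the prompts here are placeholders, and the upload/submit step against the API is omitted):

```python
import json

# Each line of the batch input file is one JSON request object.
# custom_id lets you match responses back to inputs later.
def build_batch_file(prompts, path, model="gpt-4o"):
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"task-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")

build_batch_file(
    ["Summarize this support ticket.", "Classify this email."],
    "batch_input.jsonl",
)
```

The resulting file is what you'd upload and reference when creating the batch job; responses come back keyed by `custom_id`, so ordering doesn't matter.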
To answer the headline - No.
I find that working on and adjusting my prompts and context is far higher value than A/B testing LLMs.
After all, I will never expect 100% accuracy.
I feel like they're reaching commodity status, and the result quality is so similar that it doesn't really matter which one you use.
Thanks, do you think there is value in evals? / do you use evals while testing prompts?
Jumping in based on our experience in case it's helpful:
- Evals are very useful and a core part of the best practice for creating LLM apps
- There are already excellent solutions for model and prompt eval (etc.) including Parea from YC and many others
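To gesture at what an eval looks like in practice, here's a minimal sketch of a prompt-eval loop. The `run_model` stub and the test cases are placeholders - in a real setup that function would call your LLM API, and the scoring might use fuzzier matching:

```python
# Minimal prompt-eval sketch: run each test case through the model
# and score exact-match accuracy against an expected answer.

def run_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM API call here.
    return "positive" if "love" in prompt else "negative"

def evaluate(prompt_template: str, cases: list[tuple[str, str]]) -> float:
    correct = 0
    for text, expected in cases:
        answer = run_model(prompt_template.format(text=text)).strip().lower()
        correct += answer == expected
    return correct / len(cases)

cases = [
    ("I love this product", "positive"),
    ("This was a waste of money", "negative"),
]
score = evaluate("Classify the sentiment of: {text}", cases)
print(f"accuracy: {score:.0%}")
```

Running the same case set against two prompt variants (or two models) gives you a number to compare, which is the whole point - it turns "this prompt feels better" into something you can track.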
+1
There are already many prompt/LLM routers available.
We've never found value in them, for similar reasons as mentioned above.