6 comments

  • noemit 7 hours ago

    To answer the headline - No.

    I find that working on and adjusting my prompts and context is way higher value than A/B testing LLMs.

    After all, I never expect 100% accuracy.

    I feel like they are reaching commodity status, and the result quality is so similar that it just doesn't really matter which one you use.

    • k11kirky 6 hours ago

      Thanks! Do you think there is value in evals? And do you use evals while testing prompts?

      • hypoxia87 3 hours ago

        Jumping in based on our experience in case it's helpful:

        - Evals are very useful and a core part of best practice for creating LLM apps (minimal sketch after this list)

        - There are already excellent solutions for model and prompt eval (etc.), including Parea from YC and many others
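
        For concreteness, here's a minimal sketch of the kind of eval harness we mean: plain exact-match scoring over a tiny dataset via the OpenAI Python SDK. The model name, prompt template, and test cases are illustrative placeholders, not from any client project.

          # Minimal prompt-eval sketch. Assumptions: OpenAI Python SDK
          # (openai >= 1.0) installed, OPENAI_API_KEY set in the
          # environment, and placeholder model/prompt/cases.
          from openai import OpenAI

          client = OpenAI()

          PROMPT = "Answer with a single word. Question: {q}"
          CASES = [
              {"q": "What is the capital of France?", "expected": "Paris"},
              {"q": "What is 2 + 2?", "expected": "4"},
          ]

          passed = 0
          for case in CASES:
              resp = client.chat.completions.create(
                  model="gpt-4o",
                  messages=[{"role": "user", "content": PROMPT.format(q=case["q"])}],
              )
              answer = resp.choices[0].message.content.strip()
              # Loose exact-match: pass if the expected token appears.
              if case["expected"].lower() in answer.lower():
                  passed += 1

          print(f"{passed}/{len(CASES)} cases passed")

        When exact match is too brittle, swap the scoring line for a regex or an LLM-as-judge call; the loop structure stays the same.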

    • hypoxia87 7 hours ago

      +1

      There are already many prompt/LLM routers available.

      We've never found value in them, for reasons similar to those mentioned above.

      • k11kirky 6 hours ago

        That's great feedback, thank you! I would then assume most people would end up using open-source models, as they tend to be cheaper and have faster inference.

        • hypoxia87 3 hours ago

          We've actually found the opposite: every client project has been based on GPT-4 or Gemini, with one exception, a highly sensitive use case that used Llama 3.1.

          The main reason is that the APIs represent an excellent cost/performance/complexity tradeoff.

          Every project has relied primarily on the big models because the small models just aren't as capable in a business context.

          We have found that GPT-4o is very fast when that's necessary (often it's not), and it's also very cheap (GPT-4o via the Batch API is ~96% cheaper than the original GPT-4). And where cost is a concern and reasoning doesn't need to be as good as possible, GPT-4o mini has been excellent too.
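
          Back-of-the-envelope on that ~96% figure, using list prices we're assuming here (per 1M input tokens), not numbers anyone posted in this thread:

            # Rough savings estimate. Assumed list prices per 1M input
            # tokens (our assumption, not quoted from this thread):
            # original GPT-4 at $30.00, GPT-4o at $2.50, and the Batch
            # API at a 50% discount.
            gpt4 = 30.00
            gpt4o_batch = 2.50 * 0.5  # $1.25 with the batch discount

            print(f"~{1 - gpt4o_batch / gpt4:.0%} cheaper")  # ~96% cheaper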