Isn’t this basically the Swiss cheese model? If your two input AIs hallucinate, or your consensus AI misunderstands the input, you will still have confabulations in the output?
In all my testing, this has honestly never happened even once. Plus, the judge model (which I've kept strictly a reasoning model) also evaluates each answer individually before "judging" the consensus.
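The two-stage judging described above (score each worker answer on its own, then rule on the consensus) can be sketched roughly like this. All function names and the scoring heuristic are made up for illustration; in the real setup each stand-in would be a call to the reasoning model:

```python
# Hypothetical sketch of the judge flow: evaluate each worker answer
# individually first, then judge the consensus among the survivors.
# The scoring function and 0.5 threshold are assumptions, not OP's values.

def judge_individually(answer: str) -> float:
    """Stand-in for a reasoning-model call that scores one answer (0-1)."""
    return 0.9 if "valid" in answer else 0.4

def judge_consensus(answers: list[str]) -> str:
    """Only answers that pass the individual check reach the consensus step."""
    vetted = [a for a in answers if judge_individually(a) >= 0.5]
    # Among the survivors, return the best-scoring answer as the consensus.
    return max(vetted, key=judge_individually) if vetted else "no consensus"

print(judge_consensus(["valid proof via induction", "hand-wavy guess"]))
# → valid proof via induction
```

The individual pass matters because it lets the judge discard a hallucinated worker answer before it can drag the consensus step toward the wrong result.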
I have this same thought, and have tried similar approaches.
OP: Have you trained or fine-tuned a model that specifically reasons over the worker models' outputs against the user input? Or is this basically just taking a model and turning the temperature down to near 0?
Low temperature plus heavy prompting to answer in a structured way. Sadly I can't fine-tune the models since this is API-based, but the approach does work!
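The "low temperature + structured prompting" setup could look something like the sketch below. The section headings, the temperature value, and the prompt wording are all my own assumptions for illustration, not OP's actual prompt:

```python
# Minimal sketch: near-zero temperature for determinism, and a prompt
# template that forces the judge to answer in fixed sections.
# TEMPERATURE and the section names are hypothetical.

TEMPERATURE = 0.1  # near-zero keeps the judge's output stable across runs

def build_judge_prompt(user_question: str, worker_answers: list[str]) -> str:
    """Assemble a structured prompt listing the candidate answers."""
    numbered = "\n".join(
        f"Answer {i + 1}: {a}" for i, a in enumerate(worker_answers)
    )
    return (
        "You are a judge model. Respond using exactly these sections:\n"
        "VERDICT / REASONING / FINAL ANSWER.\n\n"
        f"Question: {user_question}\n\n"
        f"Candidate answers:\n{numbered}"
    )

print(build_judge_prompt("What is 2+2?", ["4", "5"]))
```

The string this builds would be sent as the judge's prompt via whatever chat-completion API the worker models are behind, with `temperature` set near 0.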
I’m genuinely interested in how you arrived at the concept of using AI as a method to treat hallucinations. What inspired that approach?
Honestly, personal use cases. I'm a STEM student and deal with a lot of "hard" questions that LLMs miscalculate about 60% of the time. I used to manually paste approaches from, say, ChatGPT into DeepSeek (and now Grok) and ask which one was better. I created this out of necessity to automate that, then realized how cool it could be if it scales further haha
Not OP, but LLM-as-a-Judge is a thing: https://arxiv.org/abs/2411.15594
Very interesting. Will this be available as a meta model via API, allowing use in the coding tool of my choice?
Eventually yes, that's the plan! It's extremely good with code too, especially for vaguer requests; it tends to take about 2-3 rounds but almost always lands on a great approach.
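The "2-3 rounds" behavior above suggests a simple refinement loop: keep sending the judge's feedback back to the workers until the judge accepts or a round cap is hit. This is a rough sketch under that assumption; the accept condition, the round cap, and both stand-in functions are hypothetical:

```python
# Hypothetical sketch of a multi-round consensus loop. In the real
# pipeline, refine() would re-query the worker models with the judge's
# feedback, and judge_accepts() would be a reasoning-model verdict.

def refine(answer: str) -> str:
    """Stand-in for one round of worker revision based on judge feedback."""
    return answer + " (refined)"

def judge_accepts(answer: str) -> bool:
    """Stand-in for the judge's accept/reject decision."""
    return answer.count("(refined)") >= 2  # toy rule: accept after 2 passes

def run_rounds(initial: str, max_rounds: int = 3) -> tuple[str, int]:
    """Iterate until the judge accepts or the round cap is reached."""
    answer = initial
    for round_no in range(1, max_rounds + 1):
        if judge_accepts(answer):
            return answer, round_no
        answer = refine(answer)
    return answer, max_rounds

final, rounds = run_rounds("draft plan")
print(final, rounds)
# → draft plan (refined) (refined) 3
```

Capping the rounds is the design choice that keeps cost bounded: a vague coding request that never converges still terminates after `max_rounds` API passes.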