Because the website doesn't seem to show any sample size of runs, I assume they ran it once across the suite.
The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.
I don't see this as evidence that Opus 4.6 has gotten worse.
>I don't see this as evidence that Opus 4.6 has gotten worse.
I see it as evidence corroborating actual everyday experience.
Also, any reason to assume "BridgeBench", apparently dedicated to AI benchmarking, wouldn't have run it more than once across the suite?
I would love to know what you’re doing in the harness to not feel the total degradation in experience now in comparison to December & January.
> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.
And how is that an excuse?
I don't care about how good a model could be. I care about how good a model was on my run.
Consequently, my opinion on a model is going to be based around its worst performance, not its best.
As such, this qualifies as strong evidence that Opus 4.6 has gotten worse.
>> The models are nondeterministic, and therefore it's pretty normal for different runs to give different results.
> And how is that an excuse? […] this qualifies as strong evidence…
This qualifies as nothing, due to how random processes work; that's what the GP is saying. The numbers are not reliable if it's just one run.
If this is counter-intuitive, a refresher on basic statistics and probability theory may be in order.
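To make the single-run point concrete, here is a rough sketch of how much a benchmark score can move between runs even when the model's true pass rate never changes. The pass rate (0.8), task count (50), and function names are made up for illustration; nothing here comes from BridgeBench or any real benchmark, and real tasks are unlikely to be perfectly independent.

```python
import math
import random


def standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of an observed pass rate over n_tasks independent tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def simulate_run(pass_rate: float, n_tasks: int, rng: random.Random) -> float:
    """Score one simulated run: each task passes independently with probability pass_rate."""
    passed = sum(1 for _ in range(n_tasks) if rng.random() < pass_rate)
    return passed / n_tasks


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation itself is repeatable
    # Hypothetical setup: same "model" (true pass rate 0.8), 50 tasks, 20 separate runs.
    scores = [simulate_run(0.8, 50, rng) for _ in range(20)]
    print(f"lowest run:  {min(scores):.2f}")
    print(f"highest run: {max(scores):.2f}")
    print(f"standard error of one run: {standard_error(0.8, 50):.3f}")
```

Under these assumed numbers the standard error of a single run is several percentage points, so the gap between two single runs needs to be compared against that spread before it is read as a change in the model.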
No, what they're saying is the previous run could have just been lucky and not representative!
Are models really non-deterministic?
People are describing the results when they say models are non-deterministic. Give it the same exact input twice, and you'll get two different outputs. Deterministic would mean the same input always gives the same output.
Yes. Look up LLM "temperature" - it's a sampling parameter that tweaks how deterministically they behave.
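For anyone unfamiliar, a minimal sketch of what temperature does at the sampling step. This is a simplified illustration rather than any particular provider's implementation; `sample_token` and the example logits are hypothetical, and treating temperature 0 as a plain argmax is a common convention that is assumed here.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits using temperature-scaled softmax sampling."""
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy decoding (always the top logit).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the max before exp() for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
    print([sample_token(logits, 0, rng) for _ in range(5)])    # greedy: same index each time
    print([sample_token(logits, 1.0, rng) for _ in range(5)])  # sampled: indices can vary
```

Higher temperatures flatten the distribution, so the less likely tokens get picked more often; at temperature 0 this sampling step adds no randomness of its own.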
The models are deterministic, the inference is not.
Which is a useless distinction. When we say models in this context we mean the whole LLM + infrastructure to serve it (including caches, etc).
What does that even mean?
Even then, depending on the specific implementation, associativity of floating point could be an issue between batch sizes, between exactly how KV cache is implemented, etc.
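A small illustration of the associativity point, using ordinary Python floats (IEEE 754 doubles). The helper names are just for this example; the only claim is that regrouping the same additions can change the rounded result.

```python
def left_grouped(a: float, b: float, c: float) -> float:
    """Compute (a + b) + c."""
    return (a + b) + c


def right_grouped(a: float, b: float, c: float) -> float:
    """Compute a + (b + c)."""
    return a + (b + c)


if __name__ == "__main__":
    # Same three numbers, different grouping, different rounding.
    print(left_grouped(0.1, 0.2, 0.3), right_grouped(0.1, 0.2, 0.3))

    # A more extreme case: the small term is absorbed when added to a huge one first.
    print(left_grouped(1e16, -1e16, 1.0), right_grouped(1e16, -1e16, 1.0))
```

If batch size or cache layout changes how the partial sums are grouped, the same mechanism applies, just across many more terms.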
That's still an inference time issue. If you have perfect inference with a zero temperature, the models are deterministic. There is no intrinsic randomness in software-only computing.
Floating point associativity differences can lead to non-determinism with 0 temperature if the order of operations is non-deterministic.
Anyone with reasonable experience with GPU computation who pays attention knows that even randomness in warp completion times can easily lead to non-determinism due to associativity differences.
For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...
It is very well known among practitioners that CUDA isn't strongly deterministic due to these factors.
Differences in batch sizes of inference compound these issues.
Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order that items are sent to the reduce steps (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.
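The reduce-ordering effect can be mimicked on a CPU. This is only a sketch: on a GPU the order is decided by scheduling, whereas here it is varied by hand, and the function names and input values are chosen purely for illustration.

```python
import math


def sequential_sum(values):
    """Left-to-right reduction, as if items reached a single accumulator in list order."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Pairwise (tree) reduction: split in half, reduce each half, then combine."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])


if __name__ == "__main__":
    items = [1e16, 1.0, -1e16, 1.0]        # the "map" output: identical in every case
    reordered = [1e16, -1e16, 1.0, 1.0]    # same items, different arrival order

    print(sequential_sum(items))      # one arrival order
    print(sequential_sum(reordered))  # another arrival order
    print(tree_sum(items))            # same items arranged in a reduction tree
    print(math.fsum(items))           # correctly rounded reference sum
```

All four calls reduce the same multiset of numbers, yet the three naive reductions disagree with each other; only the arrival order or tree shape differs.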
My point is, your inference process is the non-deterministic part; not the model itself.
Eh, if you have a PyTorch model that uses non-deterministic tensor operations like matrix multiplications, I think it is fair to call the model non-deterministic, since the matmul is not guaranteed to be deterministic - the non-determinism of a matmul isn't a bug but a feature.
See e.g. https://discuss.pytorch.org/t/why-is-torch-mm-non-determinis...
Benchmarks like this one are designed to thoroughly test the model across several iterations. 15% is a MASSIVE discrepancy.
Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we just all feel short-changed.
Computational semiotics has been empirically proven. Model releasing soon. In the meantime, for the love of god someone recognize this and help blow these numbers out of the water.
https://open.substack.com/pub/sublius/p/the-semiotic-reflexi...