I used to be so bullish on Cerebras, being pretty certain that specialized chips would eventually dethrone NVidia for inference hardware.
Multiple years have passed since then. GPT-5.3 Codex Spark got me excited again, but somehow that seems to have faded right back into obscurity.
Can somebody explain why there’s so little apparent progress here despite the theoretically massive advantage? Can I still expect this to happen eventually?
Because Cerebras handles large models poorly due to latency/bandwidth issues to main memory. See https://openai.com/index/introducing-gpt-5-3-codex-spark/ where its performance is significantly below that of the regular Codex 5.3, and it can only handle a 128k text context window. For some use cases it's great, but most people would rather use a better, slower model.
In the future, they plan hybrid implementations to serve large models better, e.g.
"AWS. We signed a binding term sheet with Amazon Web Services for AWS to become the first hyperscaler to deploy Cerebras systems in its data centers. Deployment in AWS data centers will require us to meet strict standards for performance, scale, and reliability.Pursuant to the term sheet, we will create a co-designed, disaggregated inference-serving solution that will integrate AWS Trainium3 chips with Cerebras CS-3 systems, connected via high-bandwidth networking, to partition inference workloads across Trainium3 and CS-3. Each system will perform the type of computation at which it most excels. The approach is expected to deliver 5 times more token throughput in the same hardware footprint, at up to 15 times faster speeds compared to leading GPU-based solutions as benchmarked on leading open-source models."
Unclear if it's the only cause, but wafer scale is great for very low latency while losing on throughput per dollar compared to classic GPUs like Nvidia's. I don't think they can close that gap: SRAM is just more expensive than HBM, and their architecture needs a lot of it.
So the price makes it necessarily niche, confined to specific use cases like HFT or intelligent duplex voice assistants. I'm still semi-bullish personally.
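To put the SRAM-versus-HBM gap in rough numbers, here's a back-of-envelope sketch. The wafer cost and HBM price per GB are assumed round figures, not vendor pricing; the 44 GB is Cerebras's advertised on-chip SRAM for the WSE-3.

    # Back-of-envelope: cost per GB of memory holding model weights, SRAM vs HBM.
    # All dollar figures are assumed round numbers for illustration only.
    wafer_cost_usd = 20_000      # assumed cost of a processed leading-edge wafer
    sram_gb_per_wafer = 44       # Cerebras advertises ~44 GB on-chip SRAM on WSE-3
    hbm_usd_per_gb = 15          # assumed HBM street price per GB

    sram_usd_per_gb = wafer_cost_usd / sram_gb_per_wafer
    model_gb = 1_000             # illustrative footprint for weights + KV cache

    print(f"SRAM: ~${sram_usd_per_gb:,.0f}/GB -> ~${sram_usd_per_gb * model_gb:,.0f} for {model_gb} GB")
    print(f"HBM:  ~${hbm_usd_per_gb:,.0f}/GB -> ~${hbm_usd_per_gb * model_gb:,.0f} for {model_gb} GB")

Even with charitable assumptions, on-wafer SRAM comes out more than an order of magnitude pricier per GB of capacity, which is the pressure that keeps it niche for serving very large models.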
The initial cost of serving is very high, and while it's super performant, it's not great for scaling up.
In practice they are also not very flexible compared to GPUs.
Taalas seems to be pretty good. This is their demo: https://chatjimmy.ai/
This might be even more limited though. They can't physically fit a large model on a single chip.
Not yet. But it is definitely limited, since they can essentially only serve a single model.
I don't think I'd buy this. It's the exact inverse of the mechanism that made AMD such a compelling investment.
I think wafer scale could improve model performance and has some applications, but from a manufacturing perspective this approach seems cursed. Defect density is almost irrelevant when your target is an actual barn door: no matter how low it is, a wafer-sized die will have defects somewhere. You can make the system resilient to defects, but the tradeoff is that you have to hedge for defects being anywhere. With chiplets, you accept that some units of space will be completely unusable; the tradeoff is that the others are much higher performance, because you don't have to spend any space or time on redundancy.
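For intuition, the barn-door point follows from a simple Poisson yield model, where P(defect-free die) = exp(-area x defect density). The defect density and die areas below are rough illustrative values, not foundry data.

    import math

    # Poisson yield model: probability a die has zero defects = exp(-area * D0).
    # D0 and the die areas are assumed, illustrative values, not foundry data.
    d0 = 0.1                                   # defects per cm^2, assumed
    dies = {
        "small chiplet":     0.8,              # cm^2
        "reticle-limit die": 8.0,              # cm^2, roughly a big GPU die
        "wafer-scale die":   462.0,            # cm^2, roughly the WSE-3 (~46,225 mm^2)
    }

    for name, area_cm2 in dies.items():
        expected_defects = area_cm2 * d0
        p_clean = math.exp(-expected_defects)
        print(f"{name:18s} expected defects ~{expected_defects:5.1f}, P(defect-free) ~{p_clean:6.1%}")

At any plausible defect density a wafer-scale die will contain dozens of defects, so fault tolerance has to be budgeted everywhere, while chiplet-sized dies mostly come out clean and the bad ones can simply be discarded.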
We need more personal-level AI solutions instead of so many corporate-centered ones.