Assuming this is real and much better than existing linear attention methods as advertised, not launching with a technical report is a big miss.
Edit: their blog post (https://subq.ai/how-ssa-makes-long-context-practical) does go pretty in-depth about it
Edit 2: the fact that they're going straight for an end-to-end coding product on day 1 is very ambitious. Other speed/efficiency-oriented AI companies (Cerebras and Inception come to mind) still don't have a first-party coding product after years. IMO this is absolutely the right way to go if they really do have the big breakthrough they're claiming.
> The core idea is content-dependent selection. For each query, the model selects which parts of the sequence are worth attending to, and computes attention exactly over those positions.
I don't know if this will help for things like understanding code, where the relevant part can be the entire 1000-line file we are analyzing, and where every token matters for understanding recursion, loops, function calls, etc.
This sounds like it would be great to do SSA before passing things along to a code model like Claude Code.
Let me know if I misunderstood
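To make the quoted sentence concrete, here is a minimal sketch of content-dependent selection, assuming a plain top-k rule over scaled dot-product scores. This is not taken from the blog post: the function name `topk_sparse_attention` and the budget `k` are hypothetical, and the post does not say that SSA selects positions this way. Note that this toy version still scores every key for every query, so it is not subquadratic by itself; it only illustrates the "attend exactly over the selected positions" half of the idea.

```python
import math

def topk_sparse_attention(queries, keys, values, k):
    """For each query, attend exactly over its k highest-scoring keys.

    Hypothetical illustration only: top-k selection is an assumption,
    not the selection rule described by the SSA authors.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Score every key against this query (scaled dot product).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Content-dependent selection: keep the k best-scoring positions.
        selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Exact softmax, restricted to the selected positions.
        top = max(scores[i] for i in selected)
        weights = [math.exp(scores[i] - top) for i in selected]
        total = sum(weights)
        outputs.append([
            sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(dim)
        ])
    return outputs
```

With `k` equal to the number of keys this reduces to ordinary dense attention, which is one way to see the concern above: if every token in the file really is relevant, the selection step has nothing it can safely drop.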
I wonder how different their method actually is from other sub-quadratic sparse attention methods like Reformer [1] and Routing Transformer [2].
[1]: https://arxiv.org/abs/2001.04451
[2]: https://arxiv.org/abs/2003.05997
I’m very surprised this isn’t getting more attention. Am I missing something?
It seems at or above SOTA on the given benchmarks, doesn’t have context rot, is orders of magnitude faster, and uses less compute than current transformer models. I suppose it’s just an announcement and we can’t test it ourselves yet.
We are SOTA in some ways and not in others, continuously working to make it better! We need a little more time to scale, as we are working on things like disaggregated prefill, etc., the norms of large-scale model infra.
I am happy to answer any questions!
This seems super cool if as described, but I'm sure you can understand the skepticism.
Do you anticipate having any kind of publicly accessible chat interface for testing in the near future?
Also, what, if any, benefits are there for smaller context windows? Is there still a material improvement in cost to serve under say 256k? I'm curious about the broader implications for the space beyond improvements for very large context windows.
I do, for sure! Yes, we have a few product rollouts lined up. The differentials for latency are posted in our blog post, so that should provide an idea of where the scaling law differentials kick in.
The proof is in the pudding. At this point, there have been plenty of models that overperformed on benchmarks and underperformed on real work. So my stance is that I'm curious, I'm excited to see where it goes, and I don't believe it until I can try it.
I agree, it's a real architectural breakthrough if true
no one has access to it yet
no published benchmarks
no paper
no demonstrations of capabilities
- magic.dev claimed 200M context window and it's been two years since and no real product yet.
- They are admitting that this is built on top of a Chinese model[1]
- They committed a huge chart crime with the Y axis of a chart comparing to Opus on their website that I can't find anymore (Too embarrassing to keep?). The delta between their score (81%) vs. Opus (87%) on SWE bench was hugely minimized
- They named the company subquadratic but in parts they said O(1) linear scaling. At O(1) you could do much more than 12M tokens context window. At O(log n) even.
I hope this is real but I doubt...
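For a sense of scale on the complexity point, here is a back-of-the-envelope sketch that counts query-key pairs only. The per-query budget `k=4096` and the helper `attention_pairs` are made up for illustration and are not figures from the company; the count also ignores whatever the selection step itself costs, as well as constant factors.

```python
def attention_pairs(n, k=None):
    """Query-key pairs scored: n*n for dense attention, n*k if each query attends to k positions."""
    return n * n if k is None else n * min(k, n)

for n in (256_000, 1_000_000, 12_000_000):
    dense = attention_pairs(n)
    sparse = attention_pairs(n, k=4096)  # k=4096 is an arbitrary illustrative budget
    print(f"n={n:>10,}  dense={dense:.2e}  sparse={sparse:.2e}  ratio={dense / sparse:,.0f}x")
```

Under these assumptions the gap is simply n/k, so it keeps widening as the context grows and shrinks to nothing once n is at or below k.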
Ah, I nearly forgot about magic.dev. I took a quick peek to check up on them. Welp, last social/blog activity was in... 2024. But hey, their careers page still says they're hiring! So they must be doing just fine.
Whether this is real or not, multiple commenters here look like astroturfers - created in the past year (or hours) with very low karma
There are some comments which are identical to comments on X as well. That is not to say the frontier labs do not engage in highly unethical marketing, but this is a little bit too obvious.
This is pretty remarkable. We've spent a lot of time finding workarounds for LLMs reading long docs. Now that's gone.
Looks like long context isn’t a problem anymore
Neither are cost and latency, in the long term. LLMs ultimately become more economically viable than they are now, and broaden the scope of every existing LLM-driven application (particularly STS, conversational AI, etc.)
if it's true then it's a breakthrough.
optimizing AI in general. How cool is that?