Probably a testament to how good Qwen3.6 is, considering Qwen3.6-35B-A3B is not only ahead of their similar-weight-class XS.2 but also ahead of their M.1 (close to 10x bigger at 225B-A23B).
Interestingly, Gemma 4 26B-A4B and Qwen3.6 27B (dense) have been left out of the comparison.
The smaller models are becoming very good, and quantization techniques like importance weighting and TurboQuant on model weights let you run aggressively quantized versions (IQ2, TQ3_4S) on consumer hardware with very acceptable perplexity and quality loss.
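If you want to try this yourself, here's a minimal sketch using llama-cpp-python; the model filename is hypothetical, and TQ-style quants would need a llama.cpp build that supports them:

    # Minimal sketch: running an aggressively quantized GGUF locally.
    # The filename is hypothetical; substitute whatever quant you have.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3.6-35b-a3b-IQ2_XS.gguf",  # hypothetical file
        n_ctx=8192,        # context window; raise if RAM allows
        n_gpu_layers=-1,   # offload all layers to the GPU if present
    )

    out = llm("Explain MoE routing in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])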
Very exciting times for local LLMs.
Designer @ Poolside here. Thank you for all the feedback. In particular, we rebuilt all the eval charts to be clearer: https://poolside.ai/blog/laguna-a-deeper-dive
Been testing these via their "pool" agent. It's fast, and the agent adheres to the ACP spec pretty well (better than codex, opencode, etc.), so it's a good experience in Zed.
I've tried their "shimmer" site https://shimmer.poolside.ai (seems in the same vein of products as AI Studio/ Repl.it)
Either the harness or the models are very bad in it, I'd say they feels less capable than Gemma4-E2B in virtually any harness. The larger model would plan out some steps and never actually perform them even when prompted to several times. The smaller model actually got more done. My guess is it's the harness, since you've had a good experience. Haven't tried the pool cli yet.
The fact they're shipping the actual agent harness alongside the weights is the part that matters. Most labs dump the model and make you figure out the agent layer yourself. If it's the same runtime they use for RL training, it's actually been exercised in production rather than being some demo wrapper.
Felt like they would never come out of stealth mode, but very nice to see it materialize into something competitive.
Not sure this is competitive; look at the numbers for Qwen3.6.
What makes them distinctive?
Answering my own question. tl;dr: they speak enterprise; we'll train using YOUR code and run on YOUR stuff, and we have a Model Factory.
Pelicans via OpenRouter - the M.1 one is better, neither are particularly great though: https://gist.github.com/simonw/382464026d2e3535986e06437fb6d...
I actually like the XS one. It's broken, but it's very aesthetic
For similarly sized models, not looking very good on the slightly-less-benchmaxxed Terminal-Bench 2.0:
Quite a huge lead for Qwen... well, at least it's catching up to other smaller Western labs. Need to look at SWEBench-Pro; it's super competitive there. Suspect they'll catch up given the longer tail on TB scores.
Just by the (lack of) inter-model variance, I don't think SWEBench-Pro does a very good job of representing model capability. Terminal-Bench seems more challenging and separates the wheat from the chaff.
Also, *ops work, which in my experience can actually be more complicated than SWE, is obviously underrepresented there.
Has anyone tried these models?
I like their honesty in the benchmarks; it looks like Qwen3.6 35B is outperforming their Laguna M.1 225B model.
I'm not sure I understand why Poolside are training their own models at all - what's the perceived or real advantage of splitting up model training efforts into smaller companies and dividing up resources like this? Is it just to have a US-domiciled LLM lab?
Limiting resources steers research towards different routes.
The colors used in the charts are borderline criminal
The order of the bars does not even follow the order in the legend unless I'm mistaken, that's insane.
Please update the charts. Consider using textures or filling patterns.
I usually score pretty well in colour perception tests but distinguishing between those two purples made me doubt myself.
We got a lot of feedback about this. We've just pushed new chart components live across the site – thanks. https://poolside.ai/blog/laguna-a-deeper-dive
My phone is in grayscale to make it less interesting (I still watch way too many videos in grayscale, but it helps), so I'm right with you.
Very cool to see more small open models being worked on!
One nit: I've seen on this homepage, and many others, this notion that the people behind the models are "working towards AGI".
I get that this is marketing speak, but transformers are not AGI, and they will never be AGI, so it'd be great if people stopped saying that as it sort of wears out the meaning of "working towards AGI".
These people worked on applying ML to code, and on producing code using ML, before the transformer paper even came out (2017).
https://web.archive.org/web/20170629103718/https://blog.sour...
> but transformers are not AGI, and they will never be AGI
Like the claim "transformers are AGI", this needs proof, otherwise should be prefixed "I think". And honestly, positive proof is easier than negative proof (you just need to make one transformer model that is a AGI, whereas the never claim requires you to enumerated all possibilities).
That's like saying we should wait for positive proof of AGI from combustion engines. That'll never happen, no matter how much you tweak the engine. It's just not possible.
The negative proof is there in the definition itself. Transformers are not AGI, they're frozen human intelligence of the autocomplete variety. That can never be AGI and anyone who says otherwise doesn't understand transformers or AGI.
This kind of proof isn't really as watertight as you claim. It's a lot like saying state machines are limited to processing regular expressions, while completely ignoring how easy it is to add a stack or linear memory to a state machine to make it a PDA or Turing machine.
So yes, LLMs can be trivialized as just randomized autocomplete, but if you add a database or memory to the side, even very basic MLPs can become a Turing machine. It's going to take a lot more proof to say a Turing machine could never be intelligent. And you can do more than just give the LLM side memory: you can invoke them recursively, use message passing as coroutines, and so on...
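To make that concrete, here's a toy sketch: a frozen transition table (standing in for a frozen model) plus an external tape is already a Turing machine. The loop and the side memory supply the missing machinery, not runtime learning:

    # Toy sketch: a frozen transition function plus external memory (a
    # tape) is a Turing machine. `delta` is a lookup table standing in
    # for a frozen model call; same inputs always give the same output,
    # exactly like an LLM decoding greedily from fixed weights.

    def run_tm(delta, tape, state="start", blank="_", max_steps=1000):
        head = 0
        for _ in range(max_steps):
            if state == "halt":
                break
            symbol = tape.get(head, blank)
            state, write, move = delta[(state, symbol)]
            tape[head] = write
            head += 1 if move == "R" else -1
        return tape

    # Transition table for flipping bits until the blank is reached.
    delta = {
        ("start", "0"): ("start", "1", "R"),
        ("start", "1"): ("start", "0", "R"),
        ("start", "_"): ("halt", "_", "R"),
    }

    tape = dict(enumerate("1011"))
    print(run_tm(delta, tape))  # {0: '0', 1: '1', 2: '0', 3: '0', 4: '_'}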
You might be technically correct if you ignore anything other than the very restrictive definitions you're using, but even there I'm not certain. If you had an LLM with a trillion-token window, is that good enough to act as a memory? Human brains aren't infinite either.
Agreed. It is nonsensical to argue that a 3B transformer hard-capped to decode 100 tokens is "intelligent". Of course, when we evaluate whether "transformers" are intelligent or not, we are talking about taking transformers as a core part of the system and enhancing them with other means. As you said, it is pretty trivial to make transformers into a Turing machine, hence able to carry out any computation, including intelligence (if you are in the camp that intelligence is computable; I don't think it makes sense to argue with anyone who believes intelligence is not computable).
Lol, I totally agree about anyone using the non-computable angle.
However, I've got a 20GB GGUF file on my disk that can write code better than 99% of the people I ever worked with in the last 25 years, and ravens seem pretty clever with about 2 billion neurons... I have no idea what the lower bound is.
Fun to think about though :-)
You are super positive that transformers can't become AGI, wow. Care to explain how atoms _can_ become AGI?
Oh! would you mind explaining that out a bit? :)
See the adjacent thread with @altruios.
What does AGI mean to you?
Transformers have approximate knowledge of many things. Is this not 'general'? Where is the goalpost here?
> Transformers have approximate knowledge of many things. Is this not 'general'?
Of course not. That's like saying the Encyclopedia Britannica is AGI.
> What does AGI mean to you?
I would define AGI as human-like machine intelligence (or superior).
This is difficult for some people to understand because they don't understand what "human-like" means in the first place. Neuroscientists would be able to set some of these wayward computer scientists straight on this question.
> human-like
But is that a hard requirement? Can a machine have Rat-like intelligence? Is all intelligence human-like (human-centric-mind-blindness-much?)?
> Of course not. That's like saying the Encyclopedia Britannica is AGI.
Well, I'd classify that as GK, general knowledge. Not artificial or intelligent.
Let's consider a definition of intelligence as the act of 'manipulating data'. Have you a better general definition of intelligence?
> But is that a hard requirement?
Yes.
> Can a machine have Rat-like intelligence?
Yes, and that would be closer to AGI than today's LLMs, because the fundamental principles and architecture are there.
Okay. So to be clear, you believe that replicating/templating a brain is the ONLY way to make an intelligent machine?
What makes you think that? That there are no other patterns of intelligence?
I can see how that would be implied by my comments so you're right to question that.
The principles that are found in the brain are what gives qualification to "AGI", not the brain itself, so it's possible there are other architectures that would qualify.
A few observations on LLMs that give the game away:
- They require releases. You get a single binary blob and that blob is forever stuck at its so-called "intelligence" level. It never learns anything new.
- They're stuck approaching the limit of human intelligence. This is because the technique cannot exceed human intelligence. I realize that OpenAI has made claims to the contrary, saying things like "oh our model found out some proof that was never proven before" — this doesn't count. It's a side effect of training on the Internet. In fact that proof probably did exist (in pieces) somewhere on the Internet, it just wasn't widely publicized.
So, you'll know it's AGI when you no longer see companies releasing new models. AGI won't require new models because the architecture will be what matters as whatever models you have will be constantly updating themselves in real-time, just like the human brain does (and every other brain).
And, you'll start to see the AIs actually outsmarting the smartest humans on the planet in every subject.
> - They require releases. You get a single binary blob and that blob is forever stuck at its so-called "intelligence" level. It never learns anything new.
True. But learning isn't the same thing as intelligence. My father, who has dementia and is unable to learn anything new due to memory issues, is still 'intelligent'.
> - They're stuck approaching the limit of human intelligence.
Is general intelligence > human intelligence then? Is there some static 'human level' that I should be measuring myself against?
There is considerable overlap between the smartest bear and the dumbest human. The same is true of LLMs and humans now.
What you seem to be describing isn't AG(eneral)I, but artificial greater intelligence.
> What you seem to be describing isn't AG(eneral)I, but artificial greater intelligence.
If you ignore what I said in answer to you earlier then perhaps it would make sense to draw this conclusion. But if you take the full context of what I said then no, it's clear that I am not referring to "artificial greater intelligence".
Just in the previous comment I said that rats would qualify, because the architecture is what matters.
Your example with dementia is clever but that's an example of the biological architecture breaking down. Please forgive the crude analogy but it's like asking if a house is still a house if it's been burned down partially. I suppose part of it is still a house.
FWIW there are other definitions of intelligence that are wholly immaterial.
Spirits are considered intelligent even though they have no body because they are composed of pure non-physical consciousness. Plants are intelligent even though they also have no brain.
That fundamental sort of living conscious intelligence isn't what I see discussed much in these contexts though.
What you will notice about it though is that unlike frozen LLMs, this type of intelligence also has the capacity to change, interact, and learn from its environment.
If we go with this definition instead, then on a large enough timescale everything can be considered intelligent, even rocks.
>If we go with this definition instead
...Let's not go with the nonsense definitions then.
I agree, systems don't need a brain to be intelligent, and (on a related point:) I don't think systems need to be conscious to be 'intelligent'.
You are excluding this system (LLM + harness) that learns (separately) and can modify its surrounding environment via a shell interface (including setting up a nightly training loop to reweight itself based on its daily actions and interactions) from being intelligent. Do I have that right? Or are you thinking in terms of 'only' the LLM?
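Roughly the kind of loop I have in mind (every name here is hypothetical scaffolding; `fine_tune` is a stand-in stub, not a real API):

    # Sketch of the described system: the harness logs the day's actions
    # and interactions, then reweights the model overnight.
    import datetime
    import json
    import pathlib

    LOG = pathlib.Path("interaction_log.jsonl")

    def fine_tune(base_ckpt, examples, output):
        """Stand-in for a real training job (e.g. a LoRA pass); hypothetical."""
        ...

    def record(step: dict) -> None:
        """Append one action/observation pair from the day's work."""
        with LOG.open("a") as f:
            f.write(json.dumps(step) + "\n")

    def nightly_update(base_checkpoint: str) -> str:
        """Reweight the model on today's trajectory; return the new checkpoint."""
        examples = [json.loads(line) for line in LOG.open()]
        new_ckpt = f"agent-{datetime.date.today()}.ckpt"
        fine_tune(base_checkpoint, examples, output=new_ckpt)
        LOG.unlink()  # start tomorrow's log fresh
        return new_ckpt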
I do call openclaw-style agents "living agents", although they might be closer to a kind of zombie. Living agents like openclaw et al. do have a self-modifying property of sorts thanks to their memory, and so that system might be more AGI-ish. But, still, it has a fundamental cap to its potential, which remains frozen at the LLM.
> (including setting up a nightly training loop to reweight itself based on its daily actions and interactions) from being intelligent
I'd have a harder time arguing that sort of system isn't AGI.
My point is that learning may be required to create intelligence, but not to 'run' intelligence. And LLMs 'learn' in their training, no? That it happens at a different time doesn't truly matter.
What is doing the intelligencing though? Is it the LLM or the person training it?
To me, that seems awfully close to arguing that a puppet is intelligent because a human is pulling the strings and making it dance.
We can agree to disagree on this.
Agreed. The widespread anthropomorphizing is getting so tiring.
I blame it on the big companies in the space, but seeing intelligent folks regularly attributing intelligence to a complex autocomplete system is disappointing.
The color codes make those benchmark charts impossible to understand. Very pretty though.
For what it's worth, the bars correspond in order with the legend. Plus there’s hover text.
They're not winning any popular benchmark. Is there some niche where it excels?
Well there are benchmarks, and there is real experience, right? They are not the same.