This is written by someone who's not an AI researcher, working with tiny models on toy datasets. It's at the level of a motivated undergraduate student in their first NLP course, but not much more.
If one can easily reach parity with a motivated undergrad by leveraging LLMs, I will still consider it impressive.
While the 5-minute model will never be useful in itself, it lays the groundwork for amateurs and small groups to get into developing small models. There's another HN headline right now hyping up a tiny model that scores impressively on the ARC-AGI benchmarks, so it's clearly not a dead end to explore "household-affordable" models.
Though an approach that doesn't lean on the author's $200/month OAI sub would've been more interesting to follow.
You can also reach research parity by downloading a GitHub repository. Is that impressive too?
Downloading a file is not equivalent to having high-level, abstracted control over running software.
And if it is, then I'm a farmer because I bought potatoes from the store.
Right now AI is lifting up the floor, so if you don't know programming, Mandarin, or any other topic, it will surely do better than you. (Vibe coding goes here)
Same for tasks you know how to do but AI does faster; there is also value there. (Claude Code used by a senior goes here)
The interesting thing is when AI is lifting up the ceiling everywhere, but maybe then is when we are almost on AGI territory.
>The interesting thing is when AI is lifting up the ceiling everywhere, but maybe then is when we are almost on AGI territory.
I feel like that's the only way this insanely high bet on AI could possibly pay off - if AGI/"ceiling lifting" is achieved then maybe all this could be the trillion+ bet they expect. If it doesn't really progress much beyond where it is now, they're in trouble.
We already have a lifted ceiling.
We have several synthetic datasets and automated evaluation options for such things that were close to impossible to do before LLMs.
The title makes it sound like we have reached the singularity. The real insight here is that amateurs may have a difficult time competing with AI.
I agree, but if this year the AI can keep up with amateurs, next year who knows?
If you got married this month, you will have 12 wives in a year. Who knows?
https://xkcd.com/605/
I’ve been putting questions into LLM research functions, including Claude’s research mode, and letting them churn until a report appears.
I’ve been starting with topics where I’m already familiar with the answer but want a refresher. So far, I’m not impressed. Sometimes the info will be correct. Most of the time it strings together a lot of words from the material it finds, but it reads like an undergrad trying to paraphrase the Wikipedia page without understanding the content. Often it will have one bullet point that is completely wrong.
The other problem I’m having is that it’s not very good at identifying poor sources. This is less of a problem with topics like math and engineering, but a big problem with topics like health and medicine where it will pick up alternative medicine and pseudoscience pages and integrate them into the research as if they were real. There are a lot of health and medicine topics where the way pseudoscience people talk about a subject doesn’t match the real science, but they use the same words and therefore catch the same search terms.
An example is the way “dopamine” is used in casual conversation and by influencers in ways that aren’t accurate. Concepts like “dopamine fasting” or claiming things “raise your dopamine” aren’t scientifically accurate but use the same words nevertheless and therefore can get pulled into the training set and searches.
There are basically three types of responses you can get from an LLM/agent:
1) A response originating from LLM pre-training, in a domain where there has not been any (successful) RL-for-reasoning post-training. In this case the amount of reasoning around the raw facts "recalled" by the LLM is going to be limited by whatever reasoning is present in the training data.
2) A non-agentic response in a domain like Math Olympiad problems, where the LLM was post-trained with RL to encourage reasoning mirroring this RL training set. This type of domain-specific reasoning training seems to have little benefit for other domains (although in the early LLM days it was said that training on computer code did provide some general benefit).
3) An agentic response, such as from one of these research systems, where it seems the agent is following some sort of generic research/summarization template with prescribed steps. I've never tried these myself, but it seems they can be quite successful at deep diving and gathering relevant source material; the ability to reason over this retrieved material, though, is going to come down to the reasoning capability of the underlying model per 1) and 2) above.
The bottom line would seem to be that with today's systems, domain-specific reasoning capability largely comes down to RL post-training for reasoning in that specific domain, resulting in what some call "jagged" performance - excellent in some areas and very poor in others. Demis Hassabis, for one, seems to be saying that this will not be fixed until architectural changes/additions are made to bring us closer to AGI.
Claude's research mode is by far the worst one I've used; I consider it nearly useless. I cannot trust it specifically because Anthropic has a policy of refusing to use Reddit for anything, whether as a research source or in Claude for Chrome.
Reddit may not be the greatest source for hard science, but for things like "tell me what shoes people are finding helpful for their plantar fasciitis" I appreciate Reddit's anecdata over every other source.
> I cannot trust it specifically because Anthropic has a policy of refusing to use Reddit for anything, whether as a research source or in Claude for Chrome.
Reddit wants money for its users' data. Is the reason simply that Anthropic doesn't want to pay Reddit's shareholders for it?
Also Sam Altman owns quite a lot of Reddit stock and was briefly the CEO, so it's not inconceivable he's influenced them not to cooperate with one of his chief rivals.
AFAIK it’s Google that has signed a license deal with Reddit.
I love using AI to set up projects in 5 minutes, but I hate developing these projects with AI because it inevitably runs into a wall and I need to guide it and fix its code.
I suppose in this case it picked up an existing project and DIDN’T walk off a cliff? Were the mutations really small?
The worst to me is trying to use someone else's AI code. This guy, for example, has thousands of lines of undocumented code (there are a few docstrings, but not much more). I'm not really motivated to go through that, and you can't necessarily trust AI to go through and document it correctly either. At least if I generated it, I was probably around for enough of the process to know how most of it works.
That's weird. I thought LLMs loved over-explaining their code?
I’ve had the exact same experience. I’ve been vibe coding most of my research now; previously I was an MLE handcrafting model code.
A lot of negative comments on here, which seems to always be the case with HN and vibe coding. The reality is that it’s actually starting to work, quite well.
> I’ve had the exact same experience. I’ve been vibe coding most of my research now; previously I was an MLE handcrafting model code.
What happened to the MLE? Are they all going to end up that way?
> A lot of negative comments on here, which seems to always be the case with HN and vibe coding. The reality is that it’s actually starting to work, quite well.
It's hard to be positive about the idea of your skills getting devalued and getting kicked to the curb.
>It's hard to be positive about the idea of your skills getting devalued and getting kicked to the curb.
I think it depends on where people build their own identities in the value stream. Do you see yourself as a product/hacker type for whom writing code is just a blocker on delivering your vision? Building greenfield prototypes is now 100x easier! Do you see yourself as a craftsperson who brings years of experience to hard technical challenges? Some folks see AI as an attack, and some see it as a way to remove some drudgery while they focus on harder problems. It is about mindset.
What skills do you value?
>> It's hard to be positive about the idea of your skills getting devalued and getting kicked to the curb.
> I think it depends on where people build their own identities in the value stream. Do you see yourself as.... It is about mindset.
No. You're missing the point in a pretty serious way. In large part, we're talking about lower levels on the hierarchy of needs than that.
> Some folks see AI as an attack, and some see it as a way to remove some drudgery while they focus on harder problems.
Some folks see it as a way to remove hard problems so they can focus on drudgery. Do you love to code, but hate code reviews? Guess what you get to do more of now!
> Building greenfield prototypes is now 100x easier!
And then the boss-man can use his MBA, fire 90% of the team, pocket their wages, and get 10x the prototypes. Repeat that through the economy. Your teammates have 20 years to retirement and can't pay their mortgage. "Progress!," but for whom?
I suppose you would need to weigh the utility points lost from people no longer enjoying their jobs against the utility points gained by the consumers of products and services.
My intuition is that with the law of leverage in mind, the former would be relatively low and the latter would be relatively high.
It is of course up to the government and culture to minimize the former and maximize the latter.
So why is "distilling from N-gram" better, why does it make the transformer learn English faster?
Hypothesis: it's the standard "teacher-student" or "distillation" trick - if you're learning next-token-prediction, you only learn what the correct answer is (i.e. the spike in probability), but when you're distilling from a teacher model, you learn the entire distribution of potential answers.
Curious, can anyone more experienced in AI research comment on this?
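For concreteness, here is a minimal sketch of that difference, using made-up tensors in a generic PyTorch-style setup (not the author's actual training code): plain next-token prediction only gets signal from the single correct token, while a distillation loss asks the student to match the teacher's full next-token distribution.

```python
# Sketch only: hard-label next-token loss vs. distilling from a teacher's distribution.
# Shapes and the teacher are stand-ins; a real setup would use an n-gram model's probabilities.
import torch
import torch.nn.functional as F

vocab_size = 1000
student_logits = torch.randn(8, vocab_size)          # student predictions for 8 positions
hard_targets = torch.randint(0, vocab_size, (8,))    # the single "correct" next token per position
teacher_logits = torch.randn(8, vocab_size)          # stand-in for the teacher's scores

# Plain next-token prediction: only the one correct token carries signal.
hard_loss = F.cross_entropy(student_logits, hard_targets)

# Distillation: match the teacher's whole distribution over possible next tokens.
T = 2.0  # temperature softens both distributions
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)

# A common recipe mixes the two signals.
loss = 0.5 * hard_loss + 0.5 * soft_loss
print(loss.item())
```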
> OpenAI has released GPT-5-codex, and supposedly uses it ... to automate a lot of their ... AI research
If I were the owner of an AI company that was forever trying to juice its valuation and raise money, you can bet I'd be telling people I had built a magic self-improving AI.
Nah, I think the more obvious thing is that most of the code they’re writing (by sheer volume) is the same stuff most programmers write (code that does telemetry, APIs, usage models, billing, product features, React widgets, etc.), stuff that the AI coding models do really well.
And even with pure model development, making incremental changes to try different strategies in notebooks etc. is probably not that hard to write when given clear instructions by a data scientist. (I’m not saying these disciplines are easy, I’m saying that a data scientist could more easily describe what they want.)
Or the devops stuff. Or the RL UIs.
All that stuff is run of the mill software in service of building the models. And it can be vibe coded.
I agree in general, but I'm not sure I'd call that research. Unless of course I were the owner of an AI company, etc. etc.
It's like reading a self-written obituary of someone who stopped thinking for themselves.
They aren't excited about anything. They aren't in awe. They haven't done any hard work. They're just here to ooze lukewarm sludge.
Now do the same for people who buy their food at the grocery store instead of growing it themselves like our ancestors did.
OK, but I probably don't want to read a blog post about a dramatic rendering of your grocery store run. Everyone has their $200 a month tale to tell these days, and many if not most of those people have little to say about anything else.
While the headline is correct, it speaks more to this particular human's expertise than to the AI's.
Ah, another weekly AI booster article from the AI booster gang.
Hate to break it to you, but if GPT-5 is a better AI researcher than you, you were probably not that good to begin with.
Does Codex tell you why 95% of AI projects in the enterprise fail? Or why the only study to date on the merits of AI for coding shows a 19% decrease in productivity?
The author wasn't doing "AI research" before and neither was GPT5. This is not at the frontier of anything, it is just an already solved problem in training that GPT5 found. Had the author been willing to actually do a Google and GitHub search, or just twiddle the training knob parameters enough on their own, they would have found a better solution than working alone.
Also this footnote:
> “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tool
I have no words. I wonder if this "AI researcher" can make it through the original Attention Is All You Need paper without an LLM.
I don't think he claims to be an "AI researcher". His CV has:
>I built significant pieces of the Copilot onboarding, purchasing, billing and settings flow. For eight months I headed up the Copilot anti-abuse effort. I then led the launch of GitHub Models, and am now working on other Copilot projects.
As an aside I had a look at GitHub Models and it was quite interesting - you can try the API for a number of models for free using your GitHub login.
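For anyone curious, a rough sketch of what trying it looks like: GitHub Models exposes an OpenAI-compatible API and authenticates with a GitHub personal access token. The base URL and model id below are assumptions from memory, so check the GitHub Models docs for current values.

```python
# Hedged sketch: calling GitHub Models through its OpenAI-compatible API.
# base_url and model id are assumptions; consult the GitHub Models docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",  # assumed GitHub Models endpoint
    api_key=os.environ["GITHUB_TOKEN"],                # a GitHub personal access token
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model id from the catalog
    messages=[{"role": "user", "content": "Summarize what GitHub Models is in one sentence."}],
)
print(response.choices[0].message.content)
```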
The article is titled "GPT-5-Codex is a better AI researcher than me", so I think it's easy to think he's referring to himself as an AI researcher. He does say "I don’t have any illusions about this making me a real AI researcher", but it's the very end.