It never was "just predicting the next word", in the sense that the phrase was always a reductive description of artifacts that are plainly more than it implies.
And also, they are still "just predicting the next word", literally in terms of how they function and are trained. And there are still cases where it's useful to remember this.
I'm thinking specifically of chat psychosis, where people go down a rabbit hole with these things, thinking they're gaining deep insights because they don't understand the nature of the thing they're interacting with.
They're interacting with something that does really good, but fallible, autocomplete based on three major inputs:
1) They are predicting the next word based on the pre-training data (largely internet text), which makes them fairly useful for general knowledge.
2) They are predicting the next word based on RL/post-training data, which is what lets them produce conversational responses rather than autocomplete-style completions: they are autocompleting conversational data. This also makes them extremely obsequious and agreeable, inclined to go along with whatever you give them and to mimic it.
3) They are autocompleting the conversation based on your own inputs and the entire history of the conversation. This, combined with 2), means you are, to a large extent, talking to yourself, or rather to something that is very adept at mimicking and going along with your inputs.
Who, or what, are you talking to when you interact with these? Something that predicts the next word, with varying accuracy, based on a corpus of general knowledge plus a corpus of agreeable question-and-answer exchanges plus yourself. The general knowledge is great as long as it's fairly accurate; the sycophantic mirror of yourself sucks.
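A minimal sketch of what "autocompleting the conversation" means mechanically; `predict_next_word` and the `<end-of-turn>` marker are hypothetical stand-ins for whatever the real model and stop condition are. The point is only that the model's input is the whole transcript, your own words included:

    def chat_turn(predict_next_word, transcript, user_message):
        # The model never sees "a question"; it sees the transcript so far,
        # your message included, and extends it one word at a time.
        transcript += "\nUser: " + user_message + "\nAssistant:"
        while True:
            word = predict_next_word(transcript)  # conditioned on everything above, including you
            if word == "<end-of-turn>":
                return transcript
            transcript += " " + word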
I always hear "AI solved a crazy math problem that no math team could solve" and wonder why it can never solve the math problems I need it to solve, when those should be easily doable by any high school student.
Because what it really means is "we directed the AI, it translated our ideas into Lean, the Lean tool then acted as an oracle determining whether anything was incorrect, doing literally all the hard work of correctness, and this process looped until the prompter gave up or Lean sent back an all clear".
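A schematic of that workflow as I read it; `llm_translate` and `lean_check` are hypothetical stand-ins, not real interfaces, and the point is only that the proof checker, not the model, is the arbiter of correctness:

    def attempt_proof(llm_translate, lean_check, problem, max_rounds=100):
        feedback = None
        for _ in range(max_rounds):
            candidate = llm_translate(problem, feedback)  # model drafts Lean code
            ok, feedback = lean_check(candidate)          # Lean verifies it, or reports errors
            if ok:
                return candidate                          # the "all clear" from the oracle
        return None                                       # or the prompter gives up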
I've not seen a lot of issues in math I've given LLMs. Maybe it's some artefact of your prompt?
I somewhat take issue with the second math example (the geometry problem); that is routinely solvable by computer algebra systems, and being able to translate the problem into inputs, hit run, and transcribe the proof back into English prose (which for all we know is what it did, since OpenAI and Google have confirmed their entrants received these tools, which human candidates did not) is not as astonishing as the blog post makes it out to be.
* LLMs, not AIs. AI more broadly has never been only about predicting the next word.
A few things I think are useful to emphasize on the training side, beyond what the article says:
1. Pre-training nowadays does not just use the next word/token as a training signal, but also the next N tokens, because that appears to teach the model more generalized semantics and to bias it towards 'thinking ahead' behaviors (give me some rope here, I don't remember precisely how it should be articulated). A rough sketch of this kind of loss follows after this list.
2. Regularizers during training, namely weight decay and (to a lesser extent) diversity. These do far more heavy lifting than their simplicity gives them credit for; they are the difference between memorizing entire paragraphs from a book and only taking away the core concepts.
3. Expert performance at non-knowledge tasks is mostly driven by RL and/or SFT over 'high quality' transcripts. The former cannot be described as 'predicting the next word', at least in terms of learning signal.
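On point 1, a minimal sketch of what a multi-token prediction objective can look like, assuming a PyTorch-style setup; `backbone`, `heads`, and the shapes are illustrative stand-ins, not any particular lab's recipe:

    import torch.nn.functional as F

    def multi_token_loss(backbone, heads, tokens, n_future=4):
        # tokens: (batch, seq_len) integer ids; heads[k-1] predicts the token
        # k positions ahead from the same hidden state.
        hidden = backbone(tokens)                         # (batch, seq_len, d_model)
        total = 0.0
        for k, head in enumerate(heads[:n_future], start=1):
            logits = head(hidden[:, :-k, :])              # state at position t -> token t+k
            targets = tokens[:, k:]                       # the token k steps ahead
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return total / n_future                           # the k=1 term alone is the usual next-token loss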
It's generating an approximation of what it was trained on. Call it whatever you want, just not AGI or the road to AGI.
I mean, the article pretty much confirms that AI is basically just predicting the next word.
It works well and can be used for a lot of things, but still.
6 levels of algorithm that people confuse:
1. The Model Architecture: the calculation of outputs from inputs.
2. The Training Algorithm, which alters parameters in the architecture based on training data, typically comparing outputs against targets, but it can be more complex than that. E.g. gradient descent.
3. The Class of Problem being solved, e.g. approximation, prediction, etc.
4. The Instance of the Problem being solved, e.g. approximation of chemical reaction completion vs. temperature, or prediction of textual responses.
5. The Data Embodiment of the problem, i.e. the actual data. How much, how complex, how general, how noisy, how accurate, how variable, how biased, ...?
And only after all those,
6. The Learned Algorithm that emerges from continual exposure to the data (5), framed as the problem class (3), in order to perform the specific problem instance (4), by applying the training algorithm (2) to the model's parameters that control its input-output algorithm (1).
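A minimal sketch of how (1), (2), and (6) come apart in code, using generic PyTorch as an illustration; the tiny model and loss here are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                               # (1) the architecture: outputs from inputs
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)     # (2) the training algorithm: gradient descent

    def train_step(x, y):                                  # (3)/(4): the problem as posed, here a
        loss = ((model(x) - y) ** 2).mean()                # regression against whatever (5), the data, contains
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # (6) is not written down anywhere above: it is whatever function the trained
    # weights end up computing after enough exposure to the data (5).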
The latter, (6), has no limit on the complexity, quality, or variety of sub-problems that must also be solved in order to solve the umbrella problem successfully.
Data can be unbounded in complexity. Therefore, actual (successful) solutions are necessarily unbounded in complexity.
The "no limit, unbounded, any kind" of sub-problem part of (6) is missed by many people. To perform accurate predictions, of say the whole stock market, would require a model to learn everything from economic theory, geopolitics, human psychology, natural resources and their extraction, crime, electronic information systems and their optimizations, game theory, ...
That isn't a model I would call "just a stock price predictor".
Human language is an artifact created by complex beings. To mimic general written artifacts exchanged between people with any resemblance at all, a model needs a high level of understanding of how those complex beings operate: conversation, writing, speeches, legal theory, their knowledge of thousands of topics, psychologies, cultures, assumptions, motives, lifetime development, their modeling of each other, ... on and on.
LLMs, from the first point at which they were useful, were never "just" prediction machines.
I am still astonished there were ever technical people saying such a thing.
For or against, I don't know why the "just predicting" or "stochastic parrots" criticism was ever insightful. People make one word after another and frequently repeat phrases they heard elsewhere. It's kind of like criticizing a calculator for making one digit after another.
It isn’t a criticism; it’s a description of what the technology is.
In contrast, human thinking doesn’t involve picking a word at a time based on the words that came before. The mechanics of language can work that way at times - we select common phrasings because we know they work grammatically and are understood by others, and it’s easy. But we do our thinking in a pre-language space and then search for the words that express our thoughts.
I think kids in school ought to be made to use small, primitive LLMs so they can form an accurate mental model of what the tech does. Big frontier models do exactly the same thing, only more convincingly.
Fine, but that would at least teach them that LLMs are doing a lot more than "predicting the next word", given that they can also be shown that a Markov model can do that in about 10 lines of simple Python, with no neural nets or any other AI/ML technology; see the sketch below.
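For what it's worth, a word-level Markov chain "next word predictor" really is about that size; the toy corpus below is obviously just an illustration:

    import random
    from collections import defaultdict

    def build_chain(text):
        chain = defaultdict(list)
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            chain[prev].append(nxt)              # record every word seen after `prev`
        return chain

    def generate(chain, start, n=20):
        out = [start]
        for _ in range(n):
            options = chain.get(out[-1])
            if not options:
                break
            out.append(random.choice(options))   # "predict" by sampling what followed before
        return " ".join(out)

    corpus = "the cat sat on the mat and the dog sat on the rug"
    print(generate(build_chain(corpus), "the"))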
> In contrast, human thinking doesn’t involve picking a word at a time based on the words that came before
Do we have science that demonstrates humans don't autoregressively emit words? (Genuinely curious / uninformed).
From the outset, it's not obvious that auto-regression through the state space of actions (i.e. what LLMs do when yeeting tokens) is the thing that distinguishes them from humans. Though I guess we can distinguish LLMs from other models like diffusion/HRM/TRM that explicitly refine their output rather than commit to a choice and then run `continue;`.
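Schematically, the contrast I mean is committing token by token versus revising a whole draft; `model.predict_next`, `model.init_draft`, and `model.refine` are hypothetical stand-ins, not real APIs:

    def autoregressive_decode(model, prompt, steps=50):
        tokens = list(prompt)
        for _ in range(steps):
            tokens.append(model.predict_next(tokens))   # commit to one choice, then continue from it
        return tokens

    def refinement_decode(model, prompt, length=50, rounds=10):
        draft = model.init_draft(prompt, length)        # start from a full (noisy) draft
        for _ in range(rounds):
            draft = model.refine(draft)                 # revise the whole output each round
        return draft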
Have you ever had a concept you wanted to express, known that there was a word for it, but struggled to remember what the word was? If human thought and speech work that way, they must be fundamentally different from what an LLM does. The concept, the "thought", is separate from the word.
Analogies are all messy here, but I would compare the values of the residual stream to what you are describing as thought.
We force this residual stream to project to the logprobs of all tokens, just as a human in the act of speaking a sentence is forced to produce words. But could this residual stream represent thoughts which don't map to words?
It's plausible; we already have evidence that glitch-token representations trend towards the centroid of the high-dimensional latent space, and that logprobs for tokens representing wildly branching trajectories in output space (e.g. "but" vs. "exactly" for specific questions) encode a kind of cautious uncertainty.
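As a minimal illustration of the projection step, with made-up shapes standing in for a real transformer's hidden size and vocabulary:

    import torch
    import torch.nn.functional as F

    d_model, vocab = 768, 50000
    residual = torch.randn(d_model)            # the residual stream at the final position
    W_unembed = torch.randn(vocab, d_model)    # the unembedding / output projection

    logits = W_unembed @ residual              # force the "thought" onto the token axis
    logprobs = F.log_softmax(logits, dim=-1)   # log-probabilities over every token
    # Any component of `residual` lying in the null space of W_unembed contributes
    # nothing to the logprobs, which is one sense in which the stream could carry
    # "thoughts which don't map to words".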
> In contrast, human thinking doesn’t involve picking a word at a time based on the words that came before.
More to the point, human thinking isn't just outputting text by following an algorithm. Humans understand what each of those words actually means, what they represent, and what it means when those words are put together in a given order. An LLM can regurgitate the Wikipedia article on a plum. A human actually knows what a plum is and what it tastes like. That's why humans know that glue isn't a pizza topping and AI doesn't.
> That's why humans know that glue isn't a pizza topping and AI doesn't.
It's the opposite. That came from a Google AI summary that quoted a Reddit post, which was written by a human.
This article needs to be put through a summarizer
It's getting way better, and we have to acknowledge how far we've come in the last 4 years. Interestingly, one of the key examples of this is within VS Code: the AI is able to predict the next word not based only on the generic training data, but in the context of the repo (while you are manually editing).