101 comments

  • Sohcahtoa82 13 hours ago ago

    A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material. There just isn't enough variation in sentences.

    But then, Markov Chains fall apart when the source material is very large. Try training a chain based on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of turning into just simple regurgitation.

    In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov Chains are discrete in their next token prediction while an LLM is more fuzzy. LLMs have latent space where the meaning of a word basically exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov Chains will ONLY generate sequences that existed in the source.

    This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

    • thfuran 12 hours ago ago

      >Markov Chains will ONLY generate sequences that existed in the source.

      A markov chain of order N will only generate sequences of length N+1 that were in the training corpus, but it is likely to generate sequences of length N+2 that weren't (unless N was too large for the training corpus and it's degenerate).

      • Sohcahtoa82 12 hours ago ago

        Well yeah, but N+2 but the generation of the +2 loses the first part of N.

        If you use a context window of 2, then yes, you might know that word C can follow words A and B, and D can follow words B and C, and therefore generate ABCD even if ABCD never existed.

        But it could be that ABCD is incoherent.

        For example, if A = whales, B = are, C = mammals, D = reptiles.

        "Whales are mammals" is fine, "are mammals reptiles" is fine, but "Whales are mammals reptiles" is incoherent.

        The longer you allow the chain to get, the more incoherent it becomes.

        "Whales are mammals that are reptiles that are vegetables too".

        Any 3-word fragment of that sentence is fine. But put it together, and it's an incoherent mess.

        • Y_Y 8 hours ago ago

          That are reptiles!

      • Isamu 12 hours ago ago

        Right, you can generate long sentences from a first-order markov model, and all of the transitions from one word to the next be in the training set but the full generated sentence may not.

      • Jensson 4 hours ago ago

        > A markov chain of order N will only generate sequences of length N+1 that were in the training corpus

        Depends on how you trained it, an LLM is also a markov chain.

    • johnisgood 12 hours ago ago

      > The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

      I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".

      I am quite confused right now. Could you please help me with this?

      Somewhat related: I like the work of David Hume, and he explains it quite well how we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them together. We know how dragons typically look like, and we know how a pig looks like, and so, we can imagine (through our creativity and combination of these two ideas) how a pig with a dragon head would look like. I wonder how this applies to LLMs, if they even apply.

      Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?

      • jldugger 12 hours ago ago

        Well, there's kind of two answers here:

        1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The temperature settings commonly exposed control how often the system takes the most likely candidate tokens.

        2. A markov chain model will literally have a matrix entry for every possible combination of inputs. So a 2 degree chain will have n^2 weights, where N is the number of possible tokens. In that situation "pig with" can never be completed with a brand new sentence, because those have literal 0's in the probability. In contrast, transformers consider huge context windows, and start with random weights in huge neural network matrices. What people hope happens is that the NN begins to represent ideas, and connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.

        • thesz 2 hours ago ago

            > A markov chain model will literally have a matrix entry for every possible combination of inputs.
          
          The less frequent prefixes are usually pruned away and there is a penalty score to add to go to the shorter prefix. In the end, all words are included into the model's prediction and typical n-gram SRILM model is able to generate "the pig with dragon head," also with small probability.

          Even if you think about Markov Chain information as a tensor (not matrix), the computation of probabilities is not a single lookup, but a series of folds.

        • theGnuMe 5 hours ago ago

          You can have small epsilons instead of zeros.

          • ruined 2 hours ago ago

            what, for all possible words?

            • 3eb7988a1663 2 hours ago ago

              Instead of a naive dense matrix, you can use some implementation that allows sparsity. If element does not exist, gets a non-zero value which can still be sampled. Which theoretically enables all outputs.

        • otabdeveloper4 an hour ago ago

          Re point 1: no, "temperature" is not an inherent property of LLM's.

          The big cloud providers use the "temperature" setting because having the assistant repeat to you the exact same output sequence exposes the man behind the curtain and breaks suspension of disbelief.

          But if you run the LLM yourself and you want the best quality output, then turning off "temperature" entirely makes sense. That's what I do.

          (The downside is that the LLM can then, rarely, get stuck in infinite loops. Again, this isn't a big deal unless you really want to persist with the delusion that an LLM is a human-like assistant.)

      • pama 11 hours ago ago

        LLMs have the ability to learn certain classes of algorithms from their datasets in order to reduce errors when compressing their pretraining data. If you are technically inclined, read the reference: https://arxiv.org/abs/2208.01066 (optionally followup work) to see how llms can pick up complicated algorithms from training on examples that could have been generated by such algorithms (in one of the cases the LLM is better than anything we know; in the rest it is simply just as good as our best algos). Learning such functions from data would not work with Markov chains at any level of training. The LLMs in this study are tiny. They are not really learning a language, but rather how to perform regression.

        • thesz an hour ago ago

          Transformers are performing (soft, continuous) beam search inside them, the width of beam being not bigger than number of k-v pairs in attention mechanism.

          In my experience, having a Markov Chain to be equipped with the beam search greatly improve MC's predictive power, even if Markov Chain is ARPA 3-gram model, heavily pruned.

          What is more, Markov Chains are not restricted to immediate prefixes, you can use skip grams as well. How to use them and how to mix them into a list of probabilities is shown in the paper on Sparse Non-negative Matrix Language Modeling [1].

          [1] https://aclanthology.org/Q16-1024/

          I think I should look into that link of yours later. Have slimmed over it, I should say it... smells interesting at some places. For one example, decision trees learning is performed with greedy algorithm which, I believe, does not use oblique splits whereas transformers inherently learn oblique splits.

      • koliber 12 hours ago ago

        Here's how I see it, but I'm not sure how valid my mental model is.

        Imagine a source corpus that consists of:

        Cows are big. Big animals are happy. Some other big animals include pigs, horses, and whales.

        A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".

        An LLM can get a sense of meaning in these words and can return ideas expressed in the input corpus. So in this case it might say "Pigs and horses are happy". It's not limited to responding with verbatim sequences. It can be seen as a bit more creative.

        However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.

        • vidarh 11 hours ago ago

          > However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.

          There's absolutely no evidence to support this claim. It'd require humans to exceed the Turing computable, and we have no evidence that is possible.

          • somenameforme an hour ago ago

            Turing computability is tangential to his claim, as LLMs are obviously not carrying out the extent of all computable concepts. His claim trivially be proven by considering the history of humanity. We went from a starting point of having literally no language whatsoever, and technology that would not have expanded much beyond an understanding of 'poke him with the pointy side'. And from there we would go on to discover the secrets of the atom, put a man on the Moon, and more. To say nothing of inventing language itself.

            An LLM trained on this starting state of humanity obviously is never going to do anything except remix basically nothing. It's obviously never going to discover the secrets of the atom, or how to put a man on the Moon. Now whether any artificial device could achieve what humans did is where the question of computability comes into play, and that's a much more interesting one. But if we limit ourselves to LLMs, then this is very straight forward to answer.

          • koliber 11 hours ago ago

            If you tell me that trees are big, and trees are made of hard wood, I as a human am capable of asking whether trees feel pain. I don't think what you said is false and I am not familiar with computational theory to be able to debate it. People occasionally have novel creative insights that do not derive from past experience or knowledge, and that is what I think of when I think of creativity.

            Humans created novel concepts like writing literally out of thin air. I like how the book "Guns, Steels, and Germs" describes that novel creative process and contrasts it via a disseminative derivation process.

            • vidarh 7 hours ago ago

              > People occasionally have novel creative insights that do not derive from past experience or knowledge, and that is what I think of when I think of creativity.

              If they are not derived from past experience or knowledge, then unless humans exceed the Turing computable, they would need to be the result of randomness in one form or other. There's absolutely no reason why an LLM can not do that. The only reason a far "dumber" pure random number generator based string generator "can't" do that is because it would take too long to chance on something coherent, but it most certainly would keep spitting out novel things. The only difference is how coherent the novel things are.

              • Jensson 4 hours ago ago

                > If they are not derived from past experience or knowledge

                Every animal is born with intuition, you missed that part.

            • c22 6 hours ago ago

              Wouldn't this insight derive from many past experiences of feeling pain yourself and the knowledge that others feel it too?

        • marcellus23 12 hours ago ago

          > A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".

          Just for my own edification, do you mean "Are big animals are happy"? "animals happy" never shows up in the source text so "happy" would not be a possible successor to "animals", correct?

        • fragmede 10 hours ago ago

          > However, LLMs will not be able to represent ideas that it has not encountered before.

          Sure they do. We call them hallucinations and complain that they're not true, however.

          • koliber 9 hours ago ago

            Hmmm. Didn't think about that.

            In people there is a difference between unconscious hallucinations vs. intentional creativity. However, there might be situations where they're not distinguishable. In LLMs, it's hard to talk about intentionality.

            I love where you took this.

            • gishh 7 hours ago ago

              A hallucination isn’t a creative new idea, it’s blatantly wrong information, provably.

              If an LLM had actual intellectual ability it could tell “us” how we can improve models. They can’t. They’re literally defined by their token count and they use statistics to generate token chains.

              They’re as creative as the most statistically relevant token chains they’ve been trained on by _people_ who actually used intelligence to type words on a keyboard.

      • hugkdlief 10 hours ago ago

        > we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them together.

        Funny choice of combination, pig and dragon, since Leonardo Da Vinci famously imagined dragons themselves by combining lizards and cats: https://i.pinimg.com/originals/03/59/ee/0359ee84595586206be6...

        • johnisgood 10 hours ago ago

          Hah, interesting. Pig and dragon just sort of came to mind as I was writing the comment. :D But we can pretty much imagine anything, can't we? :)

          I should totally try to generate images using AI with some of these prompts!

      • umanwizard 12 hours ago ago

        > I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".

        People who claim this usually don’t bother to precisely (mathematically) define what they actually mean by those terms, so I doubt you will get a straight answer.

        • DaiPlusPlus 10 minutes ago ago

          How can anyone "mathematically" define "revolutionary"?

      • andoando 10 hours ago ago

        That little quip from Hume has influenced my thinking so much Im happy to see it again

      • Sohcahtoa82 12 hours ago ago

        > I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".

        LLMs can absolutely create things that are creative, at least for some definition of "creative".

        For example, I can ask an LLM to create a speech about cross-site scripting the style of Donald Trump:

        > Okay, folks, we're talking about Cross-Site Scripting, alright? I have to say, it's a bit confusing, but let's try to understand it. They call it XSS, which is a fancy term. I don't really know what it means, but I hear it's a big deal in the tech world. People are talking about it, a lot of people, very smart people. So, Cross-Site Scripting. It's got the word "scripting" in it, which sounds like it's about writing, maybe like a script for a movie or something. But it's on the internet, on these websites, okay? And apparently, it's not good. I don't know exactly why, but it's not good. Bad things happen, they tell me. Maybe it makes the website look different, I don't know. Maybe it makes things pop up where they shouldn't. Could be anything! But here's what I do know. We need to do something about it. We need to get the best people, the smartest people, to look into it. We'll figure it out, folks. We'll make our websites safe, and we'll do it better than anyone else. Trust me, it'll be tremendous. Thank you.

        Certainly there's no text out there that contains a speech about XSS from Trump. There's some snippets here and there that likely sound like Trump, but a Markov Chain simply is incapable of producing anything like this.

        • 0cf8612b2e1e 9 hours ago ago

          Sure that specific text does not exist, but the discrete tokens that went into it would have been.

          If you similarly trained a Markov chain at the token level on a LLM sized corpus, it could make the same. Lacking an attention mechanism, the token probabilities would be terribly non constructive for the effort, but it is not impossible.

          • Sohcahtoa82 8 hours ago ago

            Let's assume three things here:

            1. The corpus contains every Trump speech.

            2. The corpus contains everything ever written about XSS.

            3. The corpus does NOT contain Trump talking about XSS, nor really anything that puts "Trump" and "XSS" within the same page.

            A Markov Chain could not produce a speech about XSS in the style of Trump. The greatest tuning factor for a Markov Chain is the context length. A short length (like 2-4 words) produces incoherent results because it only looks at the last 2-4 words when predicting the next word. This means if you prompted the chain with "Create a speech about cross-site scripting the style of Donald Trump", then even with a 4-word context, all the model processes is "style of Donald Trump". But the time it reached the end of the prompt, it's already forgotten the beginning of it.

            If you increase the context to 15, then the chain would produce nothing because "Create a speech about cross-site scripting in the style of Donald Trump" has never appeared in its corpus, so there's no data for what to generate next.

            The matching in a Markov Chain is discrete. It's purely a mapping of (series of tokens) -> (list of possible next tokens). If you pass in a series of tokens that was never seen in the training set, then the list of possible next tokens is an empty set.

            • 0cf8612b2e1e 6 hours ago ago

              At the token, not word level, it would be possible for a Markov chain. It never has to know about Trump or XSS, only that it sees tokens like “ing”, “ed”, “is”, and so forth. Given a LLM size corpus, which will have ~all token-to-token pairs with some non-zero frequency, the above could be generated.

              The actual probabilities will be terrible, but it is not impossible.

        • johnisgood 12 hours ago ago

          Oh, of course, what I want answered did not have much to do with Markov Chain, but LLMs, because I saw this argument often against LLMs.

      • astrange 7 hours ago ago

        > Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?

        That is not true and those people are dumb. You may be on Bluesky too much.

        If your training data is a bunch of integer additions and you lossily compress this into a model which rediscovers integer addition, it can now add other numbers. Was that in the training data?

        • spookie 2 hours ago ago

          It was in the training data. There is implicit information in the way you present each addition. The context provided in the training data is what allows relationships to be perceived and modelled.

          If you don't have that in your data you don't have the results.

      • franciscator 11 hours ago ago

        Creativity need to be better defined. And the rest is a learning problem. If you keep on training, learning what you see ...

      • dboreham 8 hours ago ago

        > LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus

        That's not true. Or at least it's only a true as for a human that read all the books in the world. That human only has seen that training data. But somehow it can come up with the Higgs Boson, or whatever.

        • coderatlarge 7 hours ago ago

          well the people who did the Higgs boson theory worked and re-worked for years all the prior work about elementary particles and arguably did a bunch of re-mixing of all the previous “there might be a new elementary particle here!” work until they hit on something that convinced enough peers that it could be validated in a real-world experiment.

          by which i mean to say that it doesn’t seem completely implausible that an llm could generate the first tentative papers in that general direction. perhaps one could go back and compute the likelihood of the first papers on the boson given only the corpus to date before it as researchers seem to be trying to do with the special relativity paper which is viewed as a big break with physics beforehand.

      • godelski 11 hours ago ago

          > I have seen the argument that LLMs can only give you what its been trained
        
        There's confusing terminology here and without clarification people talk past one another.

        "What its been trained on" is a distribution. It can produce things from that distribution and only things from that distribution. If you train on multiple distributions, you get the union of the distribution, making a distribution.

        This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind bogglingly impressive machine!)

        A distribution is more than its samples. It is the things between too. Does the LLM perfectly capture the distribution? Of course not. But it's a compression machine so it compresses the distribution. Again, different from compressing the samples, like one does with a zip file.

        So distributionally, can it produce anything novel? No, of course not. How could it? It's not magic. But sample wise can it produce novel things? Absolutely!! It would be an incredibly unimpressive machine if it couldn't and it's pretty trivial to prove that it can do this. Hallucinations are good indications that this happens but it's impossible to do on anything but small LLMs since you can't prove any given output isn't in the samples it was trained on (they're just trained on too much data).

          > people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?
        
        Up until very recently most LLMs have struggled with the prompt

          Solve:
          5.9 = x + 5.11
        
        This is certainly in their training distribution and has been for years, so I wouldn't even conclude that they can solve problems "in their training data". But that's why I said it's not a perfect model of the distribution.

          > a pig with a dragon head
        
        One needs to be quite careful with examples as you'll have to make the unverifiable assumption that such a sample does not exist in the training data. With the size of training data this is effectively unverifiable.

        But I would also argue that humans can do more than that. Yes, we can combine concepts, but this is a lower level of intelligence that is not unique to humans. A variation of this is applying a skill from one domain into another. You might see how that's pretty critical to most animals survival. But humans, we created things that are entirely outside nature require things outside a highly sophisticated cut and paste operation. Language, music, mathematics, and so much more are beyond that. We could be daft and claim music is simply cut and paste of songs which can all naturally be reproduced but that will never explain away the feelings or emotion that it produces. Or how we formulated the sounds in our heads long before giving them voice. There is rich depth to our experiences if you look. But doing that is odd and easily dismissed as our own familiarity deceives us into our lack of.

        • XenophileJKO 3 hours ago ago

          The limit of a LLM "distribution" effectively is actually only at the token level though once the model has consumed enough language. Which is why those out of distribution tokens are so problematic.

          From that point on the model can infer linguistics even on purely encountered words, concepts. I would even propose in context inferred meaning based on context, just like you would do.

          It builds conceptual abstractions of MANY levels and all interrelated.

          So imagine giving it a task like "design a car for a penguin to drive". The LLM can infer what kinda of input does a car need, what anatomy does a penguin have and it can wire it up descriptively. It is an easy task for an LLM. When you think about the other capabilities like introspection, and external state through observation (any external input), there really are not many fundamental limits on what they can do.

          (Ignore image generation, it is an important distinction on how an image is made, end to end sequence vs. pure diffusion vs. hybrid.)

        • astrange 7 hours ago ago

          > This is entirely different from saying it can only reproduce samples which it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind bogglingly impressive machine!)

          You could create one of those using both a Markov chain and an LLM.

          https://arxiv.org/abs/2401.17377

      • thaumasiotes 12 hours ago ago

        >> The fact that they only generate sequences that existed in the source

        > I am quite confused right now. Could you please help me with this?

        This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.

        • Sohcahtoa82 12 hours ago ago

          I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.

          • thaumasiotes 12 hours ago ago

            If you still think there's something left to explain, I recommend you read your other responses. Being restricted to the training data is not a property of Markov output. You'd have to be very, very badly confused to think that it was. (And it should be noted that a Markov chain itself doesn't contain any training data, as is also true of an LLM.)

            More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.

            • johnisgood 12 hours ago ago

              He said LLMs are creative, yet people have been telling me that LLMs cannot solve problems that is not in their training data. I want this to be clarified or elaborated on.

              • shagie 11 hours ago ago

                Make up a fanciful problem and ask it to solve it. For example, https://chatgpt.com/s/t_691f6c260d38819193de0374f090925a is unlikely to be found in the training data - I just made it up. Another example of wizards and witches and warriors and summoning... https://chatgpt.com/share/691f6cfe-cfc8-8011-b8ca-70e2c22d36... - I doubt that was in the training data either.

                Make up puzzles of your own and see if it is able to solve it or not.

                The blanket claim of "cannot solve problems that are not in its training data" seems to be something that can be disproven by making up a puzzle from your own human creativity and seeing if it can solve it - or for that matter, how it attempts to solve it.

                It appears that there is some ability for it to reason about new things. I believe that much of this "an LLM can't do X" or "an LLM is parroting tokens that it was trained on" comes from trying to claim that all the material that it creates was created before, by a human and any use of an LLM is stealing from some human and thus unethical to use.

                ( ... and maybe if my block world or wizards and warriors and witches puzzle was in the training data somewhere, I'm unconsciously copying something somewhere else and my own use of it is unethical. )

                • wadadadad 10 hours ago ago

                  This is an interesting idea, but as you stated, it's all logic; it's hard to come up with an idea where you don't have to explain concepts yet still is dissimilar enough to be in the training.

                  In your second example with the wizards- did you notice that it failed to follow the rules? Step 3, the witch was summoned by the wizard. I'm curious as to why you didn't comment either way on this.

                  On a related note, instead of puzzles, what about presenting riddles? I would argue that riddles are creative, pulling bits and pieces of meaning from words to create an answer. If AI can solve riddles not seen before, would that count as creative and not solving problems in their dataset?

                  Here's one I created and presented (the first incorrect answer I got was Escape Room; I gave it 10 attempts and it didn't get the answer I was thinking of):

                  ---

                  Solve the riddle:

                  Chaos erupts around

                  The shape moot

                  The goal is key

                  • shagie 7 hours ago ago

                    The challenge is: for someone who is convinced that an LLM is only presenting material that they've seen before that was created by some human, how do you show them something that hasn't been seen before?

                    (Digging in old chats one from 2024 this one is amusing ... https://chatgpt.com/share/af1c12d5-dfeb-4c76-a74f-f03f48ce3b... was a fun one - epic rap battle between Paul Graham and Commander Taco )

                    Many people seem to believe that the LLM is not much more than a collage of words that it stole from other places and likewise images are a collage of images stolen from other people's pictures. (I've had people on reddit (which tends to be rather AI hostile outside of specific AI subs) downvote me for explaining how to use an LLM as an editor for your own writing or pointing out that some generative image systems are built on top of libraries where the company had rights (e.g. stock photography) to all the images)

                    With the wizards, I'm not interested necessarily in the correct solution, but rather how it did it and what the representation of the response was. I selected everything with 'W' to see how it handled identifying the different things.

                    As to riddles... that's really a question of mind reading. Your riddle isn't one that I can solve. Maybe if you told me the answer I'd understand how you got from the answer to the question, but I've got no idea how to go from the hint to a possible answer (does that make me an LLM?)

                    I feel its a question much more along some other classic riddles...

                        “What have I got in my pocket?" he said aloud. He was talking to himself, but Gollum thought it was a riddle, and he was frightfully upset. "Not fair! not fair!" he hissed. "It isn't fair, my precious, is it, to ask us what it's got in its nassty little pocketsess?”
                    
                    What do I have in my pocket? (and then a bit of "what would it do with that prompt?") https://chatgpt.com/s/t_691fa7e9b49081918a4ef8bdc6accb97

                    At this point, I'm much more of the opinion that some people are on "team anti-ai" and that it has become part of their identity to be against anything that makes use of AI to augment what a human can do unaided. Attempting to show that it's not a stochastic parrot or next token predictors (anymore than humans are) or that it can do things that help people (when used responsibly by the human) gets met with hostility.

                    I believe that this comes from the group identity and some of the things of group dynamics. https://gwern.net/doc/technology/2005-shirky-agroupisitsownw...

                    > The second basic pattern that Bion detailed is the identification and vilification of external enemies. This is a very common pattern. Anyone who was around the open source movement in the mid-1990s could see this all the time. If you cared about Linux on the desktop, there was a big list of jobs to do. But you could always instead get a conversation going about Microsoft and Bill Gates. And people would start bleeding from their ears, they would get so mad.

                    > ...

                    > Nothing causes a group to galvanize like an external enemy. So even if someone isn’t really your enemy, identifying them as an enemy can cause a pleasant sense of group cohesion. And groups often gravitate toward members who are the most paranoid and make them leaders, because those are the people who are best at identifying external enemies.

                • Ardren 8 hours ago ago

                  I think your example works, as it does try to solve a problem it hasn't seen (though it is very similar to existing problems).

                  ... But, CharGPT makes several mistakes :-)

                  > Wizard Teleport: Wz1 teleports himself and Wz2 to Castle Beta. This means Wz1 has used his only teleport power.

                  Good.

                  > Witch Summon: From Castle Beta, Wi1 at Castle Alpha is summoned by Wz1. Now Wz1 has used his summon power.

                  Wizzard1 cannot summon.

                  > Wizard Teleport: Now, Wz2 (who is at Castle Beta) teleports back to Castle Alpha, taking Wa1 with him.

                  Warrior1 isn't at Castle beta

                  > Wizard Teleport: Wz2, from Castle Alpha, teleports with Wa2 to Castle Beta.

                  Wizzard2 has already teleported

            • astrange 7 hours ago ago

              An LLM is not a Markov chain of the input tokens, because it has internal computational state (the KV cache and residuals).

              An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

              • Jensson 3 hours ago ago

                > An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

                Not any more degenerate than a multi word bag of words markov chain, its exactly the same concept: you input a context of words / tokens and get a new word / token, the things you mention there are just optimizations around that abstraction.

                • astrange 2 hours ago ago

                  The difference is there are exponentially more states than an n-gram model. It's really not the same thing at all. An LLM can perform nearly arbitrary computation inside its fixed-size memory.

                  https://arxiv.org/abs/2106.06981

                  (An LLM with tool use isn't a Markov process at all of course.)

            • purple_turtle 12 hours ago ago

              1) being restricted to exact matches in input is definition of Markov Chains

              2) LLMs are not Markov Chains

              • saithound 7 hours ago ago

                A Markov chain [1] is a discrete-time stochastic process, in which the value of each variable depends only on the value of the immediately preceding variable, and not any variables in the past.

                LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the context vectors, and the distribution of the new context window depends only on what was sampled previously context.

                A Markov chain is a Markov chain, no matter how you implement it in a computer, whether as a lookup table, or an ordinary C function, or a one-layer neural net or a transformer.

                LLMs and Markov text generators are technically both Markov chains, so some of the same math applies to both. But that's where the similarities end: e.g. the state space of an LLM is a context window, whereas the state space of a Markov text generator is usually an N-tuple of words.

                And since the question here is how tiny LLMs differ from Markov text generators, the differences certainly matter here.

                [1] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain

              • thaumasiotes 10 hours ago ago

                > 1) being restricted to exact matches in input is definition of Markov Chains

                Here's wikipedia:

                > a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

                A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.

                > 2) LLMs are not Markov Chains

                As far as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool of text generation. A Markov chain for this purpose is a hash table. The key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.

                To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".

                Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.

                A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

                Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

                You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.

                • Sohcahtoa82 8 hours ago ago

                  > Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

                  > Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

                  This is an oversimplication of an LLM.

                  The output layer of an LLM contains logits of all tokens in the vocabulary. This can mean every word, word fragment, punctuation mark, or whatever emoji or symbol it knows. Because the logits are calculated through a whole lot of floating point math, it's very likely that most results will be non-zero. Very close to zero, but still non-zero.

                  This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't in reality is because of top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.

                  If you present a s_1, s_2, ... s_N to a Markov Chain when that series was never seen by the chain, then the resulting probabilities is an empty set. But if you present them to an LLM, it gets fed into a neural network and eventually a set of logits comes out, and you can still choose the next token based on it.

                • purple_turtle 8 hours ago ago

                  Maybe technically LLM can be converted to an equivalent Markov Chain.

                  The problem is that even for modest context sizes the size of Markov Chain would by hilariously and monstrously large.

                  You may as well tell that LLM and a hash table is the same thing.

    • papyrus9244 9 hours ago ago

      > This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

      Or, in other words, a Markov Chain won't hallucinate. Having a system that only repeats sentences from it's source material and doesn't create anything new on its own is quite useful on some scenarios.

      • Sohcahtoa82 8 hours ago ago

        > Or, in other words, a Markov Chain won't hallucinate.

        It very much can. Remember, the context windows used for Markov Chains are usually very short, usually in the single digit numbers of words. If you use a context length of 5, then when asking it what the next word should be, it has no idea what the words were before the current context of 5 words. This results in incoherence, which can certainly mean hallucinations.

    • dragonwriter 7 hours ago ago

      > A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material.

      Strictly speaking, this is true of one particular way (the most straightforward) to derive a Markov chain from a body of text; a Markov chain is just a probabilistic model of state transitions where the probability of each possible next state is dependent only on the current state. Having the states be word sequence of some number of words, overlapping by all but one word, and having the probabilities being simply the frequency with which the added word in the target state follows the sequence in the source state in the training corpus is one way you can derive a Markov chain from a body of text, but not the only one.

    • psychoslave 11 hours ago ago

      > You'll find that the resulting output becomes incoherent garbage.

      I also do that kind of things with LLM. The other day, I don't remember the prompt (something casual really, not trying to trigger any issue) but le chat mistral started to regurgitate "the the the the the...".

      And this morning I was trying a some local models, trying to see if they could output some Esperanto. Well, that was really a mess of random morphs thrown together. Not syntactically wrong, but so out of touch with any possible meaningful sentence.

      • lotyrin 11 hours ago ago

        Yeah, some of the failure modes are the same. This one in particular is fun because even a human, given "the the the" and asked to predict what's next will probably still answer "the". How a Markov chain starts the the train and how the LLM does are pretty different though.

    • Kinrany 4 hours ago ago

      Markov Chains could be applied on top of embeddings just as well though

    • vjerancrnjak 11 hours ago ago

      If you learn with Baum Welch you can get nonzero ood probabilities.

      Something like Markov Random Field is much better.

      Not sure if anyone managed to create latent hierarchies from chars to words to concepts. Learning NNs is far more tinkery than brutality of probabilistic graphical models.

    • ssivark 11 hours ago ago

      Uhhh... the above comment has a bunch of loose assertions that are not quite true, but with a enough truthiness that makes them hard to refute. So I'll point to my other comment for a more nuanced comparison of Markov models with tiny LLMs: https://news.ycombinator.com/item?id=45996794

  • kleiba 12 hours ago ago

    Markov chains of order n are essentially n-gram models - and this is what language models used to be for a very long time. They are quite good. As a matter of fact, they were so good that more sophisticated models often couldn't beat them.

    But then came deep-learning models - think transformers. Here, you don't represent your inputs and states discretely but you have a representation in a higher-dimensional space that aims at preserving some sort of "semantics": proximity in that space means proximity in meaning. This allows to capture nuances much more finely than it is possible with sequences of symbols from a set.

    Take this example: you're given a sequence of n words and are to predict a good word to follow that sequence. That's the thing that LM's do. Now, if you're an n-gram model and have never seen that sequence in training, what are you going to predict? You have no data in your probabilty tables. So what you do is smoothing: you take away some of the probability mass that you have assigned during training to the samples you encountered and give it to samples you have not seen. How? That's the secret sauce, but there are multiple approaches.

    With NN-based LLMs, you don't have that exact same issue: even if you have never seen that n-word sequence in training, it will get mapped into your high-dimensional space. And from there you'll get a distribution that tells you which words are good follow-ups. If you have seen sequences of similar meaning (even with different words) in training, these will probably be better predictions.

    But for n-grams, just because you have seen sequences of similar meaning (but with different words) during training, that doesn't really help you all that much.

    • sadid 2 hours ago ago

      yes, but on this n-gram vs transformers; if you consider more general paradigm, self attention mechanism is basically a special form of a graph neural networks [1].

      [1] Bridging Graph Neural Networks and Large Language Models: A Survey and Unified Perspective https://infoscience.epfl.ch/server/api/core/bitstreams/7e6f8...

    • ActorNightly 8 hours ago ago

      In theory, you could have a large enough markov chain that mimicks an LLM, you would just need it to be exponentially larger in width.

      After all, its just matrix multplies start to finish.

      A lot of the other data operation (like normalization) can be represented as matrix multiplies, just less efficiently. In the same way that a transformer can be represented inefficiency as a set of fully connected deep layers.

    • esafak 11 hours ago ago
  • lukev 10 hours ago ago

    Other comments in this thread do a good job explaining the differences in the Markov algorithm vs the transformer algorithm that LLMs use.

    I think it's worth mentioning that you have indeed identified a similarity, in that both LLMs and Markov chain generators have the same algorithm structure: autoregressive next-token generation.

    Understanding Markov chain generators is actually a really really good step towards understanding how LLMs work, overall, and I think its a really good pedagogical tool.

    Once you understand Markov generating, doing a bit of handwaving to say "and LLMs are just like this except with a more sophisticated statistical approach" has the benefit of being true, demystifying LLMs, and also preserving a healthy respect for just how powerful that statistical model can be.

  • MarkusQ 3 days ago ago

    LLMs include mechanisms (notably, attention) that allow longer-distance correlations than you could get with a similarly-sized Markov chain. If you squint hard enough though, they are Markov chains with this "one weird trick" that makes them much more effective for their size.

  • inciampati 12 hours ago ago

    Markov chains have exponential falloff in correlations between tokens over time. That's dramatically different than real text which contains extremely long range correlations. They simply can't model long range correlations. As such, they can't be guided. They can memorize, but not generalize.

    • kittikitti 11 hours ago ago

      As someone who developed chatbots with HMM's and the Transformers algorithms, this is a great and succinct answer. The paper, Attention Is All You Need, solved this drawback.

      • vjerancrnjak 10 hours ago ago

        Markov Random Fields also do that.

        Difference is obviously there but nothing prevents you from undirected conditioning of long range dependencies. There’s no need to chain anything.

        The problem from a math standpoint is that it’s an intractable exercise. The moment you start relaxing the joint opt problem you’ll end up at a similar place.

    • zwaps 12 hours ago ago

      This is the correct answer

  • qoez 11 hours ago ago

    From a core openai insider who have likely trained very large markov models and large transformers: https://x.com/unixpickle/status/1935011817777942952

    Untwittered: A Markov model and a transformer can both achieve the same loss on the training set. But only the transformer is smart enough to be useful for other tasks. This invalidates the claim that "all transformers are doing is memorizing their training data".

  • wodenokoto 2 hours ago ago

    I'd say one of the main differences is that a Markov chain trained over N-grams works on discreet n-grams. Therefore the markov chain cannot tell the difference between two contexts never seen in training. They will both be the "unknown"-token.

    An LLM will see a bunch of smaller tokens in a novel order and interpret that.

  • yobbo 11 hours ago ago

    A hidden markov model (HMM) is theoretically capable of modelling text just as well any transformer. Typically, HMMs are probability distributions over a hidden discrete state space, but the distribution and state space can be anything. The size of the state space and transition function determines its capacity. RNNs are effectively HMMs, and recent ones like "Mamba" and so on are considered competent.

    Transformers can be interpreted as tricks that recreate the state as a function of the context window.

    I don't recall reading about attempts to train very large discrete (million states) HMMs on modern text tokens.

  • thatjoeoverthr 12 hours ago ago

    Others have mentioned the large context window. This matters.

    But also important is embeddings.

    Tokens in a classic Markov chain are discrete surrogate keys. “Love”, for example, and “love” are two different tokens. As are “rage” and “fury”.

    In a modern model, we start with an embedding model, and build a LUT mapping token identities to vectors.

    This does two things for you.

    First, it solves the above problem, which is that “different” tokens can be conceptually similar. They’re embedded in a space where they can be compared and contrasted in many dimensions, and it becomes less sensitive to wording.

    Second, because the incoming context is now a tensor, it can be used with differentiable model, back propagation and so forth.

    I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emmisions. It’s rough but manages multiturn conversation on a consumer GPU.

    https://joecooper.me/blog/crosstalk/

  • tlarkworthy 12 hours ago ago

    Markov chains learn a fixed distribution, but transformers learn the distribution of distributions and latch onto what the current distribution based on evidence seen so far. So that's where the single shot learning comes from in transformer. Markov chains can't do that, they will not change the underlying distribution as they read.

    • krackers 2 hours ago ago

      https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transforme... explains this more, see the part about mixed state presentation & belief synchronization

      >Another way to think about our claim is that transformers perform two types of inference: one to infer the structure of the data-generating process, and another meta-inference to update it's internal beliefs over which state the data-generating process is in, given some history of finite data (ie the context window). This second type of inference can be thought of as the algorithmic or computational structure of synchronizing to the hidden structure of the data-generating process.

  • unoti 9 hours ago ago

    If you're not sure about what a Markov Chain is, or if you've never written something from scratch that learns, take a look at this repo I made to try to bridge that gap and make it simple and understandable. You can read it in a few minutes. It starts with nothing but Python, and ends with generating text based on the D&D Dungeon Master Manual. https://github.com/unoti/markov-basics/blob/main/markov-basi...

  • kleiba 10 hours ago ago

    What's a tiny LLM, given that the first L stands for "large"?

  • Anon84 11 hours ago ago

    Not that different. In fact, you can use Markov Chain theory as an analytical tool to study LLMs: https://arxiv.org/abs/2410.02724

    You could probably point your code to Google Books N-grams (https://storage.googleapis.com/books/ngrams/books/datasetsv3...) and get something that sounds (somewhat) reasonable.

  • dragonwriter 8 hours ago ago

    I think, strictly speaking, autoregressive LLMs are markov chains of a very high order.

    The trick (aside from the order) is the training process by which they are derived from their source data. Simply enumerating the states and transitions in the source data and the probability of each transition from each state in the source doesn’t get you an LLM.

    • krackers an hour ago ago

      I always like to think LLMs are markov models in the way that real-world computers are finite state machines. It's technically true, but not a useful abstraction at which to analyze them.

      Both LLMs and n-gram models satisfy the markov property, and you could in principle go through and compute explicit transition matrices (something on the size of vocab_size*context_size I think). But LLMs aren't trained as n-gram models, so besides giving you autoregressive-ness, there's not really much you can learn by viewing it as a markov model

  • canjobear 10 hours ago ago

    Right there in the second sentence. The sentence is ungrammatical because the n-gram model can’t handle long term dependencies. “To obtain such a large medical database of the joints are risk factors” — the verb should be singular and the predicate should be something else, like “to obtain such a large medical database … is difficult”

  • OvrUndrInformed 9 hours ago ago

    Markov Processes are a pretty general concept that can be used to model just about anything if you let the “state” also incorporate some elements of the “history”. I assume that the way transformers are used to model language (next token prediction) can be considered a Markov process where the transition function is modeled by the LLM. Transitions between a state (given by [n] previous tokens, which are the context + text generated so far) and the next state (given by state[n+1] tokens, which are the context + text generated so far + the newest generated token) are given by the probability distribution output by the LLM. Basically I think you can consider auto-regressive LLMs as parameterized Markov processes. Feel free to correct me if I’m wrong.

  • currymj 11 hours ago ago

    bigram-trigram language models (with some smoothing tricks to allow for out-of-training-set generalization) were state of the art for many years. Ch. 3 of Jurafsky's textbook (which is modern and goes all the way to LLMs, embeddings etc.) is good on this topic.

    https://web.stanford.edu/~jurafsky/slp3/ed3book_aug25.pdf

    I don't know the history but I would guess there have been times (like the 90s) when the best neural language models were worse than the best trigram language models.

  • ComplexSystems 6 hours ago ago

    A few things:

    First, modern LLMs can be thought, abstractly, as a kind of Markov model. We are taking the entire previous output as one state vector and from there we have a distribution to the next state vector, which is the updated output with another token added. The point is that there is some subtlety in what a "state" is. So that's one thing.

    But the point of the usual Markov chain is that we need to figure out the next conditional probability based on the entire previous history. Making a lookup table based on an exponentially increasing history of possible combinations of tokens is impossible, so we make a lookup table on the last N tokens instead - this is an N-gram LLM or an N'th order Markov chain, where states are now individual tokens. It is much easier, but it doesn't give great results.

    The main reason here is that sometimes, the last N words (or tokens, whatever) simply do not have sufficient info about what the next word should be. Often times some fragment of context way back at the beginning was much more relevant. You can increase N, but then sometimes there are a bunch of intervening grammatical filler words that are useless, and it also gets exponentially large. So the 5 most important words to look at, given the current word, could be 5 words scattered about the history, rather than the last 5. And this is always evolving and differs for each new word.

    Attention solves this problem. Instead of always looking at the last 5 words, or last N words, we have a dynamically varying "score" for how relevant each of the previous words is given the current one we want to predict. This idea is closer to the way humans parse real language. A Markov model can be thought of as a very primitive version of this where we always just attend evenly to the last N tokens and ignore everything else. So you can think of attention as kind of like an infinite-order Markov chain, but with variable weights representing how important past tokens are, and which is always dynamically adjusting as the text stream goes on.

    The other difference is that we no longer can have a simple lookup table like we do with n-gram Markov models. Instead, we need to somehow build some complex function that takes in the previous context and computes outputs the correct next-token distribution. We cannot just store the distribution of tokens given every possible combination of previous ones (and with variable weights on top of it!), as there are infinitely many. It's kind of like we need to "compress" the hypothetically exponentially large lookup table into some kind of simple expression that lets us compute what the lookup table would be without having to store every possible output at once.

    Both of these things - computing attention scores, and figuring out some formula for the next-token distribution - are currently solved with deep networks just trying to learn from data and perform gradient descent until it magically starts giving good results. But if the network isn't powerful enough, it won't give good results - maybe comparable to a more primitive n-gram model. So that's why you see what you are seeing.

  • robot-wrangler 11 hours ago ago
  • aespinoza 12 hours ago ago

    Would you be willing to write an article comparing the results ? Or share the code you used to test? I am super interested in the results of this experiment.

  • amelius 8 hours ago ago

    Markov chains are very versatile. When you're holding a hammer ...

  • bilsbie 9 hours ago ago

    It wouldn’t be a Markov chain but I’d imagine you could make something functionally close to an LLM with creative uses of word frequency counts and probabilities over the massive amounts of text.

  • ssivark 11 hours ago ago

    The Markov property means that the next token is determined purely by the current token. Well, if it were a Hidden Markov Model, the next state would actually be determined by the current state, and the respective tokens would be a lossy representation of the states.

    The problem with HMMs is that the sequence model (Markov transition matrix) accounts for much less context than even Tiny LLMs. One natural way to improve this is to allow the model to have more hidden states, representing more context -- called "clones" because these different hidden states would all be producing the same token while actually carrying different underlying contexts that might be relevant for future tokens. We are thus taking a non-Markov model (like a transformer) and re-framing its representation to be Markov. There have been sequence models with this idea aka Cloned HMMs (CHMMs) [1] or Clone-Structured Cognitive Graphs (CSCGs) [2]. The latter name is inspired by some related work in neuroscience, to which these were applied, which showed how these graphical models map nicely to "cognitive schemas" and are particularly effective in discovering interpretable models of spatial structure.

    I did some unpublished work a couple of years ago (while at Google DeepMind) studying how CHMMs scale to simple ~GB sized language data sets like Tiny Stories [3]. As a subjective opinion, while they're not as good as small transformers, they do generate text that is surprisingly good compared with naive expectations of Markov models. The challenge is that learning algorithms that we typically use for HMMs (eg. Expectation Maximization) are somewhat hard to optimize & scale for contemporary AI hardware (GPU/TPU), and a transformer model trained by gradient descent with lots of compute works pretty well, and also scales well to larger datasets and model sizes.

    I later switched to working on other things, but I still sometimes wonder whether it might be possible to cook up better learning algorithms attacking the problem of disambiguating contexts during the learning phase. The advantage with an explicit/structured graphical model like a CHMM is that it is very interpretable, and allows for extremely flexible queries at inference time -- unlike transformers (or other sequence models) which are trained as "policies" for generating token streams.

    When I say that transformers don't allow flexible querying I'm glossing over in-context learning capabilities, since we still lack a clear/complete understanding and what kinds of pre-training and fine-tuning one needs to elicit them (which are frontier research questions at the moment, and it requires a more nuanced discussion than a quick HN comment).

    It turns out, funnily, that these properties of CHMMs actually proved very useful [4] in understanding the conceptual underpinnings of in-context learning behavior using simple Markov sequence models instead of "high-powered" transformers. Some recent work from OpenAI [5] on sparse+interpretable transformer models seems to suggest that in-context learning in transformer LLMs might work analogously, by learning schema circuits. So the fact that we can learn similar schema circuits with CHMMs makes me believe that what we have is a learning challenge and it's not actually a fundamental representational incapacity (as is loosely claimed sometimes). In the spirit of full disclosure, I worked on [4]; if you want a rapid summary of all the ideas in this post, including a quick introduction to CHMMs, I would recommend the following video presentation / slides [6].

    [1]: https://arxiv.org/abs/1905.00507

    [2]: https://www.nature.com/articles/s41467-021-22559-5

    [3]: https://arxiv.org/abs/2305.07759

    [4]: https://arxiv.org/abs/2307.01201

    [5]: https://openai.com/index/understanding-neural-networks-throu...

    [6]: https://slideslive.com/39010747/schemalearning-and-rebinding...

  • spencerflem 12 hours ago ago

    Iirc there was some paper that showed that LLMs could be converted to Markov chains and vice versa, but the size of the chain was much much higher

  • AndrewKemendo 12 hours ago ago

    Your example is too sparse to make a conclusion from

    I’d offer an alternative interpretation: LLMs follow the Markov Decison modeling properties to encode the problem but use a very efficient policy for solver for the specific token based action space.

    That is to say they are both within the concept of a “markovian problem” but have wildly different path solvers. MCMC is a solver for an MDP, as is an attention network

    So same same, but different