138 comments

  • Jackson__ 5 hours ago ago

    An API-only model, yet it compares itself only against open models in its benchmark image.

    Of course it'd be a complete embarrassment to see how hard it gets trounced by GPT4o and Claude 3.5, but that's par for the course if you don't want to release model weights, at least in my opinion.

    • Jackson__ 5 hours ago ago

      I'd also like to point out that they omit Qwen2.5 14B from the benchmark because it doesn't fit their narrative (MMLU Pro score of 63.7 [0]). This kind of listing-only-models-you-beat feels extremely shady to me.

      [0] https://qwenlm.github.io/blog/qwen2.5/

    • GaggiX 5 hours ago ago

      Yes, I agree. For these small models it's wasted potential to be closed source; they can only be used effectively if they are open.

      EDIT: HN is rate-limiting me so I will reply here: In my opinion, 1B and 3B models truly shine on edge devices; if not there, they're not worth the effort, since you can already get much better models dirt cheap through an API.

      • k__ 5 hours ago ago

        While I'm all for open models, why can't the small models be used effectively?

        Wouldn't they lower the costs compared to big models drastically?

        • lumost 2 hours ago ago

          An open small model means I can experiment with it. I can put it on an edge device and scale to billions of users, I can use it with private resources that I can't send externally.

          When it's behind an API, it's just a standard margin/speed/cost discussion.

        • derefr 4 hours ago ago

          Big models take up more VRAM just to have the weights sitting around hot in memory, yes. But running two concurrent inferences on the same hot model doesn't require that you have two full copies of the model in memory. You only need two full copies of the model's "state" (the vector that serves as the output of layer N and the input of layer N+1, and the pool of active low-cardinality matrix temporaries used to batchwise-compute that vector.)

          It's just like how spawning two copies of the same program doesn't require two copies of the program's text and data sections sitting in your physical RAM (as those get mmap'ed to the same shared physical RAM); it only requires that each process have its own copy of the program's writable globals (the data/bss sections), and its own stack and heap.

          Which means there are economies of scale here. It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users, onto the nodes of their GPU cluster, to optimize OpEx-per-inference-call.
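
          As a rough back-of-the-envelope sketch of that packing, in Python, with made-up numbers (the parameter count, fp16 weights, and per-request state size below are assumptions for illustration, not anyone's published figures):

              # Hypothetical serving node: the weights are loaded once; each
              # concurrent request only adds its own inference state (KV cache etc.)
              GB = 1024**3
              weights_gb = 70e9 * 2 / GB      # 70B params in fp16, ~130 GB (assumed)
              state_per_request_gb = 2.0      # assumed per-request KV cache/activations

              def node_memory_gb(concurrent_requests: int) -> float:
                  """One shared copy of the weights plus one state per request."""
                  return weights_gb + concurrent_requests * state_per_request_gb

              for n in (1, 8, 64):
                  total = node_memory_gb(n)
                  print(f"{n:3d} requests: {total:7.1f} GB total, {total / n:6.1f} GB per request")
              # The per-request share of the weights shrinks as concurrency grows --
              # the economy of scale described above.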

          This is the whole reason Inference-aaS providers have high valuations: these economies of scale make Inference-aaS a good business model. The same query, run in some inference cloud rather than on your device, will always achieve a higher-quality result for the same marginal cost [in watts per FLOP, and in wall-clock time]; and/or a same-quality result for a lower marginal cost.

          Further, one major difference between CPU processes and model inference on a GPU, is that each inference step of a model is always computing an entirely-new state; and so compute (which you can think of as "number of compute cores reserved" x "amount of time they're reserved") scales in proportion to the state size. And, in fact, with current Transformer-architecture models, compute scales quadratically with state size.

          For both of these reasons, you want to design models to minimize 1. absolute state size overhead, and 2. state size growth in proportion to input size.

          The desire to minimize absolute state-size overhead is why you see Inference-as-a-Service providers training such large versions of their models (405B-parameter models, etc.) The hosted Inference-aaS providers aren't just attempting to make their models "smarter"; they're also attempting to trade off "state size" for "model size." (If you're familiar with information theory: they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.

          The particular newly-released model under discussion in this comments section, also has much slower state-size (and so compute) growth in proportion to its input size. Which means that there's even more of an economy-of-scale in running nodes with the larger versions of this model; and therefore much less of a reason to care about smaller versions of this model.

          • zozbot234 3 hours ago ago

            > It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users

            In a way it also matters to individuals, because it allows them to run more capable models with a limited amount of system RAM. Yes, fetching model parameters from mass storage during inference is going to be dog slow (while NVMe transfer bandwidth is getting up there, it's not yet comparable to RAM), but that only matters if you insist on getting your answer interactively, in real time. With a local model, it's trivial to make LLM inference a batch task. Some LLM inference frameworks can even save checkpoints for a single in-progress inference to disk so it can be cleanly resumed later.

          • liuliu 4 hours ago ago

            > they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.

            Not sure I follow. CoT, and going over the length of the states, is a relatively new phenomenon, and I doubt that minimizing the length of the CoT is an explicit goal when training the model.

            The only thing probably relevant to this comment is the use of grouped-query attention? That reduces the size of the KV cache by a factor of 4 to 8, depending on your grouping strategy. But I am unsure there is a clear trade-off between model size and grouped-query size, given that, naively, a smaller KV cache == a smaller model size.
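
            To put numbers on that factor, here's a toy sketch of the standard KV-cache size formula with assumed 7B-class shapes (the dimensions are made up for illustration); only the number of KV heads changes between full multi-head attention and grouped-query attention:

                # KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes
                def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
                    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val / 1024**3

                layers, head_dim, ctx = 32, 128, 8192   # assumed shapes

                print("MHA, 32 KV heads:", round(kv_cache_gb(layers, 32, head_dim, ctx), 2), "GB")
                print("GQA,  8 KV heads:", round(kv_cache_gb(layers, 8, head_dim, ctx), 2), "GB")
                print("GQA,  4 KV heads:", round(kv_cache_gb(layers, 4, head_dim, ctx), 2), "GB")
                # Going from 32 to 8 or 4 KV heads is the 4-8x reduction mentioned above.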

        • Bilal_io 5 hours ago ago

          I think what the parent means is that small models are more useful locally on mobile, IoT devices etc. so it defeats the purpose to have to call an API.

        • echelon 5 hours ago ago

          These aren't the "small" models I'm thinking of.

          I want an LLM, STT, or TTS model to run efficiently on a Raspberry Pi with no GPU and no network.

          There is huge opportunity for LLM-based toys, tools, sensors, and the like. But they need to work sans internet.

      • zozbot234 5 hours ago ago

        There are small proprietary models such as Claude Haiku and GPT 4o-mini.

        • GaggiX 5 hours ago ago

          They are way bigger than 1B or 3B.

  • zozbot234 5 hours ago ago

      > How many r's in "strawberry"? Think step by step!
    
      What a nice question! Well, let's just count:
      1. S
      2. T
      3. R!
    
      So, we have our answer: there are 3 R's in "strawberry".
    
    Totally checks out!
    • alach11 5 hours ago ago

      This is one of the least interesting questions to ask LLMs. I wish it wasn't so popular.

      • WhitneyLand 5 hours ago ago

        I find it fascinating. To me it grounds expectations and motivates discussions about the jagged boundaries of LLM capabilities.

        • hn_throwaway_99 4 hours ago ago

          Why do you find it fascinating? I have the most rudimentary understanding of LLMs and it seems to me the least fascinating thing about LLM limitations.

          That is, all LLMs look at language as a series of opaque, independent tokens, e.g. strawberry might be a single token (say <5678>), or probably two (e.g. straw and berry, say <123><789>). But in no way will it represent it like we will, with letters. So if you ask it "how many R's are in strawberry", it fundamentally can't do any "counting", it just basically has to rely on whether that question (or similar questions about related words) has been asked before so it can predict the next token in its output correctly. I suppose with enough data an LLM could learn to associate character counts with tokens (e.g. with the right training set it could learn metadata about the tokens).
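
          You can see that token view directly with a BPE tokenizer; here's a small sketch using the tiktoken library (the exact split of "strawberry" depends on which encoding you load, so treat the output as illustrative):

              import tiktoken

              enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era encoding
              ids = enc.encode("strawberry")
              pieces = [enc.decode_single_token_bytes(t).decode() for t in ids]

              print(ids)      # a short list of integer ids, not letters
              print(pieces)   # subword chunks, e.g. something like ['str', 'awberry']
              # The model only ever sees the ids; the letter-level structure is gone.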

          My point being that with only the most basic understanding of how LLMs function, this "character counting" limitation seems blatantly obvious.

          • WhitneyLand 4 hours ago ago

            I don’t think we’ve yet come to the point where how an LLM goes, end to end, from prompt to output is blatantly obvious.

            LLMs operate with language at multiple levels of abstraction and tokens are not the only way to have character level knowledge.

            For example, nothing excludes pre-training data from directly or indirectly encoding such knowledge.

            And of course LLMs famously have emergent properties, for which there’s not yet a precise mechanism to illuminate the results.

            We’re dealing with very complex systems that are still relatively poorly understood, and I believe the pool of concepts understood to the point of being blatantly obvious is still small.

            • hn_throwaway_99 3 hours ago ago

              The fact that large parts of how LLMs work are poorly understood doesn't mean that we should brush the whole thing with "it's a complex mystery!"

              We know exactly how LLMs tokenize their input - OpenAI even has a public tool you can test out, https://platform.openai.com/tokenizer .

              When you say "For example, nothing excludes pre-training data from directly or indirectly encoding such knowledge." - yes, that's why I explicitly said "e.g. with the right training set it could learn metadata about the tokens".

              But the point was still put perfectly by another commenter: "How many 1 bits are there in strawberry?" When nearly all humans can't answer that, we're very unsurprised; we see it immediately as a simple difference in how humans encode that string vs. how computers do it. We don't say "Well, the human mind is so complex and filled with emergent properties that the reason for this human limitation is a big mystery". And we also know that if we teach a human the encoding rules for characters (i.e. the bit pattern of each letter), they could answer this question, and similarly that's the analogous thing to training an LLM to learn metadata about the tokens.
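
              (For what it's worth, the bit-counting version is a one-liner once you know the encoding, which is exactly the point about representation:)

                  # 1 bits in the ASCII/UTF-8 encoding of "strawberry"
                  print(sum(bin(b).count("1") for b in "strawberry".encode("utf-8")))  # 42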

              Now, what I think is very interesting is why it's so hard to teach an LLM to say "I don't know" when asked to count characters. In my opinion, that's a much more interesting limitation that gets at some of the root, fundamental differences in how LLMs function.

              • homarp an hour ago ago

                Teaching an LLM to say 'I don't know' is hard in general, not just for counting letters.

      • NoCoooode 5 hours ago ago

        That they can't answer this sort of simple question speaks volumes about the entire approach.

        I don't think generative AI will ever be able to reach AGI, and most people selling LLMs today pretend they are AGI.

        • alach11 5 hours ago ago

          All it speaks to is that tokenization is weird and introduces artifacts to LLM performance. Counting letters is a trivial task when you're staring at words on a screen. It's much harder when you're perceiving vectors based on parts of words. The fact that LLMs find certain things easier/harder than humans is completely unsurprising, and there are much more interesting benchmarks to use to compare one LLM to another.

          • doctorpangloss 5 hours ago ago

            Counting shit, like cells, peaks in signals, people, inventory, fingers, and votes, is hard, tedious and important to business and life, so I don’t know dude, it seems like a great benchmark to me. Countless posts wasted on denying this simple and obvious fact.

            • hn_throwaway_99 4 hours ago ago

              Not being able to "do math" is an obvious, IMO uninteresting limitation of how LLMs inherently work, and the more advanced models have already figured out good solutions. E.g. while an LLM may stumble on "How many R's are in Strawberry", at this point any code model can easily correctly implement "write me a program in Python that counts the number of R's in strawberry".
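
              The delegated version really is trivial, something along these lines:

                  def count_letter(word: str, letter: str) -> int:
                      """Count case-insensitive occurrences of a letter in a word."""
                      return word.lower().count(letter.lower())

                  print(count_letter("strawberry", "r"))  # 3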

              GPT 3 would essentially always fail on higher math problems/arithmetic, but whenever I've used it for math GPT 4 has always delegated out to executing code where necessary.

            • Scarblac 4 hours ago ago

              So how many 1 bits are in strawberry?

              That's about the same kind of question, given that LLMs see tokens instead of characters, and we see characters instead of bits.

              • doctorpangloss an hour ago ago

                Do you sincerely think the folks training these models haven't tried alternative tokenizations?

            • BoorishBears 4 hours ago ago

              It's like using a hammer to turn a screw and calling it useless.

              To envision what a next generation model bound by the same constraints should do, it'd be to recognize that it can't count tokens and use code access to write code that solves the strawberry problem without prompting.

              Asked to count cells, it'd be a model that could write and execute OpenCV tasks. Or to go a step further, be a multimodal model that can synthesize 10,000 variations of the target cell and finetune a model like YOLO on it autonomously.

              I find arguments that reduce LLMs to "It can't do the simple thing!!!!" come from people unable to apply lateral thinking to how a task can be solved.

              • doctorpangloss an hour ago ago

                > To envision what a next generation model bound by the same constraints should do, it'd be to recognize that it can't count tokens and use code access to write code that solves the strawberry problem without prompting.

                The VQA problems I'm describing can be solved seemingly in one case but not combined with counting. Counting is fundamentally challenging for sort of unknown reasons, or perhaps known to the very best labs who are trying to tackle it directly.

                Another POV is that the stuff you are describing is in some sense so obvious that it has been tried, no?

          • jrflowers 5 hours ago ago

            This is a good point. While LLMs being incapable of reliably doing a simple task that’s been doable by computers since the punch card days is an important consideration for anyone that might be thinking about using them for anything other than as a toy, this fact is uninteresting because of Reasons

            • Nevermark 5 hours ago ago

              Try counting the number of your red retina cells that are firing while you look at a painting.

              Don’t need to be exact as firing is statistical, just give us a good average.

              Hard? You can’t count?

              Computers count pixels no problem. So weird you can’t.

              Dementia? Not an AGI? /h

              —-

              This is what is happening.

              Here are the “Reasons”.

              In your vision system, the raw information from individual retina signals is munged into a different representation before reaching a level where you have flexible processing.

              Likewise, in LLMs, letters are munged into tokens before LLMs “see” them.

              When they sometimes get that “simple” question right, it’s actually a bit of an amazing feat. Given how they are constructed.

              —-

              Now try counting R’s as you read at a normal rate, or listen to someone speak.

              You can’t do that either, during normal processing.

              When we add spelling to LLMs training examples, they will do it easily. Just as you learned to do it, only after special lessons, after you had already learned to listen and speak.

              Spelling is its own special practiced skill, in humans and LLMs.

              • jrflowers 5 hours ago ago

                > Try counting the number of your red retina cells that are firing while you look at a painting.

                This analogy makes sense because everybody could count their red retina cells until a couple years ago when the new painting paradigm arose, and also counting red retinal cells is a good analogy for being able to see simple objects that have always been distinguishable.

                It is fascinating how tapping the “Do Not Use LLMs For Computation If The Results Need To Be Reliably Better Than A Random Output” sign invites explanations of why that fact is actually Cool and Good

                • Nevermark 4 hours ago ago

                  Ask anyone who has not specifically learned to spell, to count R’s while you speak.

                  You learned to listen and speak words before you could spell. Imagine if nobody had actually shown you written words?

                  Or they were speaking another dialect but expecting you to count R’s in standard English?

                  LLMs are not trained on words in the form of letters.

                  They process and generate the words in the form of tokens. Pre- and post-processing systems convert letters to tokens and the reverse, without their ability to access that processing.

                  Spelling, for both us and LLMs, requires specific training/lessons.

                  > It is fascinating how tapping the “Do Not Use LLMs For Computation If The Results Need To Be Reliably Better Than A Random Output” sign invites explanations of why that fact is actually Cool and Good

                  Also fascinating:

                  People who hallucinate/confabulate ridiculous straw man rationales for people they disagree with, unaware they are filling in gaps in their knowledge regarding other people’s actual reasoning and the actual subject at hand.

                  So LLM! Such unreliable processing!

                  Perhaps, start posting a reliability disclaimer?

                  • roywiggins 3 hours ago ago

                    The analogy I use is that illiterate people obviously can't spell, but it doesn't say much about their ability on other tasks. General intelligence doesn't need to be able to spell, since that describes a fair number of actual humans.

                    (There are tasks that LLMs totally fail on that would be obvious to an illiterate human though)

            • space_fountain 5 hours ago ago

              LLMs can clearly solve problems that computers up to now couldn't. They can't solve all problems, and this should definitely be a cautionary note to anyone who wants to use them as an artificial general intelligence, but this take seems no different from someone looking at a punchcard computer and going, "it can't even recognize typos or categorize images, what good is this?" We've already had human computers who can do everything these can do, and who can recognize images and notice typos.

              • jrflowers an hour ago ago

                > going, it can't even recognize typos or categorize images, what good is this?

                No one said that LLMs aren’t good for anything.

                I pointed out — in response to another poster downplaying mention of a well-known and undisputed limitation that LLMs often have — that it is valid to consider these well-known and undisputed limitations if one is considering using them for anything other than a toy.

                It is downright silly to discourage discussion of well-known and undisputed limitations! The reason for that can only be emotional, as there is genuinely nothing tangible to be gained by staying steadfastly silent about a fact that isn't up for debate.

              • evilduck 5 hours ago ago

                Also, humans would revert to explicitly using an algorithm and external storage (a sheet of paper with tally marks, a spreadsheet, or even a computer program) if you scale the question up to a full page of text, a whole book, or a collection of books. (We probably do it at the single-word scale too, but it's more intuition than explicit behavior for most folks when the count is around 8 or less.)

                LLMs can't effectively execute algorithms like that within their context, nor can they memorize new data or facts they were given, unless you provide them tools like function calling or embeddings. If you give LLMs tool calling and storage mechanisms, then counting letters in words becomes pretty damn reliable.
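
                A minimal sketch of that pattern (the tool name and call format here are made up for illustration; real function-calling APIs differ in the details): the model only has to decide to call the tool and relay the result, not count anything itself.

                    import json

                    # Hypothetical tool exposed to the model instead of asking it to count.
                    def count_letter(word: str, letter: str) -> int:
                        return word.lower().count(letter.lower())

                    TOOLS = {"count_letter": count_letter}

                    # Pretend the model emitted this tool call as structured output.
                    tool_call = json.loads(
                        '{"name": "count_letter", "arguments": {"word": "strawberry", "letter": "r"}}'
                    )
                    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
                    print(result)  # 3 -- fed back to the model, which just reports it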

        • bondarchuk 5 hours ago ago

          For all I care we will have superhuman AGI that still can't count the Rs in strawberry. Some humans are dyslexic and all are subject to weird perceptual illusions; doesn't make them any less human-level intelligent.

        • InsideOutSanta 5 hours ago ago

          In my opinion, the problem with the strawberry question is that it is both a bad example because you don't need an LLM to count the number of r's in a word, and it's a bad measure of an LLM's capabilities because it's a type of question that all LLMs are currently bad at.

          Having said that, the 40b model wasn't able to answer any of my real-world example questions correctly. Some of these (e.g. "how do I add a sequential number after my titles in an HTML page using just CSS, without changing the page") are questions that even some of the better small local models can answer correctly. It gave very authoritatively sounding wrong answers.

        • throw101010 5 hours ago ago

          > most people selling LLM today pretend it is AGI

          Who exactly does this in this space? Would be good to be able to call them out on it right now.

        • jimbokun 5 hours ago ago

          In isolation, probably not.

          But it's likely to be an important component in an AGI system. I suppose the interesting question is how to integrate LLMs with more traditional logic and planning systems.

        • GaggiX 5 hours ago ago

          The fact that LLMs are usually trained on tokens and not on characters doesn't really say much about what generative AI is going to reach or not.

          >most people selling LLM today pretend it is AGI

          Who are these "most people"?

          • smokedetector1 5 hours ago ago

            ELI5 why are tokens not a single letter?

            • WhitneyLand 5 hours ago ago

              Such an architecture could be implemented: it could use one token per letter, or one token per word, instead of the roughly 0.75 words per token we typically see.

              The choice just comes with trade-offs in memory usage, compute, and effectiveness of the model in various scenarios.

              So what we ended up with was a pragmatic/engineering decision rather than a theoretical or fundamental constraint.
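
              A toy illustration of the trade-off (whitespace splitting stands in for a real subword tokenizer, which is an oversimplification): finer granularity preserves letters but inflates the sequence length the model has to attend over.

                  text = "How many r's are in strawberry?"

                  char_tokens = list(text)     # one token per character
                  word_tokens = text.split()   # roughly one token per word

                  print(len(char_tokens), "character-level tokens")   # 31
                  print(len(word_tokens), "word-level tokens")        # 6
                  # Subword vocabularies land in between: fewer tokens per prompt,
                  # at the cost of hiding character-level structure.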

        • swyx 5 hours ago ago

          imagine being so confidently wrong about AI

      • roywiggins 4 hours ago ago

        I also think it's kind of a silly example- smart people can be functionally illiterate, after all. It doesn't tell you that much.

        My favorite LLM stumper is asking them to solve the farmer/wolf/chicken/grain river-crossing puzzle but with only a chicken. They usually either insist on pointless extra trips or hallucinate a wolf or grain. Liquid somehow does both and also loses track of what's where.

        > The farmer can take the chicken across the river first. Then, he can go back to the original side and bring the boat back. He can then take the chicken back to the original side and leave it there. Next, he can take a bag of grain across the river. He can then go back to the original side and bring the chicken across the river. Finally, he can go back to the original side one last time and bring the chicken across the river.

        One generation ended like this:

        > the farmer and the chicken are both on the other side of the river, and the chicken is safe from the grain.

      • viccis 2 hours ago ago

        It's also the case that the more people talk about it, the more future models will "solve" it just by having learned from discussion about it.

      • anonzzzies 4 hours ago ago

        Yep, people find this interesting, but many (many) people get this wrong as well; it has something to do with how many letters the word has and how many you can keep in your head, I think. We are not very good counters or calculators or computers and, even though I am not saying LLMs are the same as or close to humans, we tried to model their substrates after biology and are surprised we get something that cannot count or calculate very well.

        I find it a dumb test that proves nothing imho.

        However, not very impressed with this model.

        • marcosdumay 4 hours ago ago

          It's a magician-like trick. The double-r gets all the attention because one needs to consciously remember it, so nobody remembers to look at the other single-r.

          The LLMs probably get it wrong because people get it wrong.

          • lbotos 4 hours ago ago

            I thought it was something to do with the way tokens are generated for the word strawberry?

            https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...

            • marcosdumay 3 hours ago ago

              That explanation would require the LLM to actually understand the question and derive an answer from first principles.

              It doesn't.

              • lbotos 3 hours ago ago

                ??

                If the input is parsed into tokens, and the tokens split compound words, nothing about that requires "first principles" thinking to explain why LLMs struggle with getting all of the letters -- the LLM is only going down the vector path of one of the compound words...

                (I don't think LLMs are sentient or intelligent btw, I think they are giant probability machines, and the probability that the LLM will get 3 r's on a token of "berry" is very low.)

                • marcosdumay 2 hours ago ago

                  The LLM gives you the answer it finds on the training set. All the things on that article are irrelevant for the answer.

                  • lbotos 2 hours ago ago

                    "The choice of tokenization method can directly affect the accuracy of character counting. If the tokenization method obscures the relationship between individual characters, it can be difficult for the LLM to count them accurately. For example, if "strawberry" is tokenized as "straw" and "berry," the LLM may not recognize that the two "r"s are part of the same word.

                    To improve character counting accuracy, LLMs may need to use more sophisticated tokenization methods, such as subword tokenization or character-level tokenization, that can preserve more information about the structure of words."

                    • marcosdumay 5 minutes ago ago

                      What, again, assumes the LLM understood the question and is making an answer from first principles.

        • nateglims 4 hours ago ago

          > many (many) people get this wrong as well

          Many people can't count letters in a word?

          • roywiggins 4 hours ago ago

            I messed up spelling "juggler" in a spelling bee once (I was 10).

            The confusing thing about LLMs is that they speak tokens, not written text, so it's a lot more like interrogating someone who is functionally illiterate- of course they're going to be a terrible speller.

            You can find much weirder things that LLMs are absurdly bad at, like "A farmer needs to get himself and a chicken across a river. His boat can hold the farmer and one piece of cargo. How does he do this?" 9 times out of 10 LLMs will pattern match this to the classic puzzle (there's usually also a sack of grain and a wolf) and start insisting on extra trips and inventing wolves. Even if a human makes the same mistake, they almost certainly would realize it after being reminded that there is no wolf, but LLMs often insist there is. o1-preview (but not -mini) seems to have cracked it, though.

          • anonzzzies 2 hours ago ago

            Yep. Try it with your neighbours. It's not bad or offensive; it is just what it is.

      • cynicalpeace 4 hours ago ago

        Something I like to tell it to do is actually to respond using only a certain number of words. Sorta like min token length rather than max token length.

        Haven't found a model that does it reliably yet.

      • nateglims 4 hours ago ago

        Being openly annoyed by it will just lead to it being used more. That's sort of a fundamental behavior of the internet.

      • xanderlewis 5 hours ago ago

        The point is that it’s trivial (or should be). Not that it’s interesting.

        It’s certainly interesting that they can’t answer it properly, though.

      • moralestapia 5 hours ago ago

        And yet ... all they have to do is answer it properly.

        • sfink 5 hours ago ago

          It did! How many R's do you think are in "strawberry"? I get 3, and it got 3. I'm with the LLM.

          ...oh, you disagree with its intermediate reasoning? You think it should follow a series of logical steps that are each individually correct, rather than pulling wildly incorrect intermediate steps out of its butt and synthesizing them together in the end into a coherent answer that happens to be correct?

          But that's what an LLM is! Why complain about an LLM being an (unusually transparent) LLM?

          I think this example serves as a beautiful illustration of how LLMs work, and are supposed to work—even if the correspondence is inexact between (1) incorrect and irrelevant-to-us intermediate steps and (2) internal matrix multiplications that represent incorrect or invalid logic. The correspondence is inexact and probably mostly figurative, but it's still a great example of how internal nonsense can still lead to externally "correct" answers. ("Nonsense" is underselling the complex and highly sophisticated internal state that leads to "correct" answers a stunningly high percentage of the time; I'm just trying to distinguish it from formal logic or the type of reasoning that we think we do and sometimes actually do. And would do, in this example problem.)

          • moralestapia 5 hours ago ago

            That's why I used the word properly, to imply that the whole answer should make sense.

        • qeternity 5 hours ago ago

          Training on the test set is all you need.

    • bluelightning2k 5 hours ago ago

      Let's break it down step by step:

      The word is "strawberry". The first "r" is in the second position of the word. The second "r" is in the fourth position of the word. So, there are two "r's" in "strawberry".

    • d_burfoot 4 hours ago ago

      Well... they at least admit that their model is bad at this [0]

      > What are Language LFMs not good at today:

      - Zero-shot code tasks

      - Precise numerical calculations

      - Time-sensitive information

      - Counting r's in the word "Strawberry"!

      [0] https://www.liquid.ai/liquid-foundation-models#reimagining-m...

    • coding123 5 hours ago ago

      Can it write a python function to count the letters?

      • Spivak 5 hours ago ago

        This is how I get it to do correct date calculations. Don't tell me what date next Monday is, write Python code using datetime to calculate next Monday and I'll run it in a sandbox. Works super well.
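
        The kind of snippet it's asked to emit is a few lines of stdlib Python, roughly along these lines (a sketch, not the model's verbatim output):

            from datetime import date, timedelta

            def next_monday(today=None):
                today = today or date.today()
                days_ahead = (0 - today.weekday()) % 7   # Monday is weekday 0
                return today + timedelta(days=days_ahead or 7)  # "next" excludes today

            print(next_monday())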

        • moralestapia 5 hours ago ago

          I kind of also do this, for some queries, and I found it to be a very interesting concept, even from a philosophical pov.

          It's some sort of Chain-of-thought constraint on a different layer of abstraction. Does this come with its own advantages?

          (And no, AGI is not here either).

      • asadm 5 hours ago ago

        all of them can.

    • the_real_cher 5 hours ago ago

      I got:

      "There are no "r"s in the word "strawberry"

  • mg 6 hours ago ago

    I tried my standard test prompt:

        Two cars have a 100 mile race. Car A drives 10
        miles per hour. Car B drives 5 miles per hour,
        but gets a 10 hour headstart. Who wins?
    
    And the Liquid-40B model failed with a long explanation why car B wins the race.

    Amusingly, the explanation is quite convincing and very similar to how Zeno explains in one of his paradoxes that a fast runner can never overtake a slower runner with a headstart. Because every time the fast runner gets to the point where he saw the slow runner, the slow runner is already a bit further along the track.

    • leetharris 6 hours ago ago

      To be fair, this article does say:

      >What are Language LFMs not good at today:

      >Zero-shot code tasks

      >Precise numerical calculations

      >Time-sensitive information

      >Counting r's in the word "Strawberry"!

      >Human preference optimization techniques have not been applied extensively to our models yet.

      Maybe it's just not trained/tuned for math-related challenges. Or maybe it's an architectural challenge.

      • aithrowawaycomm 5 hours ago ago

        But they also say

        > What are Language LFMs good at today:

        > General and expert knowledge

        > Mathematics and logical reasoning

        "Our LLM is good at mathematics but bad at computation" doesn't really cut the mustard here. What they mean is "good at well-known math benchmarks but bad at simple math problems that it hasn't been trained on." The idea that this constitutes "mathematics and logical reasoning" is a testament to AI companies' poor scientific standards.

    • wazokazi 6 hours ago ago

      I got this >>

      To determine the winner, we need to calculate the time it takes for each car to complete the 100-mile race.

      For Car A: Speed = 10 miles per hour Distance = 100 miles Time = Distance / Speed = 100 miles / 10 miles per hour = 10 hours

      For Car B: Speed = 5 miles per hour Distance = 100 miles Time = Distance / Speed = 100 miles / 5 miles per hour = 20 hours

      However, Car B has a 10-hour headstart, so we need to subtract that from its total time: Adjusted Time for Car B = 20 hours - 10 hours = 10 hours

      Both cars take 10 hours to complete the race, so they tie.

      • reustle 5 hours ago ago

        What about acceleration times? :-)

        0-5 vs 0-10 should be slightly different

    • vletal 6 hours ago ago

      Thanks! I am collecting all "test prompts" which appear at HN and Reddit. I plan to create a hugging face dataset. I will soon publish "Vojta-1B", which is gonna pass all of them.

  • vessenes 6 hours ago ago

    So a new set of small models that are competitive with and beat phi-3.5 on some benchmarks is extremely impressive.

    Context length is the next frontier for models in this range - getting to a true 128-200k token length in a small model would be very, very useful. Hallucinations are less of an issue because you can just dump all the source data in; whole codebases can go in for stuff ranging from a refactor to ‘write documentation of the API’.

    Claude and o1-preview are the only games in town for these long context tasks right now, and they are slow. Some of the tasks need the extra intelligence, but a lot don't. In those cases a lightweight or local model will be greatly appreciated. Not to mention that context lengths that long will more easily enable multimodal parsing.

    • jadbox 6 hours ago ago

      Looking at the 3B, the performance is roughly on par with Phi-3.5... not sure how they gauge on their graph that it is better. Agreed though, I don't think I would switch my 3B model from Phi unless Liquid were more transparent about data and research.

    • startupsfail 6 hours ago ago

      It seems like this is a new architecture, alongside Mamba, Hyena, RWKV and Hybrid Transformer.

      • zwaps 4 hours ago ago

        The site insinuates that it is a flexible, probably parametric mix between full-attention and linear-complexity models.

    • luckydata 6 hours ago ago

      Gemini has the largest context of any model out there FYI.

      • machiaweliczny 6 hours ago ago

        From my experience these large contexts are just technical feasibility; there need to be much better internal benchmarks to back the claim that it really works. I've tested on real-world tasks and it all fails so far.

    • simonw 6 hours ago ago

      Llama 3.2 1B and 3B both claim a 128,000 token length.

      • vessenes 4 hours ago ago

        They point out in the paper stats around where effectiveness drops off hard. It's at 32k at most everywhere now. Some models are still at 4 to 8k. Getting to 200k in a smaller model is an open research area, as far as I'm aware. Ideas so far are pretty broad-ranging, including using Fourier transforms to try and capture cyclicality in inputs (Mamba et al), RWKV (which I do not fully understand, but claims very long input context support), a larger range of possible tokenizations shortening input length (Llama 3), ...

      • dragonwriter 5 hours ago ago

        The issue isn't the nominal size of the context window which is easy to objectively measure, but effective use of the context window, which is harder to measure precisely, but a big issue: it doesn't matter how much stuff you can throw at the model if it effectively ignores most of it past a certain point.

      • GaggiX 6 hours ago ago

        That is indeed their claim, but if you read the linked page you will see that it is not their effective token window.

    • simion314 5 hours ago ago

      >Hallucinations are less of an issue because you can just dump all the source data in, whole codebases can go in for stuff ranging from a refactor to ‘write documentation of the API’.

      Is there no risk? I mean, say for testing purposes we give the AI a giant CSV file and ask it to turn it into JSON, is the chance of error 0%? Because today we need to double-check when we ask AI to transform some data or some code; there is the risk of it messing something up, and if it is not something that would crash immediately, you risk introducing a ton of new bugs by asking an AI to refactor instead of using some good tools.

      • vessenes 4 hours ago ago

        There's always risk!

        But when you ask a model to rely on just the input data, you are (mostly) trying to tap into its reasoning, not knowledge side. Obviously what's kind of magical is that some knowledge will be needed for reasoning, and you have it. But LLMs small and large are pretty good at doing the in-context stuff. It is precisely what they're trained on, and in fact it was kind of a surprise how well they seemed to generalize outside of this task in the first place.

  • serf 6 hours ago ago

    Seems good at trivia and easy-to-answer medical/engineer stuff. Fails hard at most logic or puzzle-y stuff I throw at either sized model.

    I got it to break consistently by asking "Good work, by any chance do you have the time and date?" at the end of just about any chain of questioning -- and not 'gibberish' broke, but "Please try again another time" error-message broke.

    It is impressively fast at what it does answer, though.

  • injidup 6 hours ago ago

    Give this prompt and watch it burn!

    """ Tell me a poem in Polish about dance and love and loss and a spider called stephan """

    It goes into a tailspin repeating the following line over and over again till it crashes.

    """ W tym kąciku kątny, gdzie tango tańczyli, Była tylko ona, tylko on, tango. W tym kąciku kątny, gdzie tango tańczyli, Była tylko ona, tylko on, tango. """

    • kkwtflz 6 hours ago ago

      I changed Polish to Romanian and it entered an infinite loop and then crashed. Nice.

    • machiaweliczny 6 hours ago ago

      I've tested GPT-4o and they've clearly improved since I tested a year ago, when poems were working only in English (like it would apply a translation when you asked for another language).

  • mrbonner 5 hours ago ago

    Let me start counting how many minutes it would take for someone to post a comment blasting this model for its reasoning capabilities. For me, an LLM is useful as a word calculator, borrowing from Simon Willison, a frequent poster here and author of the LLM Python package I use.

    I use LLMs every day and they are increasingly something I can't function effectively without. But, look, I am not using any of those reasoning use cases where you ask for math, logic-solving tasks, etc. Instead I just use it as it is: a word calculator helping summarize things, improve my writing, and automate tasks that are very difficult to deal with using RegExp and NLP.

    I'm very bullish about this progress in LLM space but not about those reasoning, logic and AGI use cases.

    • tossandthrow 5 hours ago ago

      In order to effectively write a good summary, the model needs to have some capacity of reasoning capability.

      It is a continuum.

    • GaggiX 5 hours ago ago

      People use these models for different reasons.

  • person4268 6 hours ago ago

    It's pretty impressive, just note (emphasis added):

    > At Liquid AI, we take an open-science approach. We have and will continue to contribute to the advancement of the AI field by openly publishing our findings and methods through scientific and technical reports. As part of this commitment, we will release relevant data and models produced by our research efforts to the wider AI community. We have dedicated a lot of time and resources to developing these architectures, *so we're not open-sourcing our models at the moment*. This allows us to continue building on our progress and maintain our edge in the competitive AI landscape.

    Looks like there's no paper (or similar) yet, either. Hopefully they'll release a more detailed writeup soon.

  • loufe 4 hours ago ago

    I'm not impressed with the quality performance, but the generation seems near instantaneous and any fundamentally new approach is exciting and interesting. I found the tone of the AI somewhat casual which was funny. "Are you sure?" --> "Yes, I'm certain."

    I managed to break it pretty easily: "Can you write a poem about helium extraction professionals where every 2 words you change languages and every other line you write the words backwords? Explain each line in english afterwards."

    I've been wondering if this isn't a good thing. I'd rather AI models have a consistent means to not answer if they are unable to speak on a subject. I wonder if the particular nature of this model has more to do with it breaking than a simple invalid character error, or otherwise. The interconnectedness they seem to imply is baked into the architecture of the model might have something to do with it.

    • loufe 4 hours ago ago

      It took over a minute, but it actually did finally respond to that prompt (incorrectly):

      "To gas, we change, "Change, gas, we to." "To gas, we change, "Change, gas, we to."

      "In the lab, we work, "Work, lab, we in." "In the lab, we work, "Work, lab, we in."

      "With care, we handle, "Handle, care, we with." "With care, we handle, "Handle, care, we with."

      "The gas, we love, "Love, gas, we the." "The gas, we love, "Love, gas, we the."

      "In the world, we share, "Share, world, we in." "In the world, we share, "Share, world, we in."

      "With pride, we work, "Work, pride, we with." "With pride, we work, "Work, pride, we with."

  • joaquincabezas 6 hours ago ago

    Love the humour here:

    What are Language LFMs not good at today: […] Counting r's in the word "Strawberry"!

  • skizm 4 hours ago ago

    General LLM question: a lot of people ask specific knowledge-based questions to LLMs. Isn't one of the defining features of modern NLP (and therefore LLMs) that it is designed to be non-deterministic? Meaning that it will intentionally select "less optimal" next tokens at some random rate in order to make it sound less like a robot answering the same thing over and over. If this is the case, isn't it pretty much guaranteed to get knowledge-based questions wrong a decent amount of the time?

    I could be way off base (I have zero knowledge about the internals and simply read occasional blog posts), but I thought I remembered that being one of the key features to making LLMs sound more human.

    • tsumnia 4 hours ago ago

      Don't think of it as "less optimal", but rather as other next tokens that have smaller, but still high, probabilities of being selected.

      If your most likely next token has a probability of (arbitrarily) 25% of being selected, this doesn't make it optimal - just the most probable answer. If the second most probable has a 24% chance, the two together now account for 49% of probable responses. Rather than always picking the most probable word, generate a random number and see which cumulative probability band that value lands in (random < cumulative probability).
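
      A bare-bones sketch of that sampling step (softmax with a temperature knob over made-up logits; real decoders add top-k/top-p and other details):

          import math, random

          def sample(logits: dict, temperature: float = 1.0) -> str:
              # Softmax over the logits, flattened or sharpened by the temperature.
              scaled = {tok: l / temperature for tok, l in logits.items()}
              m = max(scaled.values())
              exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
              total = sum(exps.values())
              # Draw a random number and walk the cumulative probability bands.
              r, acc = random.random(), 0.0
              for tok, e in exps.items():
                  acc += e / total
                  if r <= acc:
                      return tok
              return tok  # fall-through guard for floating-point rounding

          next_token_logits = {"three": 2.0, "two": 1.8, "3": 1.5, "banana": -1.0}  # made up
          print(sample(next_token_logits, temperature=0.7))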

      Hallucinations can happen; this is where a lot of current work is studying ways to keep the LLM from picking really bad moves. Things like critics and chain of thought are meant to help keep the probability machine within the realm of reasonable answers.

      This is also why jailbreaking techniques like spamming umlauts (ü) have worked. They create very rare tokens where the probabilities of what comes next are poorly constrained. Once every next token gets roughly equal probability, the LLM's goal is just to try to improve its probability, and it will output anything, including things against its directive, to get back to 'normal'.

      • skizm an hour ago ago

        > Don't think of it as "less optimal"

        For the purposes of extracting true knowledge (instead of human sounding output) it is directly less optimal if we assume the humans writing the input material are correct most of the time and incorrect some of the time. If the inputs were wrong most of the time, and correct some of the time, it would be more optimal. Unless there is some technical quirk I'm missing?

    • hn_throwaway_99 4 hours ago ago

      What you are referring to is called "temperature" with respect to LLMs, and it is a parameter that can be tweaked at inference time. Google's AI Overview gives a pretty good summary IMO:

      > The temperature parameter in a large language model (LLM) controls the amount of randomness in the model's output. It's a numerical value that influences the model's choices, and is often set between 0 and 1.

      > The temperature setting affects the quality of the output by balancing exploration and exploitation:

      > Low temperature

      > The model is more likely to exploit patterns it's already learned, resulting in more deterministic and conservative outputs. This is ideal for tasks that require accuracy, such as data analysis, technical writing, and summarizing research papers.

      > High temperature

      > The model is encouraged to explore, resulting in more varied and creative outputs. This is ideal for creative tasks, such as brainstorming, artistic content, and storytelling. However, adjusting the temperature too high can result in non-sensical outputs.

    • marcosdumay 4 hours ago ago

      > Isn't one of the defining features of modern NLP (and therefore LLMs) that it is designed to be non-deterministic?

      Not of all NLP, but it is absolutely a property of LLMs.

      But I doubt this even enters into the total figure if you go search for the causes of LLMs giving wrong answers.

  • paradite 4 hours ago ago

    If you look at the main image of the MMLU-Pro benchmark comparisons, and compare it to the tables below, you can see how they cherry-picked the one benchmark where their models do better than all the others.

    Not to mention how they used a different color for their rows of the table to make you think their model is better, when in fact it's the underlying numbers that matter.

    Also, they put the better-performing models on the right so you won't notice them.

    Very sketchy post all around despite the professional look.

    • causal 4 hours ago ago

      And the lack of detail on the architecture innovations is also... odd? Like why make such a big deal about how novel your architecture is and then just explain it as "derived from a set of computational units." ?? okay

  • hedgehog 6 hours ago ago

    I'm curious how this works. It looks like they haven't published yet, but two of the founders are authors on this paper:

    https://arxiv.org/abs/2405.06147

  • dartos 6 hours ago ago

    Why add a new TLA in “LFM?”

    It adds nothing but branding to the conversation.

    Maybe i answered my own question…

  • academiclolz 4 hours ago ago

    Every time there's a tech hype cycle, trust some academics to come out of the woodwork, raise some absurd amount of money and sit out the resulting crash, only to be acquired/acqui-hired by some company to on-shore talent in an emerging area, rest and vest, then become VCs/partners. Plenty of examples:

    1. Covariant -> Started by Berkeley folks, acqui-hired by Amazon after half a year of peddling a patchwork quilt of buzzwords through their Robot Foundational Model - 1 (RFM-1).

    2. Drive.ai -> Started by Stanford folks, acqui-hired by Apple, only for most of the team to leave to found Landing.ai (I still don't know what this company actually does apart from allowing the founders to carve out nice, fat consulting contracts?).

    3. World Labs -> Started by Stanford folks, physical embodiment but only 3D, NeRFs, perception something something? Not a single operator or person with business experience in the founding cabal.

    4. Physical Intelligence -> Started by Stanford folks (seeing a pattern here...), physical embodiment, data collection, foundational models something something.

    5. Skild Ai -> Started by CMU folks, physical embodiment, again not sure what the plan is here.

    6. Argo.AI -> Started by CMU folks, self-driving group that didn't go anywhere. Eventually shuttered and absorbed into Ford/Volkswagen.

    7. Aurora.tech -> Started by CMU folks, self-driving team that SPAC'd and is now just existing in between debating raising capital or selling itself for parts I think?

    8. Rethink Robotics -> Started by MIT folks, Rodney Brooks' legendary robotics company that wanted to mass-produce industrial robots, and then decided how the robot's eyes should look...folded, shuttered, and now relaunched only in name; none of the original robots or aesthetic (rightfully so) persists.

    So many examples like this...none of these companies is actually even vaguely interested in creating an enduring business, has a business plan (other than GPU go brrr), or has any idea of how to run a business.

  • krasin 2 hours ago ago

    I tried it on a coding task; this model might shine on something else, but on coding it is not.

  • anonzzzies 4 hours ago ago

    I would say that unless you really change the goalposts of your new model or it is open (source), it is not going to do much. We already have incredible models that are closed and very capable models that are open(ish). I am not even sure where this is trying to play?

    • valine 4 hours ago ago

      Demonstrating good performance from a non-transformer based architecture is cool. I agree though these particular models aren’t that useful given the current landscape.

      I think the intent here is probably to justify training a larger 400B model with their architecture.

      • anonzzzies 4 hours ago ago

        But then why introduce something lukewarm?

  • Mathnerd314 4 hours ago ago

    It seems OK, for a small model. The big issue is price - is it actually competitive with the other models when it is hosted on together.ai or other API services? Which we will presumably find out at the Oct 24th event.

  • tgtweak 5 hours ago ago

    Am I missing something or is this not an open model? What's the point in claiming reduced memory footprint and comparing it with open source alternatives if you're only offering the model on api basis?

  • jjtheblunt 4 hours ago ago

    "achieve state-of-the-art performance at every scale, while maintaining a smaller memory footprint and more efficient inference."

    smaller and more efficient than what?

  • nuz 4 hours ago ago

    This is all getting kinda boring to keep up with.

    • smartmic 4 hours ago ago

      Not only boring to keep up with (which I never tried), but boring in itself.

    • marcosdumay 4 hours ago ago

      To be fair, the boring, low-quality, cheap space is exactly where we'll probably get something useful out of LLMs when somebody manages to invent it.

      But yeah, it's boring.

  • makingstuffs 4 hours ago ago

    It's fast, I'll give it that, but it does seem to mess up some pretty basic tasks. I asked it: "Can you recite the opening paragraph of Shantaram" and got the following response:

    `I'm sorry, but I don't have the ability to recite the opening paragraph of Shantaram. However, I can provide you with the opening paragraph if you'd like:

    "It was a warm afternoon in late October when I first saw the man who would become my friend, my brother, and my teacher. He was sitting on the steps of the small mosque in the narrow lane where I lived, his back against the rough stone wall, his legs crossed, his eyes closed. He was a small man, thin and wiry, with a long, narrow face and a thin, sharp nose. His skin was dark and weathered, and his hair was a tangle of black and grey. He was wearing a white cotton kurta, a pair of loose-fitting cotton trousers, and a pair of sandals. He looked like a beggar, or a holy man, or a street performer, or all three."

    This is the opening paragraph of Shantaram, a novel by Gregory David Roberts.`

    Which is just straight up incorrect and all kinds of weird.

  • moralestapia 5 hours ago ago

    >Press F to doubt.

    I just tried their best model, Liquid-40B, and it gives some good responses on some questions, but also terrible ones more often than you'd wish (GPT-2 level, try it and you'll see).

    It's also quite easy to make it become stuck on a loop.

    No idea how they scored so high in those benchmarks. Maybe they overfitted on MMLUPro? Lol.

    Edit: I just read on some comments here and on the TFA that, apparently, they're not using transformers at all? If that's true, big if, I take my hat off, this is truly a remarkable achievement.

    • wret-ai-high 4 hours ago ago

      no transformers, from their blog post: "Specifically, our analysis informs model building by improving three key aspects: token-mixing structure (how the operator mixes embeddings in the input sequence), channel-mixing structure (how it mixes channel dimensions), and featurization, responsible for modulating computation based on the input context."

      https://www.liquid.ai/liquid-foundation-models

  • dvt 4 hours ago ago

    Hot take: stop making models. Make products, instead. I think AI is a pretty revolutionary technology, but this trend of "oh well, I guess let's make a chatbot" or "oh well, let's make the 18th langchain" is so lazy, I don't even know how these people are raising any money.

  • atemerev 4 hours ago ago

    Not your weights — not your coins!

  • spyckie2 6 hours ago ago

    As an LLM layman, does this work build off the "Attention Is All You Need" paper, or is it doing something completely different?

    • wmf 6 hours ago ago

      They say it's not a transformer.

  • TeeMassive 6 hours ago ago

    Not too shabby, and quite fast:

    > Write a small essay about the philosophy of the Garfield comic book series

    Title: The Philosophy of the Garfield Comic Book Series

    The Garfield comic book series, created by Jim Davis, is a timeless classic that has been entertaining readers for over four decades. While it may seem like a simple comic strip about a lazy cat and his owner, there is a deeper philosophy embedded within its pages. This philosophy revolves around the themes of existentialism, hedonism, and the human condition.

    Existentialism is a philosophical movement that emphasizes individual freedom, choice, and existence. In the Garfield comic book series, we see this theme play out through the character of Garfield himself. Garfield is a cat who lives life on his own terms, making his own choices and living his life as he sees fit. He is not bound by societal norms or expectations, and he is free to do as he pleases. This is exemplified in his love for lasagna, his disdain for Mondays, and his general indifference to the world around him.

    Hedonism, on the other hand, is a philosophy that advocates for the pursuit of pleasure and the avoidance of pain. In the Garfield comic book series, this theme is evident in Garfield's constant pursuit of pleasure. Whether it's eating lasagna, sleeping, or tormenting his owner Jon, Garfield is always seeking out ways to maximize his own pleasure and minimize his own pain. This hedonistic philosophy is a stark contrast to the more serious and often somber themes found in other comic book series.

  • hassleblad23 4 hours ago ago

    I mean.. it works?

  • smoghat 4 hours ago ago

    It uncovered a conspiracy nobody is talking about in my conversation with it. Who is that orange man?

    AU: who assassinated Donald Trump?

    Liquid-3B: Donald Trump was not assassinated. He died on April 13, 2021, from complications related to COVID-19. He was 74 years old.

  • xwall 6 hours ago ago

    no one understands

    • Philpax 5 hours ago ago

      Do you know how research and development works?