34 comments

  • wkat4242 3 hours ago ago

    LLMs were never designed for this. In Apple's language: "you're holding it wrong".

    It's an impressive technology, but its limits are widely overlooked in the current hype cycle.

    AI researchers have known this from the start and won't be surprised, because LLMs were never intended to do this kind of thing.

    The problem is the customers who are impressed by the human-sounding bot (sounding human is exactly what an LLM is for) and mentally ascribe human skills and thought processes to it. And start using it for things it's not, like an oracle of knowledge, a reasoning engine or a mathematics expert.

    If you want knowledge, go to a search engine (a good one like Kagi), which can be AI-assisted like Perplexity. If you want maths, go to Wolfram Alpha. For real reasoning we need a few more steps on the road to general AI.

    This is the problem with hype cycles. People treat a technology as the be-all and end-all for everything and stop regarding its limitations. The metaverse hype had the same problem, even though there are some niche use cases where it really shines.

    But now it's labelled a flop because the overblown expectations of all the overexcited investors couldn't be met.

    What an LLM is great at is the human-interaction part. But it needs to be backed by other types of AI that can actually handle the request, and for many use cases that tech still needs to be invented. What we have here is a toy dashboard that looks like a real car's, except it's not connected to one. The rest will come, but it'll take a lot more time. Meanwhile, making LLMs smarter will not really solve the problem that they're inherently not the tool for the job they're being used for.

    • namaria an hour ago ago

      The fatal mistake of this AI cycle was calling LLMs AIs, and publishing impressive chatbots before they were wired up to do anything more useful than pass the Turing test.

      It made some billionaires who will argue it was a tremendous idea. But in the long term I think it will cause another AI winter that will dry up funding for useful research that would take longer to mature. Or maybe it's just like fusion... promising on paper but so incredibly expensive to handle as to render it useless in practice.

    • sim7c00 3 hours ago ago

      LLMs aren't a flop. They make for great chatbots when adequately trained with billions of dollars. Fish clearly go m000! ChatGPT even has it in the name, yet people are blinded by the greed of others...

  • gota 11 hours ago ago

    This seems to be a comprehensive repeat of the "Rot13" and "Mystery Blocks world" experiments as described by Prof. Subbarao Kambhampati

    Rot13 meaning that LLMs can't do Rot 3, 4, ..., n except for Rot13 (because that's in the training data) - a quick rot-n sketch is at the end of this comment

    Mystery Blocks World being a trivial "translation" (by direct replacement of terms) of a simple Blocks World. The LLMs can solve the original, but not the "translation" - surprisingly, even when provided with the term replacements!

    Both are discussed in Prof. Kambhampati's Machine Learning Street Talk episode
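
    Since rot-n comes up above: it is just a fixed alphabet shift. A minimal Python sketch, purely to make the task concrete (my own illustration, not taken from the paper or the episode):

        def rot_n(text, n):
            # Shift each letter n places, wrapping around the alphabet;
            # non-letters pass through unchanged.
            out = []
            for ch in text:
                if ch.isalpha():
                    base = ord('A') if ch.isupper() else ord('a')
                    out.append(chr((ord(ch) - base + n) % 26 + base))
                else:
                    out.append(ch)
            return "".join(out)

        print(rot_n("hello", 13))  # uryyb -- the Rot13 case that is in the training data
        print(rot_n("hello", 7))   # olssv -- the Rot-n cases the models reportedly miss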

  • rahimnathwani 9 hours ago ago

    Discussed the day before yesterday: https://news.ycombinator.com/item?id=41823822

    And the day before that: https://news.ycombinator.com/item?id=41808683

  • airstrike 9 hours ago ago

    > OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic.

    In other words, ChatGPT continues to dominate. A 0.3-point drop might as well be noise.

    Also the original, allegedly more expensive GPT-4 (can we call it ChatGPT-4og??) is conspicuously missing from the report...

  • jokoon 7 hours ago ago

    Finally, some people are using basic cognition science to evaluate AI

    Also they mapped an insect brain

    Seems like my several comments suggesting AI scientists should look at other fields did get some attention.

    That probably makes me the most talented and insightful AI scientist on the planet.

  • bubble12345 10 hours ago ago

    I mean so far LLMs can't even do addition and multiplication of integers accurately. So we can't really expect too much in terms of logical reasoning.

    • boroboro4 9 hours ago ago

      Can you multiply 1682671 and 168363 without pen and paper? I can’t. LLMs can if you force them to do it step by step, but can’t in one shot.

      • janalsncm 3 hours ago ago

        For logical reasoning tasks you should use pen and paper if necessary, not just say the first thing that comes to mind.

        Comparing one-shot LLM responses with what a human can do in their head doesn’t make much sense. If you ask a person, they would try to work out the answer using a logical process but fail due to a shortage of working memory.

        An LLM will fail at the task because it is trying to generate a response token by token, which doesn’t make sense for this problem. The next digit in the number can only be determined by following a sequence of logical steps, not by sampling from a probability distribution over next tokens. If the model were really reasoning, the probability of each incorrect digit would be zero.
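
        To make that "sequence of logical steps" concrete, here is a rough Python sketch of the schoolbook procedure a chain-of-thought trace has to reproduce (just an illustration of the steps, not a claim about how any model works internally):

          def long_multiply(a, b):
              # Schoolbook multiplication: one partial product per digit of b,
              # shifted by its place value, then summed -- the "pen and paper" steps.
              total = 0
              for place, digit in enumerate(reversed(str(b))):
                  partial = a * int(digit) * 10 ** place
                  print(f"{a} x {digit} x 10^{place} = {partial}")
                  total += partial
              print("sum of partials =", total)
              return total

          long_multiply(1682671, 168363)  # 283299537573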

      • Tainnor 8 hours ago ago

        No, but you can say "I don't know", "I can't do this in my head", "Why is this important?", "Let me get my calculator" or any other thing that is categorically more useful than just making up a result.

        • solveit 6 hours ago ago

          It's relatively trivial to get an LLM that does that and every big lab has one, even if they're not selling them.

          ChatGPT 4o as of right now just runs Python code, which I guess is "Let me get my calculator" (see https://chatgpt.com/share/670df313-9f88-8004-a137-22c302f8bf...).

          Claude 3.5 just... does the multiplication correctly by independently deciding to go step-by-step (don't see a convenient way to share conversations, but the prompt was just "What is 1682671* 168363?").

      • serf 9 hours ago ago

        It's a weird distinction - part of how they do that is by reading back what they've already said. Someone trained to do the same could essentially exploit this characteristic to do the math in a simplified step-by-step way, if they had perfect recall of what they said or wrote.

        In other words, for the LLMs that do that kind of thing well, like o1, don't they essentially also use 'pen and paper'?

        • boroboro4 9 hours ago ago

          And this is a very good comparison, because o1 does indeed multiply these numbers correctly...

          Asking LLMs without built-in chain of thought is the same as asking people to multiply these numbers without pen and paper. And LLMs with chain of thought actually are capable of doing this math.

      • myflash13 6 hours ago ago

        Pen and paper? LLMs are literally a computer program that cannot compute.

        • moi2388 5 hours ago ago

          But it can call into systems that can do compute.

          Do you think your inner monologue is any different? Because it sure as hell isn’t the same system as the one doing math, or recognising faces, or storing or retrieving memories, to name a few

        • carlmr 5 hours ago ago

          The comparison makes sense though. We're trying to build a simulated brain. We want to create a brain that can think about math.

          And chain of thought is kind of like giving that brain some scratch space to figure out the problem.

          This simulated brain can't access multiplication instructions on the CPU directly. It has to do the computation via its simulated neurons interacting.

          This is why it's not so surprising that this is an issue.

          • namaria an hour ago ago

            LLMs are not simulating brains in any capacity. The words 'neural network' shouldn't be taken at face value. A single human neuron can take quite a few 'neurons' and layers to simulate as a 'neural network'.

          • ulbu an hour ago ago

            Does it have an understanding of the strict rules that govern the problem, and that it needs to produce a result in total accordance with them (accordance which is not a matter of degree, but boolean)? I.e., can it apply a function over a sentence?

            I don’t know, that’s why I ask.

            • ThunderSizzle 26 minutes ago ago

              The answer is sometimes. Typically it'll have forgotten the rules you've given it by the time they'd be useful, because of the context limit of LLMs. Either way, you basically need to be able to tell when it's hallucinating so you can keep applying more rules.

      • tanduv 4 hours ago ago

        yea, but I'm able to count the number of r's in 'strawberry' without second guessing myself

        • mewpmewp2 4 hours ago ago

          Except o1 can do that, and previously GPT could also do it if you asked it to count character by character while keeping count.
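
          For what it's worth, "count character by character while keeping count" is literally this loop; a trivial Python sketch of the procedure being asked for:

            count = 0
            for ch in "strawberry":
                if ch == "r":     # check each character in turn
                    count += 1    # keep a running count
            print(count)  # 3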

      • blitzar 3 hours ago ago

        282399355737 - My answer is not wrong, I was hallucinating.

      • akomtu 5 hours ago ago

        LLMs have pen and paper: it's their output buffer, capped to a few KBs, which is far longer than necessary to multiply the two numbers.

        If you tell an LLM to explain how to multiply two numbers it will give a flawless textbook answer. However when you ask it to actually multiply the numbers it will fail. LLMs have all the knowledge in the world in their memory, but they can't connect that knowledge into a coherent picture.

        • namaria an hour ago ago

          They have codified human knowledge in human language, represented by arrays of numbers. They can't access that knowledge in any meaningful way, they can just shuffle numbers to give the illusion of cogency.

        • auggierose 2 hours ago ago

          Does that make an LLM the perfect academic?

  • cyanydeez 12 hours ago ago

    The best thing LLMs do is add to the theory that there are p-zombies among the population.

    Instead of the dead Internet theory, we should start finding out what percent of the population is no better than an LLM.

    • jnwatson 10 hours ago ago

      The next question is what percent of jobs can be performed by such p-zombies? The answer is probably pretty high.

    • shermantanktop 10 hours ago ago

      Well, now your comment is part of the corpus that will get sucked up and trained on. So retroactively your comment won’t qualify as better, I guess.

      • Aeglaecia 4 hours ago ago

        all the more incentive to behave like an illiterate dick

    • Aeglaecia 4 hours ago ago

      It's not a theory anymore - I'm starting to notice Reddit threads that are entirely AI-generated, from the content down to the comments.

      • n_ary a minute ago ago

        What beneficial outcomes are gained from such behavior by the contributors?

        The real fun is in intellectual engagement, but if the thread is generated by bots and commented on by bots as well, all I can see is a fake depiction of activity.

        However, I understand that my perspective on what counts as beneficial activity could be limited.

      • jprete an hour ago ago

        I believe this is true, but I'm curious to see an example, if you have one.

  • fungiblecog 2 hours ago ago

    No shit!