Marcus AI Claims Dataset

(github.com)

61 points | by davegoldblatt 11 hours ago

59 comments

  • dvt 10 hours ago

    I'm still not sure I fully understand the methodology. For example, if Marcus makes the claim "OpenAI sucks!", why would OpenAI's blog ever corroborate that? The sources used are all AI company blogs (Anthropic, Google, OpenAI) filled with inoffensive corpo-speak, likely written to be as middle-of-the-road as possible. In fact, I'd need an A/B test to make sure the LLM itself can properly rate various claims (positive, negative, and neutral) against such corporate sludge.
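    To make the A/B idea concrete, here's a minimal sketch (all names and data are made up for illustration) of checking whether a rater's verdicts track human gold labels equally well across claim tones:

```python
from collections import defaultdict

def agreement_by_tone(rows):
    """rows: (tone, gold_verdict, llm_verdict) triples.
    Returns, per tone, the fraction of claims where the LLM matched the gold label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tone, gold, llm in rows:
        totals[tone] += 1
        if gold == llm:
            hits[tone] += 1
    return {tone: hits[tone] / totals[tone] for tone in totals}

# Toy sample: this rater nails neutral claims but coin-flips on negative ones.
sample = [
    ("neutral", "supported", "supported"),
    ("neutral", "refuted", "refuted"),
    ("negative", "supported", "refuted"),
    ("negative", "supported", "supported"),
    ("positive", "refuted", "refuted"),
]
scores = agreement_by_tone(sample)
```

    If the per-tone numbers diverge sharply, the rater is reacting to tone rather than evidence, which is exactly what I'd worry about with corporate-blog sources.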

    Small aside: I'm only bringing this up because last year I worked on a game where you had to solve various moral dilemmas in a 1v1 situation (think trolley problem, where one player says "flip the switch" and the other says "don't flip the switch")—the idea was to get an LLM to rate the arguments in a fun turn-based online game. I built it out, but I kind of gave up when I realized how absolutely awful the LLM was at actually rating arguments and their nuances. Who won legitimately felt more like a dice roll than a verdict from a real judge or a philosophy professor grading a paper. I put that project aside, but might do a Show HN at some point since the game is basically done.

    Adjudication[1]—which is the real meat of this project—is done in a very partial way, and I genuinely see basically zero value in it. Why not crawl Reddit (or HN)? I know that also has issues, but it at least has more variety of tone.

    [1] https://github.com/davegoldblatt/marcus-claims-dataset/blob/...

    • davegoldblatt 10 hours ago

      [flagged]

      • dvt 10 hours ago

        Gotcha, but I'm just trying to see the audit trail for how claim X was rated and based on what sources. If we're looking at the Claude logs, we have huge files with entries like this[1]:

            {"id": "claim_0081", "date": "2023-02-11", "claim": "Current Level 2 self-driving operates under easy conditions and is nowhere close to handling real-world complexity.", "type": "descriptive", "target": "Level 2 self-driving", "status": "supported", "horizon": null}
        
        Why is this supported? How is this supported? Waymo would probably disagree, etc. Here's another one:

            {"id": "claim_0083", "date": "2023-02-11", "claim": "Tesla's product naming ('Autopilot', 'Full Self Driving') misleads customers into thinking the cars are more capable than they are, potentially causing accidents and deaths.", "type": "causal", "target": "Tesla marketing", "status": "supported", "horizon": null}
        
        I fully agree that TSLA engages in all kinds of deceptive marketing, but marking the stunning claim that it potentially causes deaths as fully supported is, uh, a bit much. I mean, at least tell me who's saying this. What's the provenance?

        If Claude itself rated the claims, which seems to be the case unless I'm totally off base, I fail to see how we're actually doing anything at all here. Right now I'm working on a local research agent, and I'm being absolutely meticulous about storing browsed webpages, snippets, etc. into short-term (session) LLM memory or a long-term (cross-session) SQLite db.

        [1] https://github.com/davegoldblatt/marcus-claims-dataset/blob/...
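        As a rough sketch of the kind of provenance store I mean (schema and names are purely illustrative):

```python
# Every browsed page is kept with its session id and timestamp, so any later
# claim can be traced back to the exact snippet that supported it.
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # use a file path for a real cross-session db
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL,
        url TEXT NOT NULL,
        snippet TEXT NOT NULL,
        fetched_at REAL NOT NULL
    )
""")

def remember(session_id, url, snippet):
    # Record one browsed page in the long-term store.
    conn.execute(
        "INSERT INTO pages (session_id, url, snippet, fetched_at) VALUES (?, ?, ?, ?)",
        (session_id, url, snippet, time.time()),
    )

def provenance(keyword):
    """Return (url, snippet) rows whose snippet mentions the keyword."""
    cur = conn.execute(
        "SELECT url, snippet FROM pages WHERE snippet LIKE ?", (f"%{keyword}%",)
    )
    return cur.fetchall()

remember("s1", "https://example.com/waymo", "Waymo expands driverless service area")
rows = provenance("Waymo")
```

        With something like this, "why is claim X supported?" has a checkable answer instead of a bare verdict.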

  • ripped_britches 10 hours ago

    Strange, I had the same thought about doing this exact exercise this weekend.

    I think the overall percentage is the wrong approach here.

    It’s easy to say a lot of things that are factually true or predictions that are inevitably true.

    However, the more salient point with Gary Marcus is the one unforgivable thing he was wrong about, and continues to double down on: that deep learning is hitting a wall.

    Starting in early 2022 and going through today, there is still so much low hanging fruit with deep learning.

    Today’s LLM progress is mostly being made in RL. But world models are also still so early and they’re deep learning all the way down.

    It would be nice if he would just admit he was wrong.

    • lambdasquirrel 10 hours ago

      Depends on how you look at it. In terms of overcoming fundamental limitations, I would argue it has indeed hit a wall. ChatGPT is how old, but LLMs still can't actually count?

      But then, to your point, what does it matter, if they're still as useful as they are? Even at this stage, Claude Code makes Jira halfway bearable.

      Of course, we have to consider the devil's advocate as well. Most CEOs don't seem to be reporting great ROI on their "AI" investments.

    • emp17344 10 hours ago

      Is “world models” even a real thing, or just the latest AI buzzword?

    • 10 hours ago
      [deleted]
    • cl42 10 hours ago

      I'll add one more point. If you scroll through his Substack, a lot of his posts are incredibly negative and unproductive. I was (and continue to be) someone who cares deeply about responsible AI... but there's a difference between working on AI responsibly and pushing the debate forward, versus simply dismissing everything that is done as folly, useless, crap, etc.

    • davegoldblatt 10 hours ago

      [flagged]

  • camerons03 10 hours ago

    Piping a few hundred Substack posts through Claude and ChatGPT and slapping a "hybrid reconciliation layer" on top doesn't magically turn token prediction into empirical evidence.

    Someone is so thin-skinned about a single guy writing a skeptical Substack that they spent their weekend building a dual-pipeline automation tool, scraping four years of his writing, instead of just building a product that actually disproves him. I’m not saying I agree with everything the man says, but until a human actually verifies these verdicts, this is just burnt tokens.

  • albatross79 11 hours ago

    Sounds about right. Boosters are always vaguely claiming he's been obviously and ridiculously wrong, but when you actually listen to him, he's tracked very well with the state of AI. GPT-5 was supposed to be AGI, remember?

    • logicprog 10 hours ago

      I think it's pretty clear that he hasn't kept up with the state of the art.

      He has no idea what coding agents are capable of or how useful they are; he doesn't pay attention to any of the contributions to math or science that these models are making; he continually insists that because agents aren't ready to face customers in uncontrolled environments, they're completely useless even for employees and workers; just last year he posted an article complaining that LLMs don't use web search to find information (he asked for information about a friend), when almost all of them do now, even in their default interfaces; he still thinks hallucinations are a weighty problem in things like mathematics and programming, where it's very easy to verify exactly the kinds of things hallucinations would get wrong; and I think he still adheres to the stochastic-parrot mindset even though that's not even the most relevant part of their training anymore.

      Most importantly, although he seems to have made a single Substack post making this argument, it doesn't seem to have really percolated through the rest of his thinking: the cutting edge of LLMs right now, agents, are actually exactly the kind of neurosymbolic system he wants. Neural networks provide the interface with the outside world and a creativity and problem-solving engine for the fuzzy pattern matching and adaptability that is needed, while symbolic code-based systems ensure that guardrails are enforced, requirements are met, accurate information is provided, and so on. I think his objection might be that the problem-solving and reasoning engine at the core is still an LLM. But the thing is, you need the kind of pattern matching, flexibility, and adaptability that you get from an LLM to drive things, or the end result is just an expert system with a slightly better natural language interface pasted on. And I think it's pretty clear at this point that expert systems are dead. They haven't done anything remotely as interesting or useful as what we're seeing LLMs do.

      I think, like another commenter says, that his whole shtick is pointing out obviously true basic features of LLMs (that they hallucinate, that they don't perfectly adhere to prompt guardrails, that there's too much hype in the industry right now, that a lot of the companies suck in a vaguely standard big-tech Silicon Valley way) and extrapolating to some broader point, which is that everyone should have listened to him and done what he said when he wrote that book back in the 90s (iirc).

      • albatross79 10 hours ago

        I think his claim basically boils down to "if you're expecting AI, LLMs don't cut it". And I think he's basically right on that count. There's a lot of tooling and harnessing being put in place to course correct them on the job, and from the other angle standards are simply being lowered to accommodate them. So they can be made to be useful, but they're still not what you would want from an actual AI. Marcus wants to augment them with symbolic AI. I don't know how feasible that is, but he's not fundamentally against AI, he's just against the notion that LLMs are AI. Which given how they've been marketed and how the public is encouraged to think about them, is a worthwhile point to make.

        • logicprog 10 hours ago

          I used to be a Gary Marcus fan, but I guess what confuses me is...

          I'm not really sure at that point what 'actual' AI means?

          It seems like the definition of actual AI is something like perfect AI — it has to be fully observable, interpretable, reason perfectly, have perfect factual recall, continual learning, infinite context windows, perfect instruction following, and so on. I feel like at that point, maybe nothing could ever be 'actual' AI?

          We typically use AI to mean some kind of algorithm or program that lets computers do intellectual work that was previously considered to be the exclusive domain of humans, especially if it involves problem solving or pattern matching or reasoning. Just look at Donald Knuth's recent posts about what Claude was able to do — seems like AI to me?

          Yeah, it's imperfect AI, but it's still AI. And it's not clear to me that the imperfections LLMs have mean they can't be extremely useful and revolutionary as a form of AI. Yes, they make weird mistakes a lot, and they don't think at all like humans do. But I am of the opinion that there are a lot of forms of intelligence, and human intelligence is just one of them. And every kind of intelligence comes with its own gamut of characteristic errors, blind spots, and biases. The fact that LLMs have issues different from those of human intelligence, and also different from what classical computers struggle with, doesn't discount them from being intelligent to me.

          I also find the framing very odd where agentic harnesses are bolted onto LLMs in order to "make them useful", but the agentic harness plus the LLM doesn't count as an AI system itself. It's pretty clear to me, at least, that "the AI", if you want to talk about it, is the neurosymbolic cybernetic feedback system that combines the harness and the LLM.

          The LLM is only the fuzzy pattern-matching logic and creativity core. The harness provides verification feedback loops, the ability to interact with and explore the outside world, the ability to bring in programming-language interpreters for more rigid symbolic logic, observability, systems for storing and recalling memory for continual learning, and so on. I think a lot of these, especially the feedback loops, resolve issues that LLMs seem to inherently face, such as hallucinations.

          Moreover, LLMs are now substantially trained with writing code and using tools and interacting with the world and existing in harnesses in mind. At this point, I would have to guess that more than half of their training is actually devoted to rewarding them for correctly using all of these symbolic tools and solving problems in a simulated world than just predicting the next token.

          I also think that LLMs, as a sort of core engine of an agentic harness, are allowing computers to do things we'd never really dreamed they could do before, that symbolic systems by themselves never really achieved, and as I said before, if you're looking for neurosymbolic AI — as Marcus says he is — then this is basically how it's going to have to look unless you want to fall down the expert system rabbit hole again.

          • albatross79 10 hours ago

            [flagged]

            • logicprog 9 hours ago

              That's because of a fundamental perceptual limitation — tokenization. That's like saying that humans aren't intelligent because we can't perceive UV light. That's a shallow gotcha at best, it doesn't get to the heart of the matter at all.

              Also, models will just write code for that sort of thing now.

              • albatross79 9 hours ago

                The heart of the matter is a system that produces correct output in a wide variety of scenarios, including ones outside our scope of experience or understanding. That would be one definition of AI. LLMs are not that, and I don't understand the desperate insistence that they are. Just accept what it is. A good auto-complete is useful; a crappy AI is a marketing scam.

                • logicprog 9 hours ago

                  > a system that produces correct output in a wide variety of scenarios, including ones outside our scope of experience or understanding

                  Where does this definition come from? I certainly don't agree with it, and I am not sure who does, besides yourself and Marcus. Also it seems that you're saying AI does, in fact, mean 'perfect' AI, basically?

                  > A good auto complete is useful, a crappy AI is a marketing scam.

                  LLM agents do a lot more than auto-complete now, and using them less like auto-complete and more like 'AI' (via agents) has actually made them more useful and less crappy! Also, I don't think framing how modern RLVR'd LLMs operate as auto-complete even makes a whole lot of sense in the first place.

                  • albatross79 8 hours ago

                    It's a common sense definition that covers the broad features of AI as most people would have imagined it before LLMs came along. If you have a better one I'm open to it.

                    Also, agents are just LLMs with harnessing. Attaching a plow to a horse makes it more useful too, but it doesn't change the intrinsic nature of a horse.

              • XenophileJKO 9 hours ago

                At some point you just have to stop responding to these "stochastic parrot/auto-complete" people.

                It isn't worth your intellectual bandwidth. They will eventually understand or they won't (which, I'm not sure how that is going to work for them... but the Amish had to start somewhere, I suppose).

        • antonvs 9 hours ago

          > "if you're expecting AI, LLMs don't cut it". And I think he's basically right on that count.

          This is one of those comments whose truth value depends entirely on a constantly shifting definition of “AI”.

          The ability of modern models to functionally understand, answer questions, and make recommendations about software codebases is superhuman at this point, relative to most human software developers. What is that, if not artificial intelligence?

          Perhaps you’re thinking of something more like AGI, but even there the terminology is loaded and ambiguous. The models are general enough to answer questions well on a vast range of subjects, and they exhibit understanding (again, functionally speaking this is true - whether someone wants to call them stochastic parrots is beside the point.) The appellation of “intelligence” applies just as well as in the coding case, it’s artificial, and it’s general.

          > a worthwhile point to make.

          I disagree. Without clear, justified definitions, it’s an incoherent, poorly specified point that seems to be driven by a desire to maintain a specific conclusion regardless of the evidence.

          • albatross79 9 hours ago

            [flagged]

            • logicprog 9 hours ago

              I'm not sure what you mean by "everyone knew what AI meant"? It's always been an extremely vague term, since its inception — and LLMs do, actually, enable computers to do things I, at least, would've called basically AGI in 2020.

              • albatross79 9 hours ago

                There's nothing vague about it. Intelligence is understanding. I think you're just motivated to be confused because you want to fit this thing into that box, and it doesn't really fit. So maybe output is all that matters, maybe we're all auto-completers, maybe you can be intelligent without knowing, etc. etc.

                • logicprog 2 hours ago

                  I'm sorry, but very many systems have been called AI without having "understanding", and have been very useful without it. I'm not sure when that was ever part of the definition. Even Wikipedia gives a definition similar to my own, and in Empire of AI Karen Hao makes a very good historical case that there has never really been a clear definition of AI, and she's no fan of LLMs (or the field of AI in general, it seems).

            • antonvs 9 hours ago

              > Everyone knew what AI meant before LLMs came along

              Please do tell.

              > Why is everyone on the ground floor trying to defend what the guy in the penthouse is telling the investing public?

              I'm talking about observable capabilities of the technology. There are facts of the matter there that have nothing to do with what tech CEOs say. Pointing out the reality of the situation is not "defending" anyone.

              If you take a "team sports" approach to reality where truth is a function of which side you're on, then you're not going to be able to participate meaningfully in rational discussion.

              • albatross79 9 hours ago

                There's rational discussion and then there's motivated obfuscation. If your definition of AI is so loose that it applies to a system that can't count objects reliably, then what is there to discuss? Is a broken clock intelligent because it tells the time correctly twice a day?

      • nurettin 10 hours ago

        > he doesn't pay attention to any of the contributions to math or science that these models are making

        Ok but why report PR pieces as evidence for LLMs being useful?

        These are tools that can possibly provide output that is eventually correct. It is the human behind the wheel doing the actual work.

        Give the tool to a lesser expert and you will get more garbage with fewer lucky shots.

        For the elite, it is a balancing act where, more often than not, the cost of making the LLM do the work is less than the cost of doing it yourself. If that holds more than 90% of the time, the tool is useful.

  • bananaflag 11 hours ago

    Well, the main thing he is known for is "it's all gonna crash", and that's the one claim this page admits he's wrong about.

    Everything else, yeah, he's right, and I never doubted. I agree LLMs are unreliable, insecure etc. But I don't deduce from that that they're gonna amount to nothing.

    • latexr 10 hours ago

      > Well, the main thing he is known for is "it's all gonna crash", and that's the one claim this page admits he's wrong about.

      Have there been specific claims about when it's going to crash? I find it hard to believe he claimed it was all going to crash by early 2026. Maybe I'm wrong; I haven't read all of his posts. But neither has the author: they admit in the repo this is all LLM, nothing verified by humans.

    • bayarearefugee 10 hours ago

      Obviously I don't speak for Gary Marcus, but I'd say the chances of an AI financial crash aren't really directly linked to whether or not LLMs amount to anything or nothing.

      From my perspective, it is basically guaranteed that LLMs will increasingly be seen as essential work tools for just about anyone doing knowledge work. So they won't amount to nothing.

      But it is not at all guaranteed that the frontier model companies who are currently burning billions of dollars chasing that will capture significant percentages of that value.

    • albatross79 11 hours ago

      Has he said they're going to amount to nothing?

  • atleastoptimal 11 hours ago

    His whole thing is to make obvious, incontestable claims about AI (LLMs make mistakes) and connect them to unfalsifiable grand prognostications (it's all gonna crash... any day now). It's the same tactic any preacher who harps on about the impending apocalypse uses.

    • oh_my_goodness 10 hours ago

      What's your take on TFA then?

      • atleastoptimal 10 hours ago

        His increasing accuracy derives from him wising up and criticizing LLMs only in domains where there is a consensus, while keeping his grander criticisms unfalsifiable.

      • 1123581321 10 hours ago

        I’m not the OP, and haven’t read much Marcus, but an analysis linking “chains of claims” together could be interesting, devaluing specific true claims when they are used to support a false claim. The claims dataset appears to evaluate each claim independently.

  • logicprog 10 hours ago

    I think the problem here is that most of his claims are obvious, uninteresting, and largely agreed with even by the biggest AI hype people: that AI hallucinates, that models don't perfectly follow guardrails in the system prompt, that they can be prompt-injected, or that OpenAI's financials look bad.

    But then, on the other hand, he completely ignores all of the developments in scaffolding around these systems that aim to resolve these problems, all of the changes in how these models are trained, all of the things they've actually been able to achieve and do, and basically all of the positive use cases that would balance out his criticisms.

    Since he doesn't really talk about any of that, of course he doesn't make false claims about it, he just ignores it, implicitly creating a false picture.

    And then it is this false picture that he uses to justify his grandiose claims about how everyone should have listened to him about how to do AI and these systems are inevitably going to turn out to be useless and the whole industry is going to collapse and fully disappear and society is going to be ruined and so on.

    So, of course, it looks like, on the one hand, all of his specific claims about AI are perfectly correct, and on the other hand, all of his grander claims about what that implies for the industry have turned out to be wrong, and he spends much more time on the latter than the former.

    I think it is really crucial to emphasize that even though most of the individual claims he makes are correct, he spends much more time on the prognostications that are fundamentally not correct, or at least are very speculative right now. I think that's an indication of something gone very wrong with someone's epistemic and incentive situation.

  • barbarr 11 hours ago

    Can't wait to see Gary Marcus's rebuttal

    • oh_my_goodness 10 hours ago

      His rebuttal? His critics' rebuttal?

      • barbarr 10 hours ago

        I feel he'd want to rebut the use of an LLM for this task to begin with (i.e. find issues/nitpicks with the LLM judgment whether it said he's right or wrong)

    • dakolli 10 hours ago

      Why would he need to make a rebuttal? They back up a lot of his claims and show he's becoming more accurate with each year that goes by. The areas where he's less accurate are largely predictions that haven't had time to come to fruition, or claims exaggerated for rhetorical purposes; sometimes you need to use hyperbole to get through to people.

      I don't know if this will cause a ton of capital destruction, I doubt it, it will probably destroy a bunch of the slot machine/gambling addicts who are paying 5k a month on their credit cards thinking an autocomplete API is going to provide a profitable business.

      A large part of this is a scam. Just like many aspects of crypto were scams while others were not, this hype is very similar to the NFT/crypto hype from 2018-2023. Yes, some genuinely useful things were born out of those industries, but a lot were not; it's the same with AI.

      As for a potential AI winter: I think there will be a "winter" just like crypto's, but even during crypto's winter some companies continued to operate and innovate while 90% disappeared. I believe the same thing will happen, and soon. Watch what happens to companies like Perplexity over the next 12-16 months lol.

  • cortesi 10 hours ago

    I'm sorry, but we can tell absolutely nothing about Gary Marcus from this. People should have a look at the final data:

    https://github.com/davegoldblatt/marcus-claims-dataset/blob/...

    Many of the "supported" claims here are vague, banal, obvious, or just opinion. E.g.

    "the general public hasn't quite realized what's not possible yet"

    "loads of things scale, but not at all"

    "To be sentient is to be aware of yourself in the world; LaMDA simply isn't."

    "To date, nobody, ever, has given a convincing and thorough account of how human children (and human children alone) learn language."

    "A cat holding a remote control shouldn't have a human hand."

    "What I didn't see last night was vision" (about Tesla Optimus)

  • d_silin 11 hours ago

    I don't think there will be any market crashes before the major AI companies do their IPOs, and then for some time after that (late 2027 to mid 2028).

    • dakolli 10 hours ago

      Well, many will crash before they ever get to IPO. Phind closed shop last month, a few weeks after raising millions. But yes, the areas where they claim he's wrong have largely not had time to come to fruition. Let's reevaluate his claims about scams and markets in a year. I'd bet my net worth that Perplexity and similar wrapper products are acquired out of existence in <16 months.

  • latexr 11 hours ago

    > All verdicts are LLM-scored, not human-verified.

    In other words, could be all slop. Or maybe it’s not. Maybe it’s mixed. No one knows.

    • davegoldblatt 11 hours ago

      Fair critique. The methodology doc covers this: both pipelines agree on the high-confidence clusters (security vulnerabilities, bubble predictions) even though they disagree on edge cases. The repo is public specifically so people can spot-check. If you find a claim where the scoring is wrong, I'd genuinely like to know.
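      For anyone who wants to spot-check, here's a rough sketch of comparing the two pipelines' verdicts (field names follow the JSON excerpts elsewhere in this thread; loading the actual repo files is left out):

```python
# Load two pipelines' JSONL verdicts into {claim_id: status} maps, then report
# the agreement rate on shared claims and list the disagreements.
import json

def load_verdicts(jsonl_text):
    """Parse JSONL verdict records into {claim_id: status}."""
    return {
        rec["id"]: rec["status"]
        for rec in (json.loads(line) for line in jsonl_text.splitlines() if line.strip())
    }

def compare(a, b):
    shared = a.keys() & b.keys()
    agree = {cid for cid in shared if a[cid] == b[cid]}
    return len(agree) / len(shared), sorted(shared - agree)

# Inline toy data standing in for the two pipelines' output files.
claude = load_verdicts('{"id": "claim_0081", "status": "supported"}\n'
                       '{"id": "claim_0083", "status": "supported"}')
gpt = load_verdicts('{"id": "claim_0081", "status": "supported"}\n'
                    '{"id": "claim_0083", "status": "contested"}')
rate, disagreements = compare(claude, gpt)
```

      The disagreement list is where human review would be most valuable.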

      • latexr 10 hours ago

        > If you find a claim where the scoring is wrong, I'd genuinely like to know.

        So you’re asking me to do the work you should have done in the first place? If you didn’t put any effort into it, why should I waste my time checking your non-work and correcting it to your credit?

        If you had actually put in the effort then sure, I’d be amenable to helping making this the best it can be. But you didn’t, so what’s the point? Why should anyone spend their time fixing other people’s slop?

    • downboots 10 hours ago

      I am curious whether claims are scored more accurately by LLMs when reviewed and edited by LLMs prior to posting the claim.

  • dakolli 11 hours ago

    In 2026, I feel like a painter in 2022: screamed at nonstop by people, largely behaving like gambling addicts, telling me my craft is soon to be dead and that this is the future, just like the NFT people did.

  • rvz 10 hours ago

    That's if you trust and believe that the LLMs themselves are 'correctly' scoring.

    If I were Gary Marcus, I wouldn't immediately agree with an assessment made by these LLMs, as that could contradict the very claims he made; it would mean falling into the trap of trusting them. I'd remain as skeptical as ever...

    ...because this is the worst of the red flags that ultimately supports Gary's argument that the LLM results may be untrustworthy:

    > All verdicts are LLM-scored, not human-verified.

    People should check for themselves and draw their own conclusions.

    > The crash hasn't come.

    yet.

  • nurettin 10 hours ago

    You need to be a special kind of troll to use LLMs to respond to someone whose entire online persona is built around "AI bubble".

  • hdgx63 10 hours ago

    Now do the Pentagon. Gary Marcus is uninteresting because he has no power over anything.

  • akssassin907 7 hours ago

    [flagged]

  • whattheheckheck 10 hours ago

    Now do this for every single person with actual power

  • bionhoward 11 hours ago

    Surprisingly accurate! Is Gary the AI equivalent of the "nothing ever happens" guy?