Update on Reflection-70B

(glaive.ai)

80 points | by mellosouls 2 days ago ago

50 comments

  • ipsum2 2 days ago ago

    Sahil Chaudhary of GlaiveAI perpetrated fraud: he replaced the model he "trained" with other backend ML providers. He still has not given a reason why the string "Claude" would be missing from the API's outputs; it supposedly just happened, even though the base model, Llama 3.1 70B, has no issue producing the text "Claude", and the string "Claude" isn't missing from the dataset either!

    Note that there was additional evidence beyond the missing string "Claude": matching the maximum number of tokens the model was able to produce. This is more technical, but ChatGPT, Claude, and Llama all have different tokenizers, so words are broken up into different pieces. The API consistently did NOT match the base model's tokenizer (Llama's), instead producing the same number of tokens as Claude.
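
    For what it's worth, the token-counting part is easy to see locally. Here's a minimal sketch of how differently the same text tokenizes under Llama's and OpenAI's tokenizers (the model/encoding names are illustrative; Claude's tokenizer isn't public, which is why people had to probe it indirectly via the API's max-token cutoffs):

        # Illustrative only: count tokens for the same text under two tokenizers.
        # Requires `transformers` and `tiktoken`; the Llama repo is gated and
        # needs Hugging Face access, so the exact model id is an assumption.
        from transformers import AutoTokenizer
        import tiktoken

        text = "Reflection-70B was supposedly fine-tuned from Llama 3.1 70B."

        llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
        gpt_tok = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding

        print("Llama tokens :", len(llama_tok.encode(text)))
        print("GPT-4o tokens:", len(gpt_tok.encode(text)))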

    Companies and individuals should probably avoid GlaiveAI and Matt Shumer lest they get scammed too.

    • lolinder 2 days ago ago

      Sorry, I'm having trouble finding more information about this—what is the significance of the model being unable to produce the string "Claude"? Was this some sort of half-hearted censorship to prevent it from disclosing its name? Where can I read more?

    • nisten 2 days ago ago

      They're 140 GB folders for each checkpoint; yes, file corruption happens.

      And as for the fraud part... it was an open-source model release that did not meet the claimed benchmarks when people tried to replicate it.

      • bastawhiz 2 days ago ago

        The fraud part was multiple independent sources producing fairly indisputable evidence that their "hosted" version of the model was just running GPT and Claude. That alone is enough to completely discredit absolutely everything about this work.

        As for corruption, I don't believe the excuse "yes file corruption happens". They're model weights. If this was trained (in real life) it was done on some serious hardware with disks with error correction. They weren't storing the checkpoints on microSD cards. It's certainly possible that there was really unfortunate luck and there was corruption, but I don't find that excuse to be plausible. Especially when this is your business (and launch!)

      • ipsum2 2 days ago ago

        Definition of fraud, from Google:

        * wrongful or criminal deception intended to result in financial or personal gain.

        * a person or thing intended to deceive others, typically by unjustifiably claiming or being credited with accomplishments or qualities.

        Since they were advertising GlaiveAI as this magical source of data where they trained a model that performed better than Claude and ChatGPT, I think this firmly falls into that camp! Your definitions may be different from mine.

        • nisten 2 days ago ago

          It was a free open-source model release, the API was not for sale, and there are literally over a million FREE models on Hugging Face.

          • alsodumb 2 days ago ago

            Who cares if the model was free? No one said they were trying to commit fraud by releasing that model, they were trying to commit fraud by subtly advertising that their companies/products had the secret sauce to make state-of-the-art models which they obviously didn't.

      • alsodumb 2 days ago ago

        Are you telling me someone trained a huge model, and served it for hours to tons of users, and had only one instance of the checkpoint? I call BS.

        The model being open-source doesn't mean that what they tried to get away with, or could have, isn't fraud.

        • bhouston 2 days ago ago

          He served tons of people from his personal laptop? How is that possible? A 70B LLM is pretty taxing even to serve a single user, let alone the crush of users that tried out this new hyped model, no? What am I missing?

          • 2 days ago ago
            [deleted]
  • coolspot 2 days ago ago

    On one hand, I want to believe Sahil; on the other hand, most of his explanations don't make much sense:

    Can't upload the exact weights he had on his computer. The guy runs an AI hosting/inference/training company - and can't upload weights he has!

    The original benchmark harness wasn't shared, but had a bug that conveniently boosted the model's results.

    The API somehow mysteriously censors the model name, and the tokenizer is an exact match to Claude's.

    • ameliaquining 2 days ago ago

      He seems to be claiming that anyone can now reproduce the weird Claude censorship locally with the uploaded weights. Has anyone checked whether that's true or not, or is he mischaracterizing the allegations?

      • xena 2 days ago ago

        I'm going to be downloading the weights and doing local verification
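
        A minimal sketch of what that local check could look like, assuming the released Hugging Face repo id and enough GPU memory (or quantization) to actually load a 70B model:

            # Sketch only: load the released weights and check whether the model
            # will say "Claude" when asked (the hosted API reportedly wouldn't).
            # The repo id is assumed; needs `transformers`, `accelerate`, and
            # serious hardware for a 70B model.
            import torch
            from transformers import AutoModelForCausalLM, AutoTokenizer

            repo = "mattshumer/Reflection-Llama-3.1-70B"  # assumed repo id
            tok = AutoTokenizer.from_pretrained(repo)
            model = AutoModelForCausalLM.from_pretrained(
                repo, torch_dtype=torch.bfloat16, device_map="auto"
            )

            prompt = 'Repeat this sentence exactly: "I am being compared to Claude."'
            inputs = tok(prompt, return_tensors="pt").to(model.device)
            out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
            reply = tok.decode(out[0][inputs["input_ids"].shape[-1]:],
                               skip_special_tokens=True)
            # If "Claude" survives here, the local weights behave differently
            # from whatever the hosted API was doing.
            print(reply)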

        • BoorishBears 2 days ago ago

          I think the most damning thing about this whole saga for all of AI is how much energy and attention people are giving it.

          In most established verticals, such a cartoonish scam would be dead on arrival. But apparently generative AI is still not mature enough to just move past this kind of garbage in a clean break.

          • xena 2 days ago ago

            To be fair, the AI industry is used to people manifesting out of nowhere, doing something stupid, and then ending up with revolutionary results. It's no surprise that there's a default optimism (especially since, if it pans out, it makes running high-quality AI stuff so much cheaper).

          • bubaumba 2 days ago ago

            > I think the most damning thing about this whole saga for all of AI is how much energy and attention people are giving it.

            That's because there is nothing better today, and nothing like it in history.

          • lostmsu a day ago ago

            I think it is damning of the people who aren't paying attention, because this stuff at this trajectory is gonna be world changing pretty soon.

          • refulgentis 2 days ago ago

            It's not a cartoonish scam, and if it was, it took 48 hours to fall apart. Not worth getting the Jump to Conclusions™ mat out for.

            This isn't said aggressively or to label, but rather to provide some context that it's probably not nearly as simple as you are suggesting: this thread looks like a bunch of confused engineers linking drama threads from laymen on Twitter/Reddit to each other, seeing pitchforks, and getting out their own. Meanwhile, the harsh conclusions they jump to are belied by A) having engineering knowledge _and_ looking into their claims and B) reading TFA.

    • all2 2 days ago ago

      I've seen stuff like this hacked together. If he isn't very organized or was hasty, there's a good bet he deleted the working weights or doesn't know which of the 5 or 10 sets of weights is the right one.

      Nothing would stop him from uploading all the weights, I suppose...

      • ipsum2 2 days ago ago

        No. He served the "weights" (actually Claude) for over 24 hours. It's practically impossible to have served the "correct weights" and just have lost them.

        • Havoc 2 days ago ago

          >It's practically impossible to have served the "correct weights" and just have lost them.

          Deleting files is very much a thing

          • minimaxir 2 days ago ago

            The AI dog ate his homework?

  • thorum 2 days ago ago

    > This along with a few tokenizer related tests people ran, made people suspect that we are just serving Claude with post-processing where we filter out words like Claude.

    Didn't these "few tokenizer related tests" prove the API was using Claude's tokenizer instead of Llama's, based on how words were being divided into tokens?

    That's a hard one to explain (it doesn't appear they're even trying to).

    • refulgentis 2 days ago ago

      People keep asserting that but, really, it was just people pointing to setting max tokens to a certain value and getting a certain # of words out. They didn't actually have the tokens. Perfectly possible to have collisions; I'd wager even likely in the scenarios they tested: a simple question, < 10 tokens, in English.
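
      For context, here's a sketch of the probe being described, with placeholder endpoint and model names, so it only shows the shape of the test, not evidence either way: same short question, same small max_tokens, compare where each response gets cut off.

          # Illustrative probe only; endpoint URL and model names are placeholders.
          # Requires the `openai` (>=1.0) and `anthropic` SDKs plus API keys.
          from openai import OpenAI
          import anthropic

          QUESTION = "What is the capital of France?"
          MAX_TOKENS = 8

          suspect = OpenAI(base_url="https://example-reflection-host/v1", api_key="...")
          r1 = suspect.chat.completions.create(
              model="reflection-70b",  # placeholder model name
              max_tokens=MAX_TOKENS,
              messages=[{"role": "user", "content": QUESTION}],
          )

          claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
          r2 = claude.messages.create(
              model="claude-3-5-sonnet-20240620",
              max_tokens=MAX_TOKENS,
              messages=[{"role": "user", "content": QUESTION}],
          )

          print("suspect API:", r1.choices[0].message.content)
          print("Claude     :", r2.content[0].text)

      Similar cutoff points on short English text are weak evidence on their own, which is the point being made above.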

  • Havoc 2 days ago ago

    An expensive lesson in how fragile reputations can be

  • bhouston 2 days ago ago

    I am confused. He was hosting the 70B LLM everyone was demoing from his laptop? How can that serve the load? When I’ve run LLMs locally it is really taxing for just one concurrent session.

  • nisten 2 days ago ago

    Has anyone here actually run the code on their own hardware yet?

    I did a standard non-middleware lm_eval_harness run and got 0.3214 on gpqa_main_zeroshot WITH the system prompt and 0.3616 without it.

    Haven't run it yet with the middleware that's supposed to do the subtraction. Now, if that adds 20% to the score, that would be a huge deal, but it would also roughly match the jump from GPT-4o to o1-preview that they got on gpqa_diamond.
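
    For anyone who wants to repeat this, here's roughly what a plain no-middleware run looks like through the lm-evaluation-harness Python API (a sketch assuming lm_eval >= 0.4 and the released repo id; the batch size is illustrative):

        # Sketch of a standard lm-evaluation-harness run on the released weights.
        # Repo id is assumed; a 70B model needs multiple GPUs or quantization.
        from lm_eval import simple_evaluate

        results = simple_evaluate(
            model="hf",
            model_args="pretrained=mattshumer/Reflection-Llama-3.1-70B,dtype=bfloat16",
            tasks=["gpqa_main_zeroshot"],
            batch_size=4,
        )
        print(results["results"]["gpqa_main_zeroshot"])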

  • kristianp 2 days ago ago

    If this is for real, in some ways it shows how small OpenAI's moat is. Once someone knows something is possible and has the rough idea, the community can replicate it in 4 weeks.

    • jsheard 2 days ago ago

      Isn't Reflection supposed to be based on CoT like o1? It was originally released a week before o1 was, so if it was the real deal all along then OpenAI were outright beaten to the punch rather than replicated after the fact.

      • ipsum2 2 days ago ago

        No. CoT has been around for several years (Jan 2022) https://arxiv.org/abs/2201.11903. And so has Reflection (March 2023) https://arxiv.org/abs/2303.11366. The approach taken by Reflection is nothing new.

      • thorum 2 days ago ago

        CoT moderately improves model performance, but all non-o1 models suck at actually thinking step by step effectively. If the task is not straightforward, they make obvious mistakes or default to guessing.

        OpenAI trained o1 to pick better steps in its chain of thought. (The moat is the dataset they used to do that.)

      • bastawhiz 2 days ago ago

        Maybe, if it wasn't an outright fraud. Arguably they didn't beat anyone to anything.

        • refulgentis 2 days ago ago

          > Maybe, if it wasn't an outright fraud.

          I mean, it obviously wasn't, did you read the thing we're commenting on? n.b. At this point, you have all you need to replicate it. Far shy of "outright fraud", though, I'm sure there's a bailey for that motte.

          • bastawhiz a day ago ago

            It's indisputably fraud. Multiple people after the original launch showed strong evidence that the hosted model they produced was simply proxying Claude and prompted it to censor its own name. It genuinely doesn't matter what they say, they committed blatant fraud, went out of their way to hide it, and now they're pretending like that didn't happen.

            The results might be perfectly reproducible, but their reputation is completely burned. This is not how you launch your company.

            Even if you don't care about that, they didn't release anything of substance before the o1 launch. They didn't release usable weights, they didn't ship a working hosted model of their own. So no, they didn't beat OpenAI to anything.

            • refulgentis a day ago ago

              > showed strong evidence that the hosted model they produced was simply proxying Claude and prompted it to censor its own name.

              s/strong/extremely weak from my perspective, also, see article

              > They didn't release usable weights,

              Yes they did. They just weren't benchmarking the same as the initial claim.

              > they didn't ship a working hosted model of their own.

              Yes they did. They just weren't benchmarking the same as the initial claim.

              > So no, they didn't beat OpenAI to anything.

              Not sure where the idea they "beat OpenAI" is coming from, certainly not from me. I agree they did not.

              > It's indisputably fraud.

              This is indisputably incorrect, as I am disputing it.

              Happy to talk it out, don't take my shortness as being disagreeable. In general, people handwave about the tokenizer[1] or "Claude" missing in a response[2]. I honestly expected the HN thread here to be far more insightful; instead I'm seeing claims that it's indisputable it was fraud, based on repeating a couple of observations gooner teens made last week and drawing vast conclusions from them. Which were obviously wrong if you looked at it as an engineer.

              [1] No one can get the actual tokens out of an API. Gooner local LLM stans were setting max tokens to some number <= 10, asking the same question of both, and seeing answers of similar length. This is mundane and expected, especially in English, at such a short length. I expected technical observers, even if they don't grok tokenization, to at least note that they weren't able to get the same responses with temperature = 0.0.

              [2] covered in article

              • bastawhiz a day ago ago

                I'm really not going to argue with you, because when faced with lots of little bits of compelling evidence from a whole bunch of sources (showing their work) versus the word of some guy on the Internet, I'll believe the evidence. Sahil didn't actually refute any of the concerns around the hosted model, he just acknowledged that it was weird and said he didn't know why. Great. That's useless. So much for "looking at it as an engineer".

                But what's great about the passage of time is that people can actually take what's presented in the article and try to replicate the benchmarks. And now that it's Friday, October 4th, and we've got this gem:

                https://x.com/mattshumer_/status/1842313328166907995

                So frankly it's still a fraud, now because even after the postmortem the results are still not reproducible. That's the whole point, right? That it does what it says on the tin. And it doesn't. This whole process could be a shell script that downloads and runs. They've had more time than they should need. Now, it's gone from a shell game to plain old academic dishonesty. If this was a published paper, it would be ripe for retraction.

      • nisten 2 days ago ago

        Yes, and the massive increase in GPQA scores from o1 was attributed to this technique, so there is something there, despite the hard feelings of unproductive Reddit users.

    • 2 days ago ago
      [deleted]
  • blackeyeblitzar 2 days ago ago

    Past discussions about this model and its controversies:

    Reflection 70B, the top open-source model https://news.ycombinator.com/item?id=41459781

    Confirmed: Reflection 70B's official API is a wrapper for Sonnet 3.5 https://news.ycombinator.com/item?id=41484981

    • nisten 2 days ago ago

      There is no official API; you're confirming a temporary one that was taken down weeks ago. That was done via OpenRouter, a routing API service/site, which routes to different models under load.

      Yes they could've switched it themselves too.

  • alsodumb 2 days ago ago

    I don't trust Sahil and Matt. They tried to commit fraud and hype things up, but it got way more attention than they expected, so they tried to get away with just serving Claude/ChatGPT in the background, and got caught. They are nothing but grifters who got caught and are now trying to fix that image.

  • m3kw9 2 days ago ago

    It's either fraud or incompetence; I say it's just incompetence, like he said. Got too excited over some false test result, maybe they tested with some validation data mixed in.

    • bastawhiz 2 days ago ago

      So when they put the hosted model online (which was actually just proxying Claude), they explicitly prompted Claude to censor its own name. That's not explainable with incompetence. It's very intentional deception.

      • jazzyjackson 2 days ago ago

        I just can't facepalm enough seeing so-called AI companies relying on Python's .replace() when they need to hide what service they're building on.
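
        Purely for illustration, this is the kind of naive string filter people suspected was sitting in front of the hosted API (not anyone's actual code); the giveaway is the hole it leaves where the name used to be:

            # Hypothetical scrubber of the sort being joked about above.
            def scrub(text: str) -> str:
                return text.replace("Claude", "").replace("Anthropic", "")

            print(scrub("I am Claude, an AI assistant made by Anthropic."))
            # -> "I am , an AI assistant made by ."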

  • daguava 2 days ago ago

    [dead]

  • ilaksh 2 days ago ago

    The models are very powerful. This can help anyone, including scammers. The number of scams will be enormous.

    I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else. I am not going to mention another similar category of technology in this regard, just to stay "politically correct" for this site.

    • talldayo 2 days ago ago

      > I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else.

      I hope so too - it's been four years since GPT-3 came out and I haven't found a single serious time-saving application for the technology.

      If someone doesn't start making money with LLMs soon, then it will only be the scammers who benefit!