The LLM Lobotomy?

(learn.microsoft.com)

136 points | by sgt3v a day ago ago

61 comments

  • esafak a day ago ago

    This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be re-evaluated periodically to guard against this.

    Providers should keep timestamped models fixed, and assign modified versions a new timestamp, and price, if they want. The model with the "latest" tag could change over time, like a Docker image. Then we can make an informed decision over which version to use. Companies want to cost optimize their cake and eat it too.
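
    To sketch what I mean (using the OpenAI Python SDK as an example; the snapshot name is only illustrative of the dated-tag pattern):

        from openai import OpenAI

        client = OpenAI()  # expects OPENAI_API_KEY in the environment

        # "gpt-4o" is an alias the provider can repoint over time, like a Docker
        # ":latest" tag; "gpt-4o-2024-08-06" is a dated snapshot that should stay fixed.
        resp = client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": "Say hello."}],
            temperature=0,
        )
        print(resp.choices[0].message.content)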

    edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.

    • icyfox a day ago ago

      I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of outputs and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).

      > "Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API."

      - [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

      The problem here is that most issues stem from broader infrastructure issues like numerical instability at inference time. Since this affects their whole service pipeline, the logic here can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each of their point releases, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so difficult as to be effectively impossible.

      https://www.anthropic.com/engineering/a-postmortem-of-three-... https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

      • xg15 a day ago ago

        Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, but not to a continuous slow decline like the OP described.

        I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day over several months, and then the model suddenly started making mistakes, ignoring parts of the prompt, etc. when a newer model was released.

        I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one. It might just be that fewer resources are allocated to the older model once the newer one is available, so the inference parameters change. But some effect around the release of a newer model does seem to be there.

        • icyfox a day ago ago

          I'm responding to the parent comment who's suggesting we version control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issues, but there can be other bugs in the stack separate from them intentionally changing the weights or switching to a quantized model.

          As for the original forum post:

          - Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)

          - OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data

          • xg15 a day ago ago

            Good points. And I also agree we'd have to see the data that OP collected.

            If it indeed did show a slow decline over time and OpenAI did not change the weights, then something does not add up.

      • esafak a day ago ago

        That's a great point. However, for practical purposes I think we can treat the serving pipeline as part and parcel of the model. So it is dishonest of companies to say they haven't changed the model while making cost optimizations that impair the model's effective intelligence.

    • gregsadetsky a day ago ago

      I commented on the forum asking Sarge whether they could share some of their test results.

      If they do, I think that it will add a lot to this conversation. Hope it happens!

      • gregsadetsky 15 hours ago ago

        Update: Sarge responded in the forum and added more information.

        I asked them to share data/dates as much as that’s possible - fingers crossed

    • colordrops a day ago ago

      In addition to quantization, I suspect the additions they make continually to their hidden system prompt for legal, business, and other reasons slowly degrade responses over time as well.

      • jonplackett a day ago ago

        This is quite similar to all the modifications Intel had to make because of Spectre - I bet those system prompts have grown exponentially.

  • briga a day ago ago

    I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e. expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is: just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.

    • vintermann a day ago ago

      They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

      Also, from what I understand from the article, it's not a difficult task but an easily machine checkable one, i.e. whether the output conforms to a specific format.

      • lostmsu a day ago ago

        With T=0 on the same model you should get the same exact output text. If they are not getting it, other environmental factors invalidate the test result.

      • Spivak a day ago ago

        If it was random luck, wouldn't you expect about half the answers to be better? Assuming the OP isn't lying I don't think there's much room for luck when you get all the questions wrong on a T/F test.

    • nothrabannosir a day ago ago

      TFA is about someone running the same test suite with temperature 0 and fixed inputs and fixtures against the same model for months on end.

      What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest.

    • chaos_emergent a day ago ago

      Yes exactly, my theory is that the novelty of a new generation of LLMs tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.

    • zzzeek a day ago ago

      your theory does not hold up for this specific article as they carefully explained they are sending identical inputs into the model each time and observing progressively worse results with other variables unchanged. (though to be fair, others have noted they provided no replication details as to how they arrived at these results.)

    • gtsop a day ago ago

      I see your point, but no, it's getting objectively worse. I have a similar experience from casually using ChatGPT for various use cases: when 5 dropped, I noticed it was very fast but oddly got some details wrong. As time moved on, it became slower and the output deteriorated.

    • yieldcrv a day ago ago

      FTA: “I am glad I have proof of this with the test system”

      I think they have receipts, but did not post them there

      • Aurornis a day ago ago

        A lot of these claims assert that proof exists, but the details are never shared.

        Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

        • yieldcrv a day ago ago

          That's been my experience too

          but I use local models, sometimes the same ones for years now, and the consistency there is noteworthy, while I have doubts about the consistency of the quality I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.

          so perhaps it's just a matter of transparency

          but I think there is continual fine-tuning occurring, alongside filters being added and removed in an opaque way in front of the model

    • colordrops a day ago ago

      Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.

      • Aurornis a day ago ago

        I read the article. No proof was included. Not even a graph of declining results.

        • colordrops 10 hours ago ago

          OK, fair, but not including the data is not the same as the article relying on a subjective "feel".

  • ProjectArcturis a day ago ago

    I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?

    That said, I would also love to see some examples or data, instead of just "it's getting worse".

    • SBArbeit a day ago ago

      I know that OpenAI has made compute deals with other companies, and as time goes on, the share of its inference that runs outside Microsoft Azure data centers will grow, but I doubt that much, if any, of it has moved yet, so that's not a reason for a difference in model performance.

      With that said, Microsoft has a different level of responsibility to provide safety, both to its customers and to its stakeholders, than OpenAI or any other frontier provider. That's not a criticism of OpenAI or Anthropic or anyone else, who I believe are all trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.)

      The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that isn't there when using OpenAI models directly through the OpenAI API. That layer is valuable for the companies who choose to run their inference through Azure and who also want to maximize safety.

      I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone.

      [1]: https://www.microsoft.com/en-us/ai/responsible-ai

    • SubiculumCode a day ago ago

      I don't remember where I saw it, but I recall a claim that Azure-hosted models performed worse than those hosted by OpenAI.

      • transcriptase a day ago ago

        Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.

      • bongodongobob a day ago ago

        They most definitely do. They have been lobotomized in some way to be ultra corporate-friendly. I can only use their M365 Copilot at work and it's absolute dogshit at writing anything more than maybe 100 lines of code. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts.

        • sebazzz 16 hours ago ago

          I agree. I asked it for some help refactoring a database and some of the SQL was quite broken. It also doesn't help that their streaming code is buggy, so LLM responses sometimes render broken in the browser (both Firefox and Edge, so it's not a browser issue), and you need to refresh after a response to make sure the garbled output was a rendering problem and not a drunk LLM.

  • juliangoldsmith a day ago ago

    I've been using Azure AI Foundry for an ongoing project, and have been extremely dissatisfied.

    The first issue I ran into was with them not supporting LLaMA for tool calls. Microsoft stated in February that they were working on it [0] and that they were closing the ticket only because they were tracking it internally. I'm not sure why, in over six months, they've been unable to do what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models.

    There are also consistent performance issues, even on small models, as mentioned elsewhere. This is with a request rate on the order of one per minute. You can solve that with provisioned throughput units, but the cheapest option is one of the GPT models, at a minimum of $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember there being any other non-OpenAI models with a provisioned option.

    Given that current usage without provisioning is approximately in the single dollars per month, I have some doubts as to whether we'd be getting our money's worth having to provision capacity.

  • cush a day ago ago

    Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?

    • criemen a day ago ago

      Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (not defined precisely) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult.

      • visarga a day ago ago

        It's because batch size is dynamic, so a different batch size can change the output even at temp 0.

        • criemen 11 hours ago ago

          Batch size is dynamic; in MoE models the experts chosen apparently depend on the batch (not only on your single inference request, which sounds weird to me, but I'm just an end user); no one has audited the inference pipeline for floating-point nondeterminism; and I'm not even sure that temperature 0 implies deterministic sampling (softmax with temperature divides by the temperature, so 0 is not a valid value anyway and needs special handling).
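
          For illustration, a toy sketch (not from the thread) of how samplers typically deal with that: the temperature divides the logits inside the softmax, and T = 0 is special-cased as greedy argmax rather than plugged into the formula.

              import numpy as np

              def sample(logits, temperature, rng):
                  # Softmax with temperature: divide the logits by T before normalizing.
                  if temperature == 0:
                      return int(np.argmax(logits))        # greedy: T = 0 is special-cased
                  scaled = np.asarray(logits, dtype=np.float64) / temperature
                  scaled -= scaled.max()                   # for numerical stability
                  probs = np.exp(scaled) / np.exp(scaled).sum()
                  return int(rng.choice(len(logits), p=probs))

              rng = np.random.default_rng(0)
              logits = [2.0, 1.9, 0.1]
              print(sample(logits, 0, rng))     # always token 0
              print(sample(logits, 1.0, rng))   # tokens 0 and 1 are both likely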

    • jonplackett a day ago ago

      It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

      I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

    • Spivak a day ago ago

      I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes.

      That being said, we also keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that earlier versions could. I don't blame Anthropic; it would be crazy to assume that new models are a strict improvement across all tasks and domains.

      I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.

    • fortyseven a day ago ago

      I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?

  • mmh0000 a day ago ago

    I've noticed this with Claude Code recently. A few weeks ago, Claude was "amazing" in that I could feed it some context and a specification, and it could generate mostly correct code and refine it in a few prompts.

    Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of.

    The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens.

  • cjtrowbridge a day ago ago

    This brings up a point many will not be aware of. If you know the random seed, the prompt, and the hash of the model's binary file, the output is completely deterministic. You can use this information to check whether they are in fact swapping your requests out to cheaper models than what you're paying for. This level of auditability is a strong argument for using open-source, commodified models, because you can easily check if the vendor is ripping you off.
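
    A rough sketch of that audit, assuming a local open-weights model through Hugging Face transformers, with greedy decoding standing in for a fixed seed (the model name is just an example, and as the replies note, bit-exactness across hardware and batching is not guaranteed):

        import hashlib
        from transformers import AutoModelForCausalLM, AutoTokenizer

        MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative choice

        tok = AutoTokenizer.from_pretrained(MODEL_ID)
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

        def fingerprint(prompt: str) -> str:
            # Greedy decoding on a fixed prompt; hash the text for easy comparison.
            inputs = tok(prompt, return_tensors="pt")
            out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
            text = tok.decode(out[0], skip_special_tokens=True)
            return hashlib.sha256(text.encode()).hexdigest()

        # Run this periodically: a changed digest means something in the stack changed.
        print(fingerprint("List the first five prime numbers."))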

    • TZubiri a day ago ago

      Pretty sure this is wrong: requests are batched and batch size can affect the output; also, GPUs are highly parallel, so there can be many race conditions.

      • TeMPOraL 21 hours ago ago

        Yup. Floating point math turns race conditions into numerical errors, reintroducing non-determinism regardless of inputs used.

  • gwynforthewyn a day ago ago

    What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialog about or understanding of LLMs, though, it reads to _me_ like this question just reinforces a notion that lots of people already agree with.

    What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.

    • bn-l a day ago ago

      Maybe to prompt more anecdotes on how GPT-$ is the money-making GPT, where they gut quality and hold prices steady to reduce losses?

      I can tell you that what the post describes is exactly what I've seen too: degraded output, and excruciatingly slow responses.

  • jug a day ago ago

    At least on OpenRouter, you can often verify what quant a provider is using for a particular model.

  • romperstomper a day ago ago

    Could it be the result of caching of some sort? I suppose in the case of LLMs they can't use a direct cache, but maybe they could group prompts using embeddings and return a common cached result? (This is just a theory.)
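
    Something like this toy sketch of the idea (purely hypothetical; embed() here is a stand-in for whatever embedding model a provider might use):

        import numpy as np

        def embed(text: str) -> np.ndarray:
            # Placeholder: a real system would call an embedding model here.
            rng = np.random.default_rng(abs(hash(text)) % (2**32))
            v = rng.standard_normal(64)
            return v / np.linalg.norm(v)

        cache = []  # list of (prompt embedding, cached answer)

        def lookup(prompt: str, threshold: float = 0.95):
            q = embed(prompt)
            for vec, answer in cache:
                if float(q @ vec) >= threshold:   # cosine similarity (unit vectors)
                    return answer                 # serve a "close enough" cached reply
            return None

        def store(prompt: str, answer: str) -> None:
            cache.append((embed(prompt), answer))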

  • bigchillin a day ago ago

    This is why we have open source. I noticed this with Cursor; it's not just an Azure problem.

  • SirensOfTitan a day ago ago

    I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from Gemini 2.5 Pro 3-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too.

    I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.

    • bn-l a day ago ago

      You can be clever with language also. You can say “we never intentionally degrade model performance” and then claim you had no idea a quant would make perf worse because it was meant to make it better (faster).

  • ukFxqnLa2sBSBf6 a day ago ago

    It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.

  • mehdibl a day ago ago

    Since when did LLMs become deterministic?

    • thomasmg a day ago ago

      LLMs are just software + data and can be made deterministic, in the same way a pseudo-random number generator can be made deterministic by using the same seed. For an LLM, you typically set temperature to 0 or fix the random seed, run it on the same hardware (or an emulation), and otherwise ensure the (floating point) calculations produce exactly the same results. I think that's it. In reality, it's not that easy, but it's possible.

      • mr_toad a day ago ago

        Unfortunately, because floating point addition isn't associative and GPUs don't always perform calculations in the same order, you won't always get the same result even with a temperature of zero.
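
        A tiny illustration (plain NumPy, nothing GPU-specific): the same numbers summed in a different order, or in different chunk sizes, can give different floating point results.

            import numpy as np

            x, y, z = np.float32(1e8), np.float32(-1e8), np.float32(0.5)
            print((x + y) + z)   # 0.5
            print(x + (y + z))   # 0.0 -- the 0.5 is absorbed when added to -1e8 first

            vals = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
            print(vals.sum())                                 # one reduction order
            print(vals.reshape(100, 1000).sum(axis=1).sum())  # chunked order, usually differs in the last bits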

  • ant6n a day ago ago

    I used to think running your own local model was silly because it's slow and expensive, but the nerfing of ChatGPT and Gemini is so aggressive that it's starting to make a lot more sense. I want the smartest model, and I don't want to second-guess some potentially quantized black box.

  • zzzeek a day ago ago

    I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.

  • bbminner a day ago ago

    Am I the only person who can sense the exact moment an LLM-written response kicked in? :) "sharing some of the test results/numbers you have would truly help cement this case!" - c'mon :)

    • gregsadetsky a day ago ago

      I actually 100% wrote that comment myself haha!! See https://news.ycombinator.com/item?id=45316437

      I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English)

      ((this comment was also written without AI!!)) :-)

      • bbminner 9 hours ago ago

        Oh, my honest apologies then, Greg! :) I am not a native speaker myself. And as far as I can tell, the phrasing is absolutely grammatically correct, but there's some quality to it that registers as LLM-speak to me.

        I wonder what the causal graph looks like here: do people (especially those working with LLMs a lot) lean towards LLM-speak over time, or did both LLMs and native speakers pick up this very particular sentence structure from a common source (e.g. a large corpus of French-English translations in the same style)?

        • gregsadetsky 8 hours ago ago

          No apologies needed, but thanks for your kind words! I think that we’re all understandably “on edge” considering that so much content is now LLM-generated, and it’s hard to know what’s real and what isn’t.

          I’ve been removing hyphens and bullet points from my own writing just to appear even less LLM-like! :)

          Great stylistic chicken and egg question! French definitely tends to use certain (I’m struggling to not say “fancier”) words even in informal contexts.

          I personally value using over-the-top ornate expressions in French: they both sound distinguished and a bit ridiculous, so I get to both ironically enjoy them and feel detached from them… but none of that really translates to casual English. :)

          Cheers