GLM 5.2 vs. Opus

(techstackups.com)

386 points | by ritzaco 11 hours ago

271 comments

  • cultofmetatron 10 hours ago

    I seriously don't know all this big hullabaloo about one shot prompting.

    By definition, a single prompt won't constitute the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.

    I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

    I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.

    I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.

    These are way more valuable metrics than "hey build X"

    • post-it 5 hours ago

      The streetlight effect:

      > A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is"

      All of your suggestions are better but they're hard, so someone casually evaluating an AI isn't going to do them.

      • sanderjd 5 hours ago

        Sure, for casual evaluation, I agree. But are there serious analyses that are evaluating this kind of thing? I mean, these are the kinds of things I evaluate in my own work when a new model comes out, or when I'm evaluating a harness. But this is all very ad hoc and intuitional. I'd love to start bringing rigor to it, but I haven't found much prior art on this. In another thread someone said that's because it's probably impossible to do this rigorously because too much of it is subjective. And that does match my intuition. But I continue to suspect that intuition is wrong.

        • jerf 4 hours ago

          It's hard to bring much rigor to it. I'm not saying impossible, but it's not like it's completely obvious how to do it and people are just too lazy. Intrinsically, if I'm going to test a back-and-forth with a model I have a human in the loop making frequent decisions. Did the model fail or succeed at whatever rate it did that because of the model or the human? Did the testing protocols capture the actual problem, e.g., maybe if the model was given some particular bit of information that a normal human would have given it it would have done much better or worse, but the testing protocol in the interests of "rigor" excluded the human in the loop from doing it. Is the human going to be willing to sit down and do the same task 25 times, refreshing the model from scratch each time for a "valid" test? Can you get the same human to analyze every model in the test? Is their 10th pass of the problem an invalid test because you can't as easily erase the human's knowledge of the previous 9 tests? What do you do with a model that succeeds wildly 75% of the time and spins off into a loop the other 25%? Is that loop real or, again, did your "rigorous" testing protocol prevent the human from saving the model from the loop like any developer would?

          And so on and so forth. Again, I'm not saying this is impossible but I am saying that if you tried to do it, and you got the money, and you built the test, and got the human subjects clearance, and you ignored that during the process of all that at least one more frontier model would come out, you can count on HN anklebiting your "rigorous" study even so, and probably being correct about a lot of the issues it could have because it would take several iterations of this to build a reasonable protocol... at which point it would quite possibly also be obsoleted by progress again.

        • abhgh 4 hours ago

          You usually see this kind of analysis in conference papers, esp. if they have a datasets track. The NeurIPS Datasets & Benchmarks (D&B) track is a good example. But you will have to monitor the proceedings yourself closely - there is little chance of being accidentally exposed to them, because most blogs, announcements and popular media only mention a handful of the popular ones, e.g., Tau^2. For example, across the years 2022, 2023 and 2024, 900+ papers were accepted in the D&B track [1] - of course, not all of them are LLM-related. I find them interesting because they often focus on specific system behaviors, and like you said, study them scientifically, so you can draw authoritative conclusions (or at least know specifically what part of a model's behavior you now know about, and what parts you don't).

          [1] https://blog.neurips.cc/2025/09/30/reflecting-on-the-2025-re...

        • rileyphone 2 hours ago

          DeepSWE is closer to that

          https://deepswe.datacurve.ai/

      • echelon 4 hours ago

        The minute an open model breaks through and beats Claude Opus/Fable, it's over.

        There are far more opportunities that can be served when the world's intellectuals have the raw weights and can fine tune, splice, distill, and reapply.

        Imagine having raw unfettered access to Fable. It can be refit to structural biology. It can be fine tuned on the repo for smaller context requirements. It can be run cheaper and air gapped.

        The world wants this.

        • digitaltrees an hour ago

          I don’t think we need them. I think the models we have are good enough. It’s the orchestration layer that makes the biggest difference at this point. The open source models we have are capable of calling tools and the work is getting them to be capable enough to know which tools to call and what to do in response.

          I think we are leaving the mainframe era of AI and entering the PC era already. If there wasn’t a RAM shortage and we all had 2TB of RAM and GPUs we would all have large local models or personal APIs serving our teams.

          That’s why all the labs are moving to the App layer and moving away from being the API for intelligence like they were originally.

          • wahnfrieden an hour ago

            They are absolutely not good enough

        • barrenko 4 hours ago

          As crazy as this sounds, and as much as I don't want to believe it myself, I think we're still underestimating LLMs, and we're gonna get to that point pretty soon.

        • jupr 4 hours ago

          The world does want this. Opus capabilities, in a box, securely tunneled to my family and me, using the resources I already have available: energy + network.

      • newaccountman2 5 hours ago

        Feels like a rather outdated little parable, since nowadays one would expect the police officer to either arrest or shoot the person.

        • blanched 4 hours ago

          This kind of hamfisted snark tends to make people take the actual and justified criticism of police less seriously.

          • newaccountman2 4 hours ago

            If people were willing to take it seriously in the first place, then they wouldn't view it as "hamfisted snark"

            • blanched 3 hours ago

              I consider myself someone who takes it seriously, and have spent time and resources fighting for change. But it’s wholly unrelated to this particular thread, phenomenon, and story. So having a little “ha ha” moment accomplishes nothing towards the actual cause. It makes people uncomfortable, but not the useful kind of uncomfortable.

              That said, maybe we just disagree on how to drive change, and that’s fine. I’ll leave it.

        • post-it 4 hours ago

          It could be a taxi driver if you like. Or an anarchist passing by on xir way to a protest.

        • layer8 3 hours ago

          …in the US.

    • gertlabs an hour ago

      One-shot performance often translates to the most difficult problems a model will be able to understand. We run an evaluation that tests both agentic and one-shot performance, and we find that Chinese models are almost universally very good at using tools and a harness to iterate towards a better solution, whereas their initial response ranks relatively low.

      Compare that to Gemini models, which have impressive fluid intelligence on the first response, but fail to call tools or explore correctly which limits their usefulness for agentic coding.

      Neither will be great for coding in a computational chemistry repo for different reasons, but the model with strong one-shot performance will be less likely to make subtle errors indicative of poor understanding, so we weight both capabilities into their final score.

      The latest Anthropic and OpenAI models excel in both domains.

      Data at https://gertlabs.com/rankings

    • rdsubhas 7 hours ago

      IMHO, it's not the oneshotting.

      It's the "starting from empty slate" greenfield that's the real problem.

      We used to make fun of Engineers who follow a README on a framework, test it on an empty project, and say "this framework is the best for our 10 year running production app". Greenfield mentality is always the solution to all problems and problem to all solutions.

      One should still measure oneshotting, it's an important self-measurement metric - but against an established, large codebase.

      • keheliya 7 hours ago

        There are upcoming benchmarks aimed at measuring the ability to work with brownfield tasks. (Of course, benchmarks can be gamed, but they are still better than unrealistic toy tasks that earlier generations of benchmarks used. Frontier labs are yet to use them in their tech reports or marketing material, though. :-)

        * SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470
        * SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823

      • bluGill 5 hours ago

        At least they did some analysis. I've seen a couple of AI slop "X is the best tool for the job" recommendations from people who didn't even try it. (Worse, we are already using Qt, which has a tool for the job, and the Qt tool works with the rest of the Qt ecosystem, unlike whatever the AI told them.)

    • hnfong 5 hours ago

      It's a proxy for what you actually want to measure.

      Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives.

      And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context.

      At least everyone knows what a 3D game is.

      • bluGill 5 hours ago

        As someone who works at $corp: there is a massive difference in tickets. I've seen "'The' is not spelled 'teh'", and I've seen "some other service is writing to memory, causing a crash in my service" (the latter took months to track down, since our code was correct and nothing gave a hint of where to look). Both problems are important to fix, but the first is so simple I don't care how good AI is (the hard part is getting it through the process).

    • hintymad 31 minutes ago

      > I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

      In fact, I'd rather see Anthropic publish a convincing project that does this using Claude. The project should be complex enough and novel enough to show the world how reliable and powerful Claude is. That is, Anthropic does not need Amodei or its employees to tell us that whatever percent of engineers will lose their jobs. They can just show us. Easily.

    • ulrikrasmussen 7 hours ago

      I guess the experiment is interesting to determine if a model can produce something subjectively valued as "good" based on fairly vague and open-ended specifications. The benchmark is not to determine if the output fits the input, but whether the output is internally consistent: it's a game, but does it behave as one would expect that any game behaves? Does it end when you reach the goal, do you die when hitting the spikes, are there weird edge cases in behavior when you move around?

      I think however that they should have used the same harness and also repeated the experiment a few times to judge the variance in results.
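
      A rough sketch of what repeating the experiment would buy, assuming each run can be graded as a plain pass or fail (a simplification of the subjective judging discussed here). The function name and the run counts are hypothetical; the formula is the standard Wilson score interval.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to the valid probability range.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical numbers: 4 acceptable games out of 5 repeated runs.
low, high = wilson_interval(4, 5)
print(f"observed pass rate 0.80, plausible range {low:.2f} to {high:.2f}")
```

      With only five runs per model the interval is very wide, so a one-run difference between two models says little.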

    • somenameforme 4 hours ago

      Unless I'm missing something, the prompt he gave must have been fairly detailed because both games are basically identical.

      But for a more practical issue, the ultimate goal of LLMs is to replace software engineers, or at least enable everybody to become a software engineer, to use a more up-beat phrasing that's no less accurate. And so an LLM's ability to reliably construct something from a poorly defined, contradictory, or otherwise flawed prompt, while accurately inferring intent is probably the first finish line.

      • metadat 3 hours ago

        More likely, the models were trained on similar data.

    • digitaltrees an hour ago

      Exactly this. I recently tried Claude Code again to get the subsidy on Fable rather than paying API prices and was so frustrated by how much it pushed autonomous behavior. It would start ignoring my planning documents, ignoring my coding conventions, reimplementing features and code already in the project (not sure it ever makes sense to have two auth systems in parallel or two websocket implementations for the same UI) and then, in the most shocking interaction, just refused to stop working and listen to my instructions. I think maybe it was because there was a subagent doing the work, but it was a complete waste of time and effort.

      I was using Cursor, in large part because I could at least stop it when I need to.

      I ended up building my own IDE from scratch so I can be more in the loop while also having the full agent experience.

    • pu_pe 9 hours ago

      It's true that no one is trying to one shot anything serious right now, but it's still an important metric. Claude Code and Opus really took off when they improved the harnessing enough that it would self-correct many of its mistakes without needing user input. In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

      • bogtog 8 hours ago

        > In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

        Right, model intelligence defines the scope of things they can one shot

        I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before

      • dakolli 9 hours ago

        It won't happen, it's all a money grab.

        • OtomotO 9 hours ago

          I think that LLMs will stay, but I also think we've plateaued and that big companies will fail and fall and we will have another years long "halt" of any real advancements coming to the public.

          Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.

          • thewebguyd 2 hours ago

            > we will have another years long "halt" of any real advancements coming to the public

            One can hope. Probably an unpopular take here but I'm tired boss.

            The software world has a huge backlog of things that can all be done with the tech we currently have, no breakthrough advancements needed, but none of it will get prioritized when we're all forced to run on the new and shiny treadmill. Ever since the LLM hype, it's like the JavaScript culture of a new framework every 10 minutes has infected every other vertical of software development, and I'm exhausted.

    • johnfn 2 hours ago

      I feel like on HN there is an endless cycle:

      - Vibes are too subjective, I want an actual A/B test!

      - An A/B test is too limited, I want a benchmark! (You are here.)

      - Those benchmarks never seem to be reliable, I just go on vibes.

    • canes123456 40 minutes ago

      Isn't a plan file just a single long prompt?

    • InsideOutSanta 4 hours ago

      > I seriously don't know all this big hullabaloo about one shot prompting.

      It's a relatively objective way of testing LLMs, and I think it's pretty representative of how strong models are overall.

      The outcome of this test mirrors how GLM 5.2 and Opus 4.8 work for me: they're both similarly capable of fully executing a given task, but Opus tends to have a bit more "taste" in how it handles unstated details or implicit requirements.

      > what you'll get is a series of assumptions made by the model

      Yes, but that's why we use these models in the first place. We don't want to explicitly write down all the details because that would mean writing code. So we write a higher-level, human-language spec, and let the LLM fill in the blanks. The question is how good they are at doing that.

    • scwoodal 8 hours ago

      > I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

      Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.

      • cultofmetatron 5 hours ago

        Let's say you have a table that is partitioned. How do you lint/format "any select into this table MUST include the partition key in the predicate and any join must include it in the ON clause"? I'm not personally familiar with any static analysis tool that does this, but it's trivial to implement with an LLM prompt, and trivially easy to add to your automated PR reviews.
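
        For comparison, a crude text-level version of that rule might look like the sketch below. The table name `events` and the key `tenant_id` are hypothetical, and the regex approach is an assumption for illustration only: it ignores aliases, subqueries and CTEs, and it does not distinguish the WHERE requirement from the JOIN requirement, which is where a hand-rolled check starts to get hard.

```python
import re

# Hypothetical partitioned table and its partition key.
TABLE = "events"
PARTITION_KEY = "tenant_id"

# Captures the text following each WHERE or ON keyword, up to the next clause.
_CLAUSE = re.compile(
    r"\b(?:where|on)\b(.*?)(?=\b(?:where|on|join|group by|order by|limit)\b|$)"
)


def violates_partition_rule(sql: str) -> bool:
    """True if the query reads TABLE but no WHERE/ON clause mentions PARTITION_KEY.

    Text-level heuristic only; not a SQL parser.
    """
    text = " ".join(sql.lower().split())
    if not re.search(rf"\b(from|join)\s+{TABLE}\b", text):
        return False  # the query does not touch the partitioned table
    clauses = _CLAUSE.findall(text)
    return not any(re.search(rf"\b{PARTITION_KEY}\b", c) for c in clauses)


print(violates_partition_rule("SELECT * FROM events WHERE id = 1"))
print(violates_partition_rule("SELECT * FROM events WHERE tenant_id = 7 AND id = 1"))
```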

      • Youden 8 hours ago

        It's not always possible, or at least trivial. For example how do you enforce "prefer to reuse existing code over making a copy"? Is there a static analysis tool that will detect two pieces of code that do the same thing?
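
        For the narrow case of structural copies (same shape, different names), a check is feasible. A minimal sketch, assuming Python source and a hypothetical `fingerprint` helper built on the standard `ast` module; it would not catch two pieces of code that do the same thing in different ways, which is the hard part of the question.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Rename every identifier to a positional placeholder (v0, v1, ...)."""

    def __init__(self) -> None:
        self.names: dict[str, str] = {}

    def _canon(self, name: str) -> str:
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "f"  # the function's own name is ignored
        self.generic_visit(node)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._canon(node.id)
        return node


def fingerprint(source: str) -> str:
    """Structural fingerprint of a snippet, ignoring identifier names."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed_copy = "def add_all(items):\n    acc = 0\n    for i in items:\n        acc += i\n    return acc\n"
rewritten = "def total(xs):\n    return sum(xs)\n"

# A renamed copy shares the fingerprint; a behaviourally equivalent rewrite does not.
print(fingerprint(original) == fingerprint(renamed_copy))
print(fingerprint(original) == fingerprint(rewritten))
```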

      • Der_Einzige 4 hours ago

        Wrong, custom "specs" i.e. schemas, are literally all we have for "real" guardrails with LLMs.

        https://developers.openai.com/api/docs/guides/structured-out...

        Nothing else operates on the logprobs level and literally bans continuations that fail your schema.
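
        A toy sketch of the mechanism being described, assuming a made-up five-token vocabulary and hand-written logits rather than a real model or the OpenAI API. Real structured-output implementations work from a JSON schema; here the "schema" is a hypothetical hard-coded set of allowed strings, and any token that cannot lead to an allowed string is masked to negative infinity before the next token is picked.

```python
import math

# Hypothetical toy vocabulary and "schema": the output must be one of these strings.
VOCAB = ["yes", "no", "maybe", "!", "[EOS]"]
ALLOWED_OUTPUTS = {"yes", "no", "yes!"}


def allowed_next(prefix: str) -> set[str]:
    """Tokens that keep the output a prefix of (or equal to) an allowed string."""
    ok = set()
    for tok in VOCAB:
        if tok == "[EOS]":
            if prefix in ALLOWED_OUTPUTS:
                ok.add(tok)  # stopping is only legal on a complete output
        elif any(out.startswith(prefix + tok) for out in ALLOWED_OUTPUTS):
            ok.add(tok)
    return ok


def constrained_greedy(logits_per_step: list[dict[str, float]]) -> str:
    """Greedy decoding where schema-violating tokens get a logit of -inf."""
    out = ""
    for logits in logits_per_step:
        legal = allowed_next(out)
        masked = {t: (score if t in legal else -math.inf) for t, score in logits.items()}
        tok = max(masked, key=masked.get)
        if tok == "[EOS]":
            break
        out += tok
    return out


# The unconstrained favourite at step 1 is "maybe", which the schema bans.
steps = [
    {"yes": 1.0, "no": 0.5, "maybe": 3.0, "!": 0.1, "[EOS]": 0.0},
    {"yes": 2.0, "no": 0.2, "maybe": 0.3, "!": 1.5, "[EOS]": 1.0},
    {"yes": 0.0, "no": 0.0, "maybe": 0.0, "!": 0.0, "[EOS]": 5.0},
]
print(constrained_greedy(steps))
```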

        • scwoodal 4 hours ago

          Enforcing structured outputs from LLMs is not the same thing as using linters, formatters, static analysis to control how an agent writes code.

          • Der_Einzige 4 hours ago

            No, it's not. It's strictly better.

            • scwoodal 4 hours ago

              Can you share examples or links to GitHub of this approach? I'd like to learn.

    • losvedir 7 hours ago

      One shotting is useful to test but only with a huge prompt (e.g., build something according to this spec).

      I agree generating millions of tokens from a handful of input tokens doesn't convey anything meaningful to me.

    • NichoPaolucci 8 hours ago

      If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention, we can pretty easily decide how well overall the model does. And, judging better models just means adding more requirements to a task. So, I think it's a useful method (Even if it's not a realistic use case).

      Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)

      • embedding-shape 8 hours ago

        You seem to be missing the point of what parent is saying :)

        To really evaluate how a model is to use in real life, it should have access to tools, and be able to iterate on something, like they do when you use them in an agent harness.

        None of that iteration necessarily needs to have a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture); you can just let the model do a couple of tries and give it input into how it's doing, and you get something closer to how people use these models in reality.

      • locknitpicker 7 hours ago

        > If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention (...)

        This is the wrong metric to target. Today's models can feel one-shot, but only at the expense of resilient ReAct loops that brute-force their way out of the mess the initial prompt created.

        And each iteration is expensive.

        Sometimes failing fast and early is better than going for one-shot models that try to mitigate the mess they created with reasoning steps and ReAct loops.

    • athrowaway3z 9 hours ago

      I think you're underestimating the elegance of "hey build X". It already captures a lot of what you're interested in.

      Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.

      Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.

      Either people would not "get" it quickly enough or would disagree with/not be interested in the setup because it's not how they use LLMs.

    • ACCount37 8 hours ago

      On one hand, that's sort of true for practical uses - and benchmarks notoriously undercount multi-turn settings.

      On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.

    • gchamonlive 4 hours ago

      > a single prompt won't constitute the complexity of a software project.

      The top agent is for steering, but all subagents are mostly oneshot prompts

    • jaapz 9 hours ago

      When the model produces reasonable results from one prompt, you could assume that it will also return reasonable results through the follow up prompts.

    • Revanche1367 9 hours ago

      The argument is flawed, there is no logical reason to assume a single prompt won’t be sufficient to constitute the complexity of a software project. It may not be practical in many cases but there is too much variability in what is considered a complex software project and in the sufficiency of instruction in a single prompt to make that claim and say it’s “by definition.”

      • oblio 9 hours ago

        And that prompt will basically be 2000 page spec Bible à la IBM circa 1960, see waterfall. Unless AI develops mindreading (and advanced mindreading at that), single prompt creation of actual complex software products will never happen. You'll one shot a simple non scientific calculator, but not Excel or Vim or Nginx.

        • trollbridge 7 hours ago

          Why not? Given a proper spec, you should absolutely be able to one-shot Excel, particularly if we put it at the level of complexity of, say, Excel 1.0 for Mac.

          Current models aren't capable of that, but that doesn't mean it's not possible.

          • pyrale 7 hours ago

            The issue is not the models, the issue is that this method was tried before, and humans suck at writing what they want. Developing in small increments allowing feedback was an answer to this issue.

            If you made models able to code to long spec, you would be left with the hard issue of having to write them.

            • pianopatrick 2 hours ago

              An interesting question for me is "can the LLMs predict what humans want?".

              Like if you show the LLM a page, can the LLM review the page and then spit out a review that is close to what a human would say about the page?

            • trollbridge 3 hours ago

              Yes, my current nightmare is I have a very long queue of specs to write and need to work with non technical staff to help them put in words what it is they actually want.

              Software was always that way, though.

          • oblio 5 hours ago

            Seems like this would be a good time to use this famous quote:

            > given the sufficiently smart compiler

            For those unaware, this is a similar quote used by compiler proponents. The first full compiler was created in 1957 (+/- 70 years ago) and the "sufficiently smart compiler" never happened, hand written code from the best coders still is faster. Now, that doesn't mean that compilers didn't do the job well enough, we just accepted that 90-95% of the top speed was enough for almost everything.

            To the LLM one shotting point, it took 30 (40?) years for compilers to be good enough for the mass market. Caveat early adopter and investor.

            Plus what pyrale said.

      • dakolli 9 hours ago

        One shot prompting/tooling is the only reasonable way to use an LLM in my opinion. You should not be having an LLM operating for hours creating thousands of lines of new code that you can never review or maintain. You can actually be highly productive modifying a single file or two at a time, ideally with as focused and as little context as possible, without the LLM being given full permission to add as much context as possible along the way to maximize revenue for the developers of the harness.

        The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too, causing addict-like behavior in people that fall prey to this trend.

        • ffsm8 8 hours ago

          I disagree fundamentally.

          If I do that, I'm literally slower than just doing the change myself, since I first have to specify it sufficiently to the model.

          I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.

          I say this because I've just had to face this decision this month, with Copilot introducing usage-based billing. I attempted to scale back my usage, first with non-Opus models: output essentially became discardable, as it continually hallucinated nonexistent fields in API responses etc. Then by scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.

          • user43928 6 hours ago

            I agree. And at work it has been producing some of the worst GUI test cases I have ever seen.

            What is tested often makes no sense at all, completely implausible edge cases are tested on internals, while it doesn't create tests for the overall application using user events.

            And some things in these test cases are downright ridiculous: instead of instantiating your classes, it sets up some barebones fake objects reimplementing some of the behavior of your actual class, then ignores the TypeScript errors via force cast or similar.

            Then it proceeds to slap some test ids on the output, stubs components and dependencies more or less randomly, adds some assertions on test ids and calls it a day.

            Apparently that's good enough for many colleagues to open an MR for that garbage.

            That said, at home with SOTA models I happily hand large units of work to it, outsource much of the thinking, and get workable results. I think this is the future.

          • dakolli 6 hours ago

            I disagree, fundamentally.

            I see little value in throwing a ton of context at an LLM and waiting 10-20 minutes for a coin flip on whether or not it's going to produce junk. I'd rather do quick 60 second turns, get most of the way there and fix the rest myself if I have to. I'd rather honestly just not use them.

            • ffsm8 5 hours ago

              Well the point was that I'd rather spend 30 seconds doing it myself than formulate a prompt with enough context for the model to implement it within 60 seconds. Also these numbers are unrealistic.

              Everyone that I've ever interacted with who claims to prompt in "seconds" actually needs multiple minutes to think about the solution they want the model to implement, and then needs twice as long to formulate that into a sentence which provides the model enough context to actually do that.

              So the more realistic estimates are "I'd rather spend the 2 minutes just implementing the minor change myself, instead of spending 1.5 minutes thinking about it, then 2.5 minutes writing the prompt and then waiting 1 minute for it to finish"

              • dakolli 5 hours ago

                I would agree with all those points, and my numbers are a little off. I really just don't want to use any of it. I'm more excited about fast FIM autocomplete that works well, something like Cursor Tab without Cursor. If something can increase my wpm and take strain off my fingers that would be nice. At this point latency and accuracy are terrible though.

            • atq2119 4 hours ago

              The trick is to do something else in those 20 minutes (or, ideally, even longer).

              That's the main value I've been getting out of coding agents. I have them do (comparatively) simpler tasks or explorative tasks in the background while I'm in a meeting, doing code reviews, or otherwise working on something else.

    • jatora 5 hours ago

      I also love the term zero-shot in the AI benchmark world. So logical. So intuitive.........

    • moistoreos 4 hours ago

      PREACH. I have no idea why THIS has become the standard for illustrating model capabilities. It's endlessly frustrating when that was the initial objective for all these models, but it became increasingly clear over time that none of these models were ever capable of getting the desired output for complex software on the initial prompt.

      The reality is:

      - business rules change

      - ideas for improvement may arise from the initial prompt

      - updates to submodules/functions/configs/secrets are BLOCKERS

      - ... etc.

      One shot prompting with the expectation of complete software is seemingly more and more a show of incompetence in the use of this technology. It's like trying to make my toddler eat a ham sandwich from the peanut butter & jelly I put in front of him.

    • halyconWays 10 hours ago

      "We did multi-shot prompting to try and get these two games into comparable states using these two different models."

      "Well obviously you provided better follow-up prompts to the one that came out better."

      Also nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?

      • pegasus 9 hours ago ago

        Sure, real-world usage is always more difficult to benchmark, but the additional issue with the one-shot prompting benchmark is that by optimizing for it, models are nudged towards making all those assumptions they shouldn't really make. Maybe a better test would be to have a fully spec'd-out plan, but start with a one-shot, high-level prompt and expect the agent to discover your preferences by repeatedly asking for clarifications. The system that manages to suss out more of the details in the hidden spec this way, in fewer steps and with fewer unnecessary questions, would more likely be a truly well-calibrated agent.

    • irthomasthomas 9 hours ago ago

      Blame Anthropic: they decided to make these types of one-shot examples the primary focus of the Fable 5 release, relegating benchmark scores to the PDF.

    • miroljub 8 hours ago ago

      That's precisely the difference between an engineer and a business guy.

      The business guy would say "hey build me this and that" and would get _something_ to show off.

      An engineer will have a long conversation with an LLM about the exact requirements, tech stack, tradeoffs. He would understand what is built, how it is built, and refine on the fly until he gets something sensible.

      It won't be as fast as "build this", but the result will be much better and more maintainable.

      For the engineering workflow, you don't need Fable. Any model better than or equivalent to Sonnet 4.6 would do. Yes, sometimes it will hallucinate, sometimes it'll be wrong, but it's our job as engineers to correct it and have full ownership of the result.

      • tw1984 5 hours ago ago

        what you said above is only true when the AI is not as smart/professional/knowledgeable as that engineer.

        • miroljub 4 hours ago ago

          Of course it's not. Otherwise, before telling it "do this app, make no mistakes" you would need to feed an AI with the complete relevant knowledge, history, and constraints, and then your prompt wouldn't be a one-liner, but a 3000-page document.

          And yet, even the smartest AI in the world would give an alternative solution every time you invoke it. And you still need someone to judge what is right and what is not.

    • scotty79 7 hours ago ago

      Single prompt performance is interesting because best agentic results of yesterday turned out to be best single prompt results of today.

      If we stopped developing LLMs, the only reasonable way to benchmark them would be to compare their performance with all the tricks we can build on top of them. Since they are still developing rapidly, any apples-to-apples comparison is worthwhile.

      Of course this particular benchmark is not really single prompt but rather "agentic without steering".

    • alfiedotwtf 8 hours ago ago

      I think that’s the point of the Superpowers SKILL

    • LoganDark 9 hours ago ago

      The thing with one-shot prompting is that it tests the ability for the model to make good choices on its own, rather than only instruction following.

      Instruction following has been nailed down for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as-written if the instructions are precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).

      For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.

      That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.

      For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.

      That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron that is.

    • epolanski 10 hours ago ago

      Yet this is how virtually everybody is benchmarking and fine tuning.

      Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.

      It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.

      I can't but blame it on benchmarks and fine tuning around prompt-to-solution work.

  • jameson 11 minutes ago ago

    > Opus 4.8 built in Claude Code; GLM-5.2 built in Pi over OpenRouter.

    It would be more interesting and accurate to see the comparison on the same harness if the intent is to compare the frontier models.

    Pi is relatively new and does not have many features built-in compared to Claude Code. It was designed this way intentionally: Pi's goal is not to ship a bloated set of built-in tools most people don't use, but to let users customize it to fit their needs -- similar to Neovim vs IDE.

    The end-user "vibe coding" experience is *heavily* swayed by the harness because the prompt effectively drives how a model outputs an answer.

  • meander_water 11 hours ago ago

    > So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

    Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

    Most agent usage is collaborative so you need to test things like reliability (when I delegate a task, does it complete it without making up test results for e.g.) and steerability (does it obey my instructions or does it just do what it thinks is best).

    • jameswhitford 11 hours ago ago

      Hi, I am the author, I completely agree! I set out to run a vibe test on this one, not a benchmark, the real benchmarks are listed. My test shows what the models can do when both tasked with a long-running, technically difficult, one-shot task.

      I think your test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

      • ramraj07 an hour ago ago

        The important point is that your benchmark is pretty much irrelevant for the actual usage. Thus whatever conclusion you draw is not just irrelevant but misleading.

      • wongarsu 10 hours ago ago

        Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard

        • oceansky 5 hours ago ago

          And personal too. Different engineers are using them for different use cases.

      • meander_water 10 hours ago ago

        Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

        Appreciate you sharing the results of your tests though!

    • esperent 10 hours ago ago

      On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long running task. It's been running about 10 hours now and it's mostly done. So this kind of use case is also valid.

      Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.

      • thunspa 6 hours ago ago

        Care to share more about your pi setup? I've recently started using it (after long-time Claude Code work) and was wondering how you'd achieve these long-running tasks. Do you allow it to spawn sub-agents? Thank you!

        • esperent 5 hours ago ago

          My pi usage over the past ~5 months went roughly like this:

          * Install pi and a bunch of extensions from their package repo

          * Realize that all the packages (with a few exceptions) are massively overcomplicated and vibe coded

          * Ask pi to rebuild a very simple version of the packages I used. So e.g. subagents - all the default subagent extensions are massively complicated with named agents, recursion, communication. I made one that stripped all that out.

          * Then whenever I hit an annoyance, spin up a parallel session and fix it.

          It's less work than it appears because I have ~5 extensions: hooks, subagents, background processes, a custom footer, a loop command... Maybe that's it. Within a couple of days you can have a setup pretty close to Claude Code but with a fraction of the base context use. After gradual improvements over a few weeks/months you'll have a system far better, tuned to your exact preference.

          Of course, just like Linux or any other highly tunable system equally important is having the restraint to not spend all your time tuning it. I've definitely had a couple of days where I was bored with my real work and did that, but whatever, it beats browsing reddit.

          As for getting long running tasks, I set a looping message every ~20m and tell the agent to strictly track progress in a session doc, then reread and continue after each compaction.

          • ijidak 3 hours ago ago

            What type of task are you running for ten hours? Is this a programming task?

            I've not come across a programming task that would take an LLM ten hours.

            • nfriedly 20 minutes ago ago

              I'm not the person you asked, but if they're running on their own local hardware, then it might just be a lot slower than what the big providers run their models on. System RAM is a lot cheaper than VRAM, especially if you bought it last year.

      • jameswhitford 9 hours ago ago

        Yes, part of the reason I chose the one-shot test was really to test long-running tasks. A lot of people seem to be experimenting with this format, for example in the now trending loop-writing workflows. And really I am interested in diving into the murky waters of these novel workflows.

    • ritzaco 11 hours ago ago

      sure that's why we look at a mix of formal benchmarks, one longer analysis of a side-by-side, and various other people who we trust to form an opinion, all covered in the article - not intended to be a formal benchmark, there are enough of those.

      • patates 10 hours ago ago

        Then maybe you should add that caveat emptor to the article?

        You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.

    • unliftedq 10 hours ago ago

      Totally agree, a single one-shot prompt can't prove anything.

  • faxmeyourcode an hour ago ago

    I feel like another comparison worth looking at is purely cost.

    Capability per dollar is something I care about:

        Opus API    $5/$25
        Sonnet API  $5/$15
        Haiku API   $1/$5
    
        GLM 5.2 API $1.4/$4.4
    
    So you're really getting near-Opus-level capability for the price of Haiku.

    • cmrdporcupine 17 minutes ago ago

      Not really, GLM uses more tokens to get work done.
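
      A rough way to put numbers on that, assuming the prices quoted above are USD per million input/output tokens. The workload size is made up for illustration, and `job_cost` and `break_even_multiplier` are hypothetical helper names; this is a sketch of the arithmetic, not a measurement of either model.

```python
# Prices as quoted in the parent comment, assumed to be USD per million
# tokens as (input, output).
PRICES = {
    "opus": (5.00, 25.00),
    "glm-5.2": (1.40, 4.40),
}

def job_cost(model, input_tokens, output_tokens, token_multiplier=1.0):
    """USD cost of one job; token_multiplier scales both token counts."""
    price_in, price_out = PRICES[model]
    scaled_in = input_tokens * token_multiplier
    scaled_out = output_tokens * token_multiplier
    return (scaled_in * price_in + scaled_out * price_out) / 1_000_000

def break_even_multiplier(input_tokens, output_tokens):
    """How many times more tokens GLM can burn before it costs as much as Opus."""
    opus_cost = job_cost("opus", input_tokens, output_tokens)
    glm_cost = job_cost("glm-5.2", input_tokens, output_tokens)
    return opus_cost / glm_cost

# Hypothetical workload: 1M input tokens and 200k output tokens.
opus = job_cost("opus", 1_000_000, 200_000)
glm = job_cost("glm-5.2", 1_000_000, 200_000)
limit = break_even_multiplier(1_000_000, 200_000)
print(f"Opus ${opus:.2f}, GLM ${glm:.2f}, break-even at {limit:.2f}x tokens")
```

      With that input/output mix, GLM would have to use more than about four times as many tokens before the price advantage disappears; the threshold moves with the ratio of input to output tokens.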

  • lukaslalinsky 4 hours ago ago

    I was never able to get these models to collaborate with me the way Opus does. I'm probably an outlier, I don't one-shot projects, I don't vibe code. I basically use LLMs as if I was working with a coworker, a fairly smart one, but with short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).

    • Terretta 3 hours ago ago

      > Opus 4.6 which was bad for some reason

      If I recall, that model had a couple issues. One was the issue of being monkeyed with, for which they gave everyone credits.

      The other feature/bug, depending on your POV, was being Anthropic's least personable release, not papering over everything with self help guru therapy language.

      Opus 4.6 didn't LARP. It was more direct, less fussy, less discussy, and very much less "wait, one more thing" within a couple edits after embarking on what should have been the spec, than 4.7 or 4.8 are.

      When in engineer brain mode, working as you describe (good old fashioned XP-style staff engineer pair programming with a language-savvy mentee not yet full-stack or system wise), I found the clearer I was about my goal and the better I could express it, the more often I'd get an expanded clarified response I could then iterate to steer for ever tighter cleaner more specified responses, then let it go build the whole thing without it agonizing and waffling.

      The next two releases regressed on that dimension, wanting to figuratively "sit with" every decision and re-validate spiritual alignment along the way, no matter how clearly expressed.

      Curiously to me, Fable seemed to hit the best of both worlds, I had the highest commit per turn with Fable, approaching 73%, where I'm usually under 17% of LOC written being good enough to commit, usually taking 9 - 11 turns to get the code where I'm comfortable with it.

      Thanks to this, Fable cost more, but actually cost less, if that makes sense.

      Arguably, Fable, and 4.6, played more outcome-correctness oriented than journey-experience oriented. It's easy to see how this could happen with human reinforced learning if not all judges are staff or principal engineer level, or constitution values are more Portlandia than Finlandia.

      ANTHROP\C needs to balance these at the constitution level:

      “We will work in a humane and thoughtful way, but production is the final judge. We will listen to people, but we will not let discussion replace decision. We will value craft, but not at the expense of usefulness. We will move fast, but not by hiding risk. We will measure outcomes, but not pretend that everything important is easy to measure.”

    • x312 3 hours ago ago

      A lot of open weight models don't understand intent well, they'll overfixate on a word in the prompt or just go off the rails trying to do too much work.

      GLM-5.2 actually has really good intent understanding though, on par with GPT-5.5 and Opus from my experience.

    • therealdrag0 3 hours ago ago

      What do they do instead of collaborating?

  • xlii 10 hours ago ago

    I've been checking out GLM 5.2 on some projects and few thoughts on it:

    - it takes its sweet time to get code rolling, not the fastest model by any means

    - it strays a lot during discovery/planning but then corrects

    - it's not steering friendly, as it hallucinates things that it doesn't follow later on

    - its output is quite good

    A sample use case: I was optimizing rendering on Swift+Zig codebase. It choked on 5k data entries.

    GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which made me frustrated so I blocked non-editing tool access and went AFK, after approx. 30 minutes I found that it used already-made benchmarks and some "conclusions" to optimize 3 choke points. Output pointed that it couldn't validate suspicions and asked for more data.

    Implementation worked well, it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5 effects on same repo.

    I would opt into using it more BUT GPT usually completes same requests 5x faster.

    GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).

    • nijave 3 hours ago ago

      >it takes it sweet time to get code rolling, not the fastest model by any means

      Which provider are you using? I got a z.ai Lite Coding Plan and it's my understanding z.ai is on the slower side of providers and the Lite plan gets lower priority on top of that. In the api key console, it shows dipping below 60 tok/sec which is quite slow.

    • trollbridge 7 hours ago ago

      I used it the other day for something of low importance that other models simply weren't figuring out and I didn't want to burn up Opus 4.8 on. (It had to do with overriding left-click on a macOS menu bar and then making Ctrl+click or right click bring up the menu like left-click normally does, and doing all this conditionally.)

      Switched the model to GLM-5.2 halfway in the middle of a troubleshooting session (didn't even bother to reprompt, just changed it in the middle of its reasoning), gave it a few minutes, problem fixed. This is with the subscription based allocation on OpenCode Go, where a problem like this would completely burn up my Opus for the current 5 hours or even the current week.

    • jeremyjh 7 hours ago ago

      It's also nice that you can see its entire reasoning trace. I can see it going off the rails - or see something I forgot to tell it - and stop and correct it. Or I'll learn WHY it made the choice it did and not have to question it after.

      • jauntywundrkind 7 hours ago ago

        Strong agree! I deeply appreciate this aspect of GLM. Watching it think & being able to nudge early is incredibly useful. Being able to point at bad assumptions is incredibly useful. Watching what it's seeing is super informative.

        It's always a shock to me how opaque most other models are!

        It's also pretty resilient when you inject something while it's working: it doesn't go off course, and it gets back on track after, which I appreciate.

        • Sanzig 4 hours ago ago

          > It's always a shock to me how opaque most other models are!

          This is (unfortunately) by design. The proprietary models hide their reasoning traces so they can't be used for model distillation. Sometimes even when they do show reasoning, it isn't the model's real trace - IIRC, someone was able to demonstrate that Opus' reasoning is usually a summary made with Haiku behind the scenes.

          • braebo 2 hours ago ago

            It is such a momentum killer being forced to stare at a silly word for 4 minutes instead of being able to read the thinking as it streams in. I can’t wait until I can drop Anthropic at work. Their UX sucks, intentionally, for anti competitive reasons like “don’t distill our model we trained on all the data & IP we stole and processed with the mass exploitation of data workers in the global south!”.

    • Oras 10 hours ago ago

      Also pricing, I wanted to give a try, but when pricing is only 30% cheaper than Opus, I wouldn't go for it with these issues.

      • nijave 3 hours ago ago

        z.ai coding plan is a fairly decent deal at ~$16/mon USD considering it's supposed to have a fair bit more usage than the comparable $20/mon Claude plan. On the other hand, z.ai seems a bit on the slower side for raw model tok/sec throughput.

      • chpatrick 8 hours ago ago

          Its pricing is a lot cheaper if you can run it yourself.

        • nijave 3 hours ago ago

          Not this one. It's a SOTA-class model >800Gi VRAM required at fp8

      • jeremyjh 7 hours ago ago

        What?

        It is less than 20% of the cost of Opus at API rates. 1.40/4.40 vs 5/25.

        • cmrdporcupine 6 hours ago ago

          Not when you factor in token efficiency. It burns a lot more tokens to do the same job, so when I compared to GPT5.5 I was frankly not really much ahead, and with weaker thinking.

          Maybe makes sense if you have z.AI's (not greatly priced) subscription plan, but it's not competitive against an OpenAI or Anthropic monthly coding subscription plan. I burned through almost $10 worth of tokens just doing an hour of work.

          • Sanzig 4 hours ago ago

            Take a look at Ollama Cloud: https://ollama.com/pricing

            You get access to a whole bunch of bleeding edge open models including GLM-5.2, Kimi K2.7, DeepSeek 4 Pro, etc. Inference is run on US/SG/EU cloud providers with zero data retention policies. The $20/mo tier is very generous, in my experience.

            • jeremyjh an hour ago ago

              They don’t have a statement about where it is run or data retention on the GLM5.2 model. They do state that for others, like MiniMax.

    • Imanari 10 hours ago ago

      This mirrors my experience. I have been using it in Pi. It is smart and output is good but it is not efficient in getting there.

      • ju-st 10 hours ago ago

        which thinking level? max or high?

  • toddmorey 5 hours ago ago

    I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, seems like it could be set up to reach out to a multimodal model for vision tasks when required. Closer to parity but probably still significantly cheaper.

    • horsawlarway 4 hours ago ago

      I'm also very impressed at the output given the lack of image support.

      They picked a task that heavily favors a model that can do multi-modal with images, and GLM still came within striking distance.

      What I'm hearing from this article is that the next generation of open models that includes better multi-modal support are basically no-brainers for adoption.

      Seems like a HUGE win for Z.ai and open models in general here.

  • coreyburnsdev 2 hours ago ago

    People are looking for ways not to burn through their premium subs when in many cases all you have to do is move down to 5.4-mini codex and it will probably solve your issue while barely touching your 5 hour or weekly limits.

  • ulrikrasmussen 10 hours ago ago

    > Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.

    I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?

    • trollbridge 7 hours ago ago

      GLM-5.2 performing like it would from a good provider - 8x B200s, so $450k. (No personal experience here)

      GLM-5.2, severely quantised, 512GB Mac Studio, somewhere between $10k-$35k for a used M3. Or run it on a CPU with 768GB of RAM by getting an old PowerEdge with DDR4 for around $5,000.

      Qwen-3.6-35b-q6, runs well on an RTX 5090 ($4000 + cost of a PC), runs mediocre on an Intel Arc B70 ($1000 + cost of a PC plus lots of fiddling to get the setup to work right).

      Gemma is a good candidate for the cheaper stuff, but I lack personal experience with using it locally

    • jack_pp 10 hours ago ago

      This framing of local LLMs as free is stupid. Basically paying 100+ months' worth of API costs up front isn't free in the slightest. And it will be slower than non-local, your hardware will be outdated in 12 months and probably won't be able to run SOTA at anywhere near non-local speed in max 20 months
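
      As a back-of-the-envelope check, here is a small sketch of the break-even arithmetic. The hardware prices are the rough figures quoted elsewhere in this thread; the $200/month API spend is a made-up assumption, `months_to_break_even` is a hypothetical helper, and depreciation and resale value are ignored.

```python
# Rough hardware prices in USD, taken from figures quoted elsewhere in this thread.
HARDWARE_USD = {
    "8x B200 server": 450_000,
    "512GB Mac Studio (low estimate)": 10_000,
    "old PowerEdge with 768GB DDR4": 5_000,
}

def months_to_break_even(hardware_usd, monthly_api_usd, monthly_power_usd=0.0):
    """Months until the hardware pays for itself versus paying for an API.

    Returns None when running costs eat the whole saving, in which case
    the hardware never pays for itself.
    """
    monthly_saving = monthly_api_usd - monthly_power_usd
    if monthly_saving <= 0:
        return None
    return hardware_usd / monthly_saving

MONTHLY_API_USD = 200  # hypothetical spend, not a figure from the thread

for name, price in HARDWARE_USD.items():
    months = months_to_break_even(price, MONTHLY_API_USD)
    print(f"{name}: {months:.0f} months to break even")
```

      At that assumed spend, even the cheapest option takes about two years to pay off, and the 8x B200 setup never realistically does for a single user.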

      • ulrikrasmussen 10 hours ago ago

        Yeah, it glosses over a gigantic capital expenditure. It's sort of like saying that an open source modern CPU architecture allows you to build your own CPU "for free" (provided that you own and operate a fab).

      • cicko 10 hours ago ago

        True. But there are other meanings of "free". I.e. nobody can say "from now on you no longer have access to model X because you're an asshole"

        • trollbridge 7 hours ago ago

          Some obvious examples of why you'd want to spend the capital on this would be, for example, making some kind of autonomous system which needs to periodically be offline, or you need complete confidentiality of what you're using the model for, etc.

          To be cost effective with inference providers, you have to find some way to be using it 24/7.

        • Der_Einzige 4 hours ago ago

          The ecosystem for inference is centralized around a few core projects, i.e. vLLM, sglang, and llamacpp.

          If they decided to collude, they could absolutely say "from now on you no longer have access to model X because you're an asshole"

          The commercial inference offerings are also downstream of one of those 3 projects (or trt-LLM if they're nvidia). It would impact Ollama, and fireworks, together, and everyone else.

          Don't tempt fate.

    • bestouff 10 hours ago ago

      The price of a small house.

    • crimsoneer 10 hours ago ago

      Practically nobody.

  • stevenhubertron 5 hours ago ago

    No one has really talked about hybrid and using Opus to plan and orchestrate GLM's work both through initial build and code reviews. That’s a true best of both worlds and there doesn’t need to be a winner.

    • jeremyjh 4 hours ago ago

      This is the way but Anthropic doesn’t make it easy, so I use GPT 5.5 in that role since I can use my subscription in OpenCode or OMP.

      I also use MiniMax-M3 in utility roles like explore/library tasks.

      I’ve had a z.ai subscription for several months so I’m on the older pricing. I’m really not sure it would make sense to do this at current rates - I could bump my Codex plan instead.

    • mattew 4 hours ago ago

      I mostly use Opus for skill development. Once I have a solid skill implementation with a good eval, I move ongoing execution to a cheaper model running under Goose. With the eval you can see if the cheaper model works well enough.

  • postatic 10 hours ago ago

    I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying it out. GLM is the first model that I am using on a daily basis to do my coding work (as well as using Claude). It's good - I've been maxing out my Ollama usage limits everyday :)

    • jameswhitford 9 hours ago ago

      Cool to hear, what kind of tasks have you been using GLM for? And what other models have you found useful through Ollama?

  • xg15 9 hours ago ago

    So GLM emits fewer tokens and does fewer tool calls, but still takes over twice as long to complete.

    Can someone explain to me where that time usage is coming from if not from the model operation itself?

    Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?

    • iagooar 8 hours ago ago

      I have noticed that Opus and GPT 5.5 are very good at adjusting their thinking / reasoning intensity depending on the task at hand, something the open weights models are still not as good at.

      In addition to that, some of the open weights models like GLM 5.2 or DeepSeek v4 Pro tend to be MUCH slower when generating tokens, which contributes to the perceived slowness. Although I wouldn't call models like GLM 5.2 slow by any means, e.g. it is currently one of the fastest models inside Notion today.

    • twobitshifter 7 hours ago ago

      Probably the data center where the model is running more than anything. Another option is if Opus is using anything like a Mixture of Experts approach, in which case the amount of the model loaded in memory at one time could be smaller than GLM.

    • radu_floricica 9 hours ago ago

      Could just be infra. I'm betting Anthropic is much better prepared.

  • js4ever 7 hours ago ago

    "GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."

    A better way would be to use https://github.com/openbmb/MiniCPM-V
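
    For reference, the pixel-data workaround described in the quote could look roughly like the sketch below. This is not the script GLM actually wrote; the helper names, the tolerance, and the synthetic 4x4 frame are all made up for illustration. It assumes a flat RGBA byte buffer of the kind a WebGL `readPixels` call returns (which, note, delivers rows bottom-up).

```python
def mean_color(pixels, width, region):
    """Average (r, g, b) over region = (x0, y0, x1, y1) of a flat RGBA buffer."""
    x0, y0, x1, y1 = region
    totals = [0, 0, 0]
    count = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            offset = (y * width + x) * 4  # 4 bytes per pixel: R, G, B, A
            for channel in range(3):
                totals[channel] += pixels[offset + channel]
            count += 1
    return tuple(total / count for total in totals)

def roughly_matches(actual, expected, tolerance=30):
    """True if every channel is within `tolerance` of the expected value."""
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))

# Synthetic stand-in for a rendered frame: top half sky blue, bottom half green.
WIDTH, HEIGHT = 4, 4
SKY, GROUND = (135, 206, 235, 255), (34, 139, 34, 255)
frame = bytearray()
for y in range(HEIGHT):
    for x in range(WIDTH):
        frame.extend(SKY if y < HEIGHT // 2 else GROUND)

sky_ok = roughly_matches(mean_color(frame, WIDTH, (0, 0, 4, 2)), (135, 206, 235))
ground_ok = roughly_matches(mean_color(frame, WIDTH, (0, 2, 4, 4)), (30, 140, 30))
print(sky_ok, ground_ok)
```

    A check like this can tell you the sky is blue and the ground is green, but not whether the platform geometry is upside down, which is why a vision model is the better tool here.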

    • twobitshifter 7 hours ago ago

      Right, just give the text llm access to a vision specific agent and that problem can be solved. Or if you really want let it even call Opus with an image - seems like you’d still save money

  • InsideOutSanta 2 hours ago ago

    One nice thing about GLM is that it has never refused a task. I'm working on a website that renders countries right now, and Anthropic's models regularly give me the old "This request triggered safety guardrails."

    I'm not sure what exactly triggers it, but it seems to happen when it has to look at lists of countries. I suspect there must be at least one country name that triggers the safety guardrail.

    You'd expect GLM to balk at something like Taiwan, but so far, it hasn't.

    • johnnyApplePRNG 12 minutes ago ago

      The amount of times I have had to spend tokens to attempt (in futility) to convince a proprietary model that the request I asked it to perform on code that I wrote is safe/legal/moral is insane.

      Part of me wants to believe they really do care about protecting the world from... something... I don't know quite what exactly tbh... but it must be costing them a small fortune to scan each input and output against N guardrails and they are a for-profit corporation who could easily turn a blind eye to all of this and simply say "what you do with this model is on you" like I would expect most corporations to.

      Strange times.

  • XCSme 3 hours ago ago

    Check out my comparison too, it has some not-really-benchmarks too (between any two models actually, SVG generation test and CSS animation test):

    https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...

  • mellosouls 2 hours ago ago

    > GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game

    This implies Opus was potentially much (?) better value.

    GLM cost a quarter but Opus was twice as fast. So we are already at GLM actually costing half when you compare on time, without even considering the extra effort and time it would take to get Opus-par results.
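
    One way to make that trade-off explicit is to put a price on waiting. In the sketch below the dollar amounts and hours are invented and only preserve the ratios above (a quarter of the cost, twice the time); `total_cost` and `break_even_hourly_value` are hypothetical helpers.

```python
def total_cost(api_usd, hours, hourly_value_of_waiting):
    """API bill plus whatever the wall-clock time is worth to you."""
    return api_usd + hours * hourly_value_of_waiting

def break_even_hourly_value(fast, slow):
    """Hourly value of waiting at which the fast and slow options cost the same.

    `fast` and `slow` are (api_usd, hours) pairs.
    """
    (fast_usd, fast_hours), (slow_usd, slow_hours) = fast, slow
    return (fast_usd - slow_usd) / (slow_hours - fast_hours)

# Hypothetical run: Opus costs 4x as much but finishes in half the time.
OPUS = (4.00, 1.0)
GLM = (1.00, 2.0)

threshold = break_even_hourly_value(OPUS, GLM)
print(f"GLM is cheaper overall while waiting is worth less than ${threshold:.2f}/hour")
```

    If the run happens in the background and the waiting time is worth close to nothing, the cheaper model wins on these numbers; if someone is blocked on the result, the faster one does.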

    It's good to have cheaper options and very impressive to see the Chinese continue to set open standards in this field, but the article is maybe a little over-generous.

    • InsideOutSanta 2 hours ago ago

      For me, time doesn't matter for LLMs. I can start a bunch of tasks, and I'll review the PRs when they're done. Faster is nicer, but if the task gets done correctly, I'm good.

  • wiremine 5 hours ago ago

    I've been using GLM 5.2 extensively for the last few days. It is slower, and the lack of multimodality is a bummer.

    But, it produces solid results for a fraction of the price. Worth checking out if you have the time.

    One of my go-to "tests" of a new frontier model is having it rebuild a programming language from scratch. For GLM 5.2 I had it rebuild the old Rebol language in Rust:

    https://github.com/mhs/rebol-clone-glm-5.2

    It did a fairly good job roughing in the language for a low token cost.

  • david_shi 10 hours ago ago

    > GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.

    Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.

    • yard2010 6 minutes ago ago

      I cannot unsee it.

      There was this dude here not long ago who bought like $70k worth of GPUs to do research, and if I'm not mistaken his research was something related to making LLMs sound less LLM-y. I wonder how it's going for him.

    • speedgoose 10 hours ago ago

      I think a bunch of real humans started to adopt the LLMs writing style.

      • himata4113 10 hours ago ago

        Yep, as I reread my own sentences I notice these LLMisms and have to rewrite them quite often. Reading so much LLM output definitely impacts your writing style.

      • lelele 9 hours ago ago

        Indeed. I'm trying to develop a similar style. The phrasing in the quoted passage is really tight.

    • jameswhitford 9 hours ago ago

      This is excellent feedback, thank you! These LLMisms in writing are a challenge I am living with currently and trying to improve on. The technical writing industry is taking a huge knock right now, with companies demanding more work in less time with a big drop in quality; day to day I get less and less time to work on the quality of the prose in my work. We are working at the frontier of this right now, so we are the most heavily affected, but we also get to experiment with the changes first, which can be both stimulating and very frustrating.

    • VulgarExigency 9 hours ago ago

      Yes, and it's really grating. It's like half of all new writing is done in the same "voice" now.

  • pietz 9 hours ago ago

    GLM 5.2 has one big issue that will limit its meaningful success and that's the value of their coding subscription.

    Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.

    At the same time, none of these companies will use a Chinese API for their employees.

    For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.

    We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.

    So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6 and how much will OpenAI and Anthropic cripple their current offering in the near future.

    • Certhas 9 hours ago ago

      My impression is that individual subscriptions are the loss leading hook. The money is made on Enterprise token contracts.

      Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.

      Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.

      The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.

      • LUmBULtERA 8 hours ago ago

        > My impression is that individual subscriptions are the loss leading hook

        Except there is no evidence of this at all, just people comparing API and subscription pricing. The leaked financial info for OpenAI shows inference is profitable right now, though it does not show a distinction between subscription and API revenue... but if subscription revenue was so lossy, it would be hard for total inference to still be profitable.

        • CuriouslyC 7 hours ago ago

          Anthropic has indicated in the past that API gross margins are ~60%. This might have improved since then, though competition from OAI puts a ceiling on that.

          • LUmBULtERA 7 hours ago ago

            Subscription inference can also be cheaper than the cost of API inference if the provider wants it to -- providers can do flexible scheduling for subscription inference for example, around API inference, to lower its cost and get better utilization of the hardware.

        • Certhas 7 hours ago ago

          I did clearly say "my impression is". And you have no evidence to the contrary. We don't even reliably know how many subscribers vs. enterprise customers they have. And the OpenAI leak doesn't even cleanly say that inference is profitable from what I can tell... The better evidence that it probably is are the prices charged by open weight model providers.

          • LUmBULtERA 7 hours ago ago

            Fair enough, there is not strong specific evidence to the contrary except about overall inference being profitable for OpenAI (as well as the open weight model providers hosted throughout the world).

      • fbnszb 7 hours ago ago

        > The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.

        I believe this is the reason why we can even have this debate. Without this kind of competition we would not have these subsidies.

      • pietz 8 hours ago ago

        To be clear, I agree with this and they have my unlimited support pushing for relevance of open source models. GLM 5.2 is amazing and I couldn't be more excited.

        I just think that as of today, most people will not find a good reason to switch to GLM.

    • twobitshifter 7 hours ago ago

      Taking a view from outside the USA, European companies just had Fable taken away due to US export controls, and before that Anthropic announced it is holding their data for 30 days. There is immediate value to these firms to build their infrastructure around an AI that won’t be pulled away from them. And outside of Europe, other countries are more price sensitive and don’t have the same fear of building relationships with Chinese companies.

      • WarmWash 5 hours ago ago

        There is no such thing as a relationship with "chinese companies". In China there is just the State, and that is it.

        If the world needs any more evidence of Europe's short-sightedness, it would be them running to China to spite the US (instead of creating fertile grounds for their own tech).

        • metobehonest 4 hours ago ago

          No one is running to China to "spite the US". Recent geopolitical developments have shown the US to be a violent, unpredictable and unreliable partner.

      • SubiculumCode 5 hours ago ago

        And you have that guarantee from Xi?

        • bornfreddy 2 hours ago ago

          With open weights? Yes. It might hallucinate a backdoor somewhere (not that you can trust any model about that), but it will still work.

    • edg5000 7 hours ago ago

      This is an important point. I suspect API pricing will eventually disappear just like how paying for an MMS disappeared. It's an antiquated model. The bulk of the work is being done on "coding plans" is my wild guess.

      It's annoying that the plans are so restrictive beyond usage limits. Understandable maybe, but annoying. In practice, only Anthropic (and maybe Google) are really restrictive though. They really scared me away with their policy of charging API rates after the fact if they consider your usage not TOS-aligned. This might be an ungrounded fear that I have, but I feel this is something they'd do so they scared me away.

    • HarHarVeryFunny 6 hours ago ago

      > But the only people that use API billing for their coding work are large corporations

      As well as people using 3rd party harnesses like OpenCode.

      > At the same time, none of these companies will use a Chinese API for their employees

      So who are Amazon Bedrock (who serve GLM) targeting?

      Individuals are presumably going with one of the cheaper US providers such as DeepInfra ($0.18/M cached input for GLM vs $0.50 for Opus) or Fireworks AI.

    • veber-alex 5 hours ago ago

      The value of these models is that you can run them on your own hardware.

      A company can buy an NVIDIA B300 and serve its developers in house with unlimited tokens.

    • tw1984 5 hours ago ago

      > At the same time, none of these companies will use a Chinese API for their employees.

      nice try but you intentionally ignored the entire Chinese market & Chinese big corporates. there are 130 Chinese companies in the fortune 500 list, with an average revenue of 80 billion USD each. do you think they are going to sign up for Claude, Codex or GLM? now consider South East Asia, Africa, Middle East, Middle Asia and South America, tell me why their large corporates won't be using GLM API billings?

      your western centric view of the world is totally out of date, like it or not, 2026 is vastly different from 1996, the US no longer controls high tech whatsoever.

    • tpm 7 hours ago ago

      Also, I was testing out the GLM 5.2 using Openrouter because that's where I've got an account with some money and then when I wanted to perhaps subscribe for a better deal at z.ai, their infra was clearly overloaded to the point the 5.2 was timing out on 100% of chat requests, so perhaps I will try later when the infrastructure catches up with the model capability. Only then I can make sure their subscription is worth it.

    • jauntywundrkind 7 hours ago ago

      I'm on glm pro subscription and I get so so so much more usage than Claude or Codex! I hammer on glm all day. It's a more expensive plan, but I would need a much much much bigger plan for codex or Claude to do what I do.

  • elliotbnvl 3 hours ago ago

    It is insane that we are comparing locally-hostable models to leading cloud providers, it is wild to me that this article even exists.

    We have come a long way, and very clearly have a long way yet to go.

    • nijave 3 hours ago ago

      Calling GLM-5.2 locally hostable is a bit of a stretch. It's 1.5Ti of weights at bf16. FP8 requires >800Gi of VRAM which is well into data center multi-GPU systems
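
The sizing above can be roughly reproduced with back-of-the-envelope arithmetic. This is a sketch under stated assumptions: it uses the 756B parameter count quoted elsewhere in this thread, counts the weights only, and ignores KV cache, activations and runtime overhead, which is presumably why a practical FP8 deployment needs more than the raw weight size. The function name is hypothetical.

```python
GIB = 2**30  # bytes per gibibyte


def weight_size_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in GiB (no KV cache, no activations)."""
    return num_params * bytes_per_param / GIB


PARAMS = 756e9  # parameter count quoted in this thread (assumption, not verified)

bf16_gib = weight_size_gib(PARAMS, 2)  # bf16 stores 2 bytes per parameter
fp8_gib = weight_size_gib(PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"bf16: ~{bf16_gib:.0f} GiB (~{bf16_gib / 1024:.2f} TiB)")
print(f"fp8:  ~{fp8_gib:.0f} GiB")
```

Under these assumptions the bf16 weights come to about 1.5 TB in decimal units (a little under 1.4 TiB), and FP8 halves that before any serving overhead is added.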

      • elliotbnvl 3 hours ago ago

        It's more about the trajectory.

  • jkwang 10 hours ago ago

    GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.

    • em500 10 hours ago ago

      We've had the great small Qwen 3.6 early April that many could actually run on their laptop. Then similar from Google a few weeks later (Gemma4, better in prose, worse in code). Then the super cheap large Deepseek V4 a few weeks later. Then antirez DS4 build that made that actually runnable on MacBooks and Mac Studios. And now the "near-frontier / near-Opus" GLM 5.2.

      For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.

      • CuriouslyC 7 hours ago ago

        The difference here is that those small models are impressive, but not super useful. Deepseek 4 is impressively cheap for the intelligence, but not reliable enough to daily drive unless your time has low value.

        GLM passes a meaningful threshold of reliability/utility that puts it in a different category for real work. Just like Opus really took off after passing a threshold with 4.5. It's the first open model to do that.

        • hnfong 5 hours ago ago

          Qwen models are super useful for those running local.

          And there are valid reasons to run local, even if performance (quality and speed) aren't best.

    • epolanski 10 hours ago ago

      To me DS 4 is still the most interesting due to much lower costs. Also DS 4 training isn't done yet.

      From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall but with few drawbacks:

      - on the 16 tasks, one needed several prompts to be steered back into the topic

      - its review capabilities seem much worse

      - DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, it is the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.

      That being said I spent less than 2$ of API working 4 days, which is more or less what I would've spent with Anthropic APIs for less than one task.

  • maxdo 5 hours ago ago

    So the benchmark is: two models with different harnesses produced very different results.

    The GLM game was completely broken. The Opus game was OK at first glance, but also had bugs.

    Different models with different costs produced different, imperfect results. How is that “close”? :)

    Also on costs: GLM burns more tokens on average than Opus. GPT-5.5 burns fewer, surprisingly.

  • greyman 10 hours ago ago

    >On output tokens, GLM-5.2 is less than a fifth the price of Opus.

    Opus is the most expensive model on pay-as-you-go pricing, but IMO a fair comparison should include subscription pricing as well. For example, when one has a $100 Claude Max plan and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.

    • Aozora7 10 hours ago ago

      There is, for example, OpenCode Go subscription, which for $10 a month gives you a decently generous quota of GLM-5.2, among other models.

      And z.ai themselves also have subscriptions.

      • sourcecodeplz 9 hours ago ago

        to be exact, it gives you USD 60 of usage of open models.

    • KronisLV 9 hours ago ago

      > For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.

      https://z.ai/subscribe

      I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.

      Edit: seems like Anthropic Pro + GLM Pro (Yearly) would let me almost halve my costs of Anthropic Max 5x. Only concerns are about GLM 5.2 not having vision support and also being kinda slower and also not being as good as Opus.

      • CuriouslyC 7 hours ago ago

        I'm considering shifting to the OpenAI $20 plan + GLM. OAI has the best computer use, vision support and the best programming intelligence of any model short of Mythos/Fable, and the quota is a lot more generous than the Anthropic $20 plan.

    • jameswhitford 10 hours ago ago

      Yes this is true. This test was run on a $20 pro Claude subscription. I would definitely love to try use both models on the highest plans for a whole month and compare the two, great format for a future head-to-head comparison.

    • buster 10 hours ago ago

      Is it fair when the one is heavily subsidized and the other one is not?

      I think it's most fair to compare the plain token pricing that is used by everyone.

      • esperent 10 hours ago ago

        > Is it fair when the one is heavily subsidized

        As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.

      • usef- 10 hours ago ago

        Z.ai is also believed to be "subsidised". Its parent company is running at a large loss right now.

        Anthropic have claimed they expect their first profitable quarter this year -- they may have bigger margins on their raw API than you realise.

        • stavros 7 hours ago ago

          We're all sure they have big margins in their raw API, it's the subscription we're claiming is subsidised.

          • usef- 7 hours ago ago

            Oh I know. But people often point to the API usage cost as an indicator of the magnitude of subsidisation, or to say that the big labs are far less efficient than cheap competitors.

            I'm saying that this is not necessarily the case. They do a lot of optimisation and don't have the same price pressure to lower margins. They may not be losing as much on subscriptions as people think.

            • stavros 7 hours ago ago

              Oh hm, I've never seen this. API prices have always been exorbitant, I'm sure they're making good margins on that. Let's hope they aren't losing as much on subscriptions, because I'm not ready for everything to be API costs.

    • lithiumii 10 hours ago ago

      GLM has subscription plans too.

  • doe88 4 hours ago ago

    To me one shot prompting is as relevant as Strava's KOM is for cycling, i'm more interested in a good cycling performance after a 3 hours ride than a straight up 30 min record effort.

  • bornfreddy 2 hours ago ago

    I know that running this locally is prohibitively expensive (for now), but what kind of cost would I be looking at if I wanted to rent the hardware and run the model by myself?

  • stavarotti 5 hours ago ago

    This style of comparison is decent at showing capability, but it doesn't really show me what I truly want - a sounding board and implementer with senior engineer-level execution. When I look back at all the teams that I've been part of, the best outcomes came from white-boarding (sometimes in the metaphorical sense) with one or two people, at times arguing, then finally compromising on a plan. Instead of synthetic benchmarks that try to be objective, I wonder if there's a way to test this, or maybe I'm opining on a way of working that will soon be gone?

  • zkmon 10 hours ago ago

    Cost difference matters most as cost optimization is the whole point of AI. Time difference (30 min vs 1 hr) is not a deal-breaker. The small precision gap on the first iteration does not matter for 99% of the work that happens in the real world.

    • jameswhitford 9 hours ago ago

      Yes I 100% agree. Time-taken can be improved (with harnesses, subagent workflows etc.) and varies based on task.

  • TurdF3rguson 10 hours ago ago

    Pretty clearly it's beating Opus at [web dev](https://www.gptbased.com/) - on price, on score.. I mean what else is there?

    • myaccountonhn 8 hours ago ago

      Article states it's not multimodal. I guess for webdev that means you can't take a screenshot to indicate errors etc.

    • jofzar 10 hours ago ago

      I hate to be that guy, but real privacy policy on training data/it being hosted somewhere where I'm not worried about secrets being stored/leaked.

      • HPsquared 9 hours ago ago

        Open weights win on that front surely?

        • jofzar 9 hours ago ago

          Assuming I have 20k to run my own version of GLM?

          • mcintyre1994 9 hours ago ago

            I guess the idea is that you probably can, or will be able to, find a host that you trust at least as much as you trust Anthropic.

      • Havoc 9 hours ago ago

        Realistically you’d need to rotate secrets anyway once it moves from dev to production regardless of model provider

      • CuriouslyC 7 hours ago ago

        2016 me would agree, but 2026 me looks at Trump and Dario, and at China, sees basically no ethical difference (or possibly even an ethical deficit for America) and considers that perhaps it's better to go with the option that isn't trying to hoodwink me with bullshit platitudes and flag waving while doing whatever they want in actuality.

      • dkersten 9 hours ago ago

        It's on other providers, like Together.ai

    • trick-or-treat 10 hours ago ago

      Latency? Just saying there's other things to consider.

  • CuriouslyC 8 hours ago ago

    You should repeat this experiment but with progressively more detail in the initial prompt. Claude's secret sauce is taking weakly specified prompts and making passable things from them, but as the degrees of freedom in the prompt go down Claude starts to disobey while other models close in on the intent.

    • jameswhitford 7 hours ago ago

      That is a great suggestion that I am definitely going to look into, thanks!

      • Babooz 7 hours ago ago

        Nice comparison, but perhaps a more informative one would be to keep the harness the same and use Claude Code for both models. In your comparison, the differences could be due to many harness design decisions.

  • thedreammachine 7 hours ago ago

    I was surprised today by how much better GLM-5.2 was than GPT-5.5 at aesthetic/UI work. I'll keep my Claude/Codex setup via Conductor for now, but this model got me to set up OpenCode, download their desktop app and do most of my work there today.

  • somesortofthing an hour ago ago

    this comparison seems kind of pointless if one model has vision and the other doesn't. obviously a model that can see is going to beat a blind model at making a video game.

  • xrd 5 hours ago ago

    How are people running this locally? I just checked llama.cpp and it appears unsloth has a version but it hacks a bunch of things to make it work and isn't optimal.

    https://github.com/ggml-org/llama.cpp/issues/24730

  • Aozora7 10 hours ago ago

    I used GLM 5.0/5.1/5.2 for some projects, and for me, the area in which they lag behind frontier models the most are user interfaces. They get really close to Opus when it comes to pure algorithms, but when I need something like web application or a mobile app that looks and works well, they are very noticeably worse than even Sonnet.

  • Muaz_Ashraf 2 hours ago ago

    There is no comparison between GLM 5.2 and Opus. First, to run GLM 5.2 you need a lot of hardware, and that hardware costs money, so you might as well buy the Opus subscription and enjoy.

    • nickv 2 hours ago ago

      Or...

      you go to OpenRouter and pay

      $0.98 / $3.08 per 1M for GLM 5.2 vs $5 / $25 per 1M for Opus.

      GLM 5.2 gives you OPTIONALITY so you can run it locally, but you can still just pay somebody for it.
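
To see what those rates mean for a concrete job, here is a minimal sketch. It assumes the two figures in each pair are input and output prices per 1M tokens (the usual listing convention, but an assumption here), and the workload of 2M input plus 0.5M output tokens is made up purely for illustration. The dictionary keys and function name are hypothetical.

```python
# (input price, output price) in USD per 1M tokens, as quoted in the comment above.
PRICES_PER_1M = {
    "glm-5.2": (0.98, 3.08),
    "opus": (5.00, 25.00),
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost in USD for one workload at the listed rates."""
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
glm_cost = api_cost("glm-5.2", 2_000_000, 500_000)
opus_cost = api_cost("opus", 2_000_000, 500_000)

print(f"GLM 5.2: ${glm_cost:.2f}")
print(f"Opus:    ${opus_cost:.2f}")
print(f"Opus / GLM: {opus_cost / glm_cost:.1f}x")
```

The ratio shifts with the input/output mix, and it only holds if both models use a similar number of tokens for the same task, which other comments in this thread dispute.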

  • samsin 5 hours ago ago

    My understanding was that n-shot prompting just referred to the number of examples included in a prompt, not the number of prompts to achieve the desired result.

    "Build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library" would be a zero-shot prompt.
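
A small sketch of that distinction, under the definition given above: the "shot" count is the number of worked examples embedded in the prompt, not the number of prompts sent. The helper name and the example pairs are hypothetical.

```python
from typing import List, Optional, Tuple


def build_prompt(task: str, examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Build an n-shot prompt, where n is the number of (input, output) examples."""
    parts = []
    for i, (example_input, example_output) in enumerate(examples or [], start=1):
        parts.append(f"Example {i}\nInput: {example_input}\nOutput: {example_output}\n")
    parts.append(f"Task: {task}")
    return "\n".join(parts)


# Zero-shot: only the instruction, no examples.
zero_shot = build_prompt("Build a 3D platformer game from scratch, in raw WebGL.")

# Two-shot: the same kind of instruction preceded by two worked examples.
two_shot = build_prompt(
    "Translate 'good night' to French.",
    examples=[("good morning", "bonjour"), ("thank you", "merci")],
)

print(zero_shot)
print("---")
print(two_shot)
```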

  • hmokiguess 6 hours ago ago

    I signed up for GLM 5.2 yesterday to try it out because Anthropic kept throwing 529 Overloaded

    I like it, but the lite plan ate 22% usage of my 5h reset window in a single session after 2 prompts on xhigh of GLM 5.2 [1m]

    Result was satisfactory, I think stuff is decent, I'm happy to use either, wish there was a combined subscription plan where I could get both

    • w4yai 5 hours ago ago

      I may be biased and interested as I'm going to give you an affiliate link, but really honestly Synthetic LLM provider is a beast! They provide perfect GLM5.2, awesome token/s, TTFT and price.

      Coupled with a local Headroom (https://github.com/headroomlabs-ai/headroom) you'll be able to use a LOT without hitting your 5h window :)

      Definitely the best $ value for me considering the reasonable performance of GLM5.2.

      They provide a rolling window quota, so you're never really out of quota contrary to other providers, you can adjust day to day.

      Check it out if interested : https://synthetic.new/?referral=kwjqga9QYoUgpZV

      ---

      Docs & all models : https://dev.synthetic.new/docs/api/models

  • leumon 10 hours ago ago

    I've seen glm 5.2 struggle writing simple compilable c code. It might be good at web, but its world knowledge is limited due to the small model size, making its use quite limited in my opinion.

  • poulpy123 8 hours ago ago

    What would be the best way to use these open source models for a price similar to what I could pay for the cheapest plan with Claude and OpenAI?

    I would like to give them a try, but I certainly don't have the money to get a system able to run them, and I don't really want to pay more than for the state of the art.

  • jofzar 10 hours ago ago

    Great article,

    My only, I guess feedback, is that it's not really clear about the price.

    Would the 21.92 be the API pricing I guess?

    Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)

  • orloffm 5 hours ago ago

    > 256 GiB unified RAM.

    So, 8000$, plus it's unavailable. 3 years of Codex/Opus subscription.

    > API prices

    Which are irrelevant for 200$ Codex/Opus plans that are times cheaper.

  • maccard 5 hours ago ago

    Are these games supposed to be a good example of quality output? If this is the product, I don't really want to play _either_ of them.

  • lordforever 4 hours ago ago

    i think inference is the thing, that also fast inference, so enterprises can just host their own and run, ig vercel do it, many more would. but as it thinks toooo much idk how fast we can make it.

  • wejick 10 hours ago ago

    Totally agree with the general assessment. The biggest problem with Z.ai models for a long time has not been quality, but the inference speed and general capacity availability. Hopefully with this recent hype, there will be more providers on OpenRouter for 5.2.

  • IronWolve 10 hours ago ago

    Having issues with coding a render for good looking realistic smoke coming off burning incense, opus 4.8 & gpt-5.5 both have code issues, glm-5.2 did it. Amazing.

    The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.

    • stavros 7 hours ago ago

      You mean Fable?

  • close2 10 hours ago ago

    I wonder how many tokens and how much time were used for the verifying part. Maybe GLM 5.2 instantly found the "solution" to read the screen pixel by pixel, but it could also have been a major token and time consumer.

    • jameswhitford 9 hours ago ago

      Hi, author here. I cannot give an exact number for how many tokens the verification step took, but the verification GLM 5.2 ran was very stupid and definitely a waste of time. It read the pixel color data to try and verify the scene rendered properly, which is really bad. Opus opened the game in a Playwright browser and took screenshots to verify the actual image, which helped a lot.

      Pro tip: You could use a multi-modal model to verify images as a subagent spawned by GLM 5.2, to get around this issue.

      • 59nadir an hour ago ago

        That's a dumb way to do it, it should just write the frame buffer to a PNG instead of taking screenshots. I guess you can't take the dumb web developer ways out of these models at the end of the day.

    • trick-or-treat 10 hours ago ago

      I could be wrong but I believe this is a non-vision model. Please weigh in to correct me bc I would love to be wrong

      • jameswhitford 9 hours ago ago

        GLM 5.2 is text only, not multi modal. And Opus is multi modal.

  • Havoc 9 hours ago ago

    Still on a z.ai legacy plan and their 50% discount for switching to standard plans tips the balance for me. So I guess I’ll reevaluate round about beginning 2028…

  • _pdp_ 10 hours ago ago

    In the name of science we crafted an autonomous AI agent that builds games on a loop. It is based on GLM 5.2.

    I am not sure where this is going to lead us but it is fun to watch.

  • jdright 5 hours ago ago

    not apples to apples. comparing official vs. pi.dev+openrouter and having slow times is more an openrouter issue. try comparing using official z.ai.

  • efficax 5 hours ago ago

    glm-5.2 is very good if you have a good harness and workflow to use it with. in fact, i'd call it good enough if you are a software engineer who knows what you want. it writes the code. i'm wondering if i need anthropic's models at all at this point, or openai. and surely in a year we won't need them at all. Opus 4.5+ was the turning point for me, and now these open models are just as good. i don't get how you IPO these companies when their only winning product is coding agents and the competition is just as good for 1/4 the price.

  • wolttam 6 hours ago ago

    Would have run it with GLM on max/xhigh effort. Just for fun.

  • cwoolfe 5 hours ago ago

    The model is 756B parameters, open weights.

  • linzhangrun 10 hours ago ago

    Just that their Coding Plan is too hard to get. I've been trying to grab it for a week and still can't get it

  • speedgoose 10 hours ago ago

    While this is interesting, one single sample with different coding harness is not very scientific.

    • jameswhitford 9 hours ago ago

      Yes I agree 100%. My next guide would do better to use identical harnesses.

  • aykutseker 6 hours ago ago

    The text only part is the catch for me.

    If it builds a UI and can't look at it, it's asking ls whether the app looks right.

  • taosu_la 8 hours ago ago

    I'm really feeling a bit tired of these models. I feel that since opus 4.1, I haven't been able to clearly feel the intelligence improvement from the model upgrades (except for gpt 5.5 and opus4.6 being able to speak like a human)

  • yanhangyhy 9 hours ago ago

    i think GLM 5.2 is not cheap and it's not easy to get the coding plan... so even if it's on the Opus level... still not attractive.

    • Mashimo 7 hours ago ago

      I used GLM (4.7, 5 and 5.1) both through their coding plan and now through OpenCode zen. Both were painless to get.

    • LUmBULtERA 7 hours ago ago

      How is it not easy to get the coding plan?

    • jiri 9 hours ago ago

      opencode zen go?

  • elzbardico 4 hours ago ago

    If you are a real engineer and use the LLM as a pair programmer instead of delegating everything to it, even GLM 4.7 was already good enough to help you with a lot of work.

    I used it with Cerebras inference at a time when it had a good coding plan at a low price, and delivered tons of stuff using it.

  • NicoJuicy 5 hours ago ago

    For those praising GLM 5.2, can anyone confirm?

    Tried with 2 harnesses and it seems bad + slow

  • _s_a_m_ 8 hours ago ago

    GLM is the most overrated LLM. I tried it and it is not good.

  • ukprogrammer 9 hours ago ago

    GLM cannot use vision like Opus can. This is not a useful comparison.

    • jameswhitford 9 hours ago ago

      I see your point. Just the fact that one model does have vision and one does not might be an interesting point of comparison, however.

  • sourcecodeplz 10 hours ago ago

    What is this fashion of testing models by giving them one shot projects? Especially games. this is so stupid

  • msejas 10 hours ago ago

    Seeing the results, I don't see how they are even comparable. Opus is clearly far superior in most aspects: smoothness, design, functionality etc.

    At the end of the day, the time earned is more important than the cost for big players.

    The ability to spawn 10 claude agents and rush a project to outcompete someone is more important for big businesses imo. Also, the small details that GLM missed would take significantly more time to iron out, considering it already took double the time.

    I do hope other (open weight) models catch up, but to act like they are anywhere close for me is a bit disingenuous.

  • camillomiller 8 hours ago ago

    I swear, if I read forms with “genuinely” one more time I am gonna scream. FUCK LLM WRITING

  • joshrw 10 hours ago ago

    Chinese models optimize for benchmarks and do poorly in real-world tasks

    • epolanski 10 hours ago ago

      Not my experience at all, I have written about comparing DS4 vs Opus 4.8 on 16 real life work tasks on multiple posts.

      Also, every single lab does RL on benchmarks, which is why Opus 4.6 was the last truly great assistant, after it, all models tend to drift into implementation asap.