179 comments

  • Aurornis 6 hours ago ago

    If you're new to this: All of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, and then they always disappoint in actual use.

    I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.

    They are impressive, but they are not performing at Sonnet 4.5 level in my experience.

    I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.

    That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.

    • kir-gadjello 5 hours ago ago

      Respectfully, from my experience and a few billion tokens consumed, some open source models really are strong and useful. Specifically StepFun-3.5-flash https://github.com/stepfun-ai/Step-3.5-Flash

      I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.

      I have no relation to stepfun, and I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.

      • copperx an hour ago ago

        Are you using stepfun mostly because it's free, or is it better than other models at some things?

      • aappleby 5 hours ago ago

        What are you running that model on?

        • kir-gadjello 5 hours ago ago

          I just use OpenRouter; it's free for now. But I would pay $30-100 to use it 24/7.

          • aappleby 4 hours ago ago

            Ah, I thought you meant you were running it locally.

        • FuckButtons 3 hours ago ago

          A 3 bit quant will run on a 128gb MacBook Pro, it works pretty well.

          • nl 3 hours ago ago

            A 3 bit quant is quite a lot weaker than the OpenRouter version the OP is using.

    • dimgl 3 hours ago ago

      I'm using Qwen 3.5 27b on my 4090 and let me tell you. This is the first time I am seriously blown away by coding performance on a local model. They are almost always unusable. Not this time though...

    • wolvoleo 5 hours ago ago

      All models are doing that. Not only the open source ones.

      I bet the cloud ones are doing it a lot more because they can also affect the runtime side which the open source ones can't.

      • red75prime 44 minutes ago ago

        I wouldn't mind them benchmaxing my queries.

    • chaboud 5 hours ago ago

      "When a measure becomes a target, it ceases to be a good measure."

      Goodhart's law shows up with people, in system design, in processor design, in education...

      Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.

    • rudhdb773b 4 hours ago ago

      Are there any up-to-date offline/private agentic coding benchmark leaderboards?

      If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over optimization.

      Edit: These look decent and generally match my expectations:

      https://www.apex-testing.org/

    • noosphr 5 hours ago ago

      It's not just the open source ones.

      The only benchmarks worth anything are dynamic ones which can be scaled up.

    • crystal_revenge 4 hours ago ago

      > they always disappoint in actual use.

      I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.

      Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.

      At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.

    • amelius 6 hours ago ago

      Are you saying that the benchmarks are flawed?

      And could quantization maybe partially explain the worse than expected results?

      • TrainedMonkey 6 hours ago ago

        No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up. So you add specific problems to training data... The result is that the model is smarter, but the benchmarks overstate the progress. Sure there are problem sets designed to be secret, but keeping secrets is hard given the fraction of planetary resources we are dedicating to making the AI numbers go up.

        I have two of my own comments to add to that. The first is that there's a problem-alignment issue at play: the benchmarks are mostly self-contained problems with well-defined solutions and specific prompt language, while human tasks are open-ended, with messy prompts and a lot of steerage. The second is that it would be interesting to test older models on brand new benchmarks to see how they compare.

        • Aurornis 5 hours ago ago

          > No, what he is saying is that benchmarks are static and there is tremendous reputational and financial pressure to make benchmark number go up.

          That's a much better way to say it than I did.

          These models are known for being open weights, but they're still products that Alibaba Cloud is trying to sell. They have product managers and PR and marketing people under pressure to get people using them.

          This Venture Beat article is basically a PR piece for the models and Alibaba Cloud hosting. The pricing table is right in the article.

          It's cool that they release the models for us to use, but don't think they're operating entirely altruistically. They're playing a business game just like everyone else.

        • amelius 5 hours ago ago

          There should be a way to turn the questions we ask LLMs into benchmarks.

          That way, we can have a benchmark that is always up to date.

      • Aurornis 6 hours ago ago

        The models outperform on the benchmarks relative to general tasks.

        The benchmarks are public. They're guaranteed to be in the training sets by now. So the benchmarks are no longer an indicator of general performance because the specific tasks have been seen before.

        > And could quantization maybe explain the worse than expected results?

        You can use the models through various providers on OpenRouter cheaply without quantization.

      • girvo 6 hours ago ago

        Flawed? Possibly, but I think it's more that any kind of benchmark then becomes a target, and is inherently going to be a "lossy" signal as to the model's actual ability in practice.

        Quantisation doesn't help, but even running full fat versions of these models through various cloud providers, they still don't match Sonnet in actual agentic coding uses: at least in my experience.

    • eurekin 5 hours ago ago

      Very good point. I'm playing with them too and came to the same conclusion.

    • ekianjo 2 hours ago ago

      > That said, they are impressive for open source models.

      there is nothing open "source" about them. They are open weights, that's all.

    • jackblemming 5 hours ago ago

      Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.

  • mstaoru 7 hours ago ago

    I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge.

    So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec; Opus is very detailed and comes in at about 2-3 minutes.

    Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.

    I wonder what I'm doing wrong... How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.

    • lm28469 7 hours ago ago

      > Wonder what am I doing wrong?

      You're comparing 100B-parameter open models running on a consumer laptop vs. private models with, at the very least, 1T parameters running on racks of bleeding-edge professional GPUs.

      Local agentic coding is closer to "shit me the boilerplate for an android app", not "deep research questions", especially on your machine.

      • vlovich123 7 hours ago ago

        The hardware difference explains runtime performance differences, not task performance.

        Speculation is that the frontier models are all below 200B parameters, but a 2x size difference wouldn't fully explain task performance differences.

        • 827a 3 hours ago ago

          He's running a 35B parameter model. Frontier models are well over a trillion parameters at this point. Parameters = smarts. There are 1T+ open source models (e.g. GLM5), and they're actually getting to the point of being comparable with the closed source models; but you cannot remotely run them on any hardware available to us.

          Core speed/count and memory bandwidth determines your performance. Memory size determines your model size which determines your smarts. Broadly speaking.

        • nl 3 hours ago ago

          > Speculation is that the frontier models are all below 200B parameters

          Some versions of some of the models are around that size, which you might hit for example with the ChatGPT auto-router.

          But the frontier models are all over 1T parameters. Source: interviews with people who have left one of the big three labs, now work at the Chinese labs, and talk about how to train 1T+ models.

        • NamlchakKhandro 4 hours ago ago

          > The hardware difference explains runtime performance differences, not task performance.

          Yes it does.

        • ses1984 6 hours ago ago

          Who would have thought ai labs with billions upon billions of r&d budget would have better models than a free alternative.

      • delaminator 6 hours ago ago

        Looks at the headline: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

        • lm28469 6 hours ago ago

          Yes and Devstral 2 24b q4 is supposed to be 90% as good but it can't even reliably write to a file on my machine.

          There are the benchmarks, the promises, and what everybody can try at home

          • 8note 6 hours ago ago

            maybe a harness problem?

    • Paddyz 3 hours ago ago

      The 35b-a3b model is misleading in its naming - it's a MoE with only 3B active parameters per forward pass. You're essentially running a 3B-class model for inference quality while paying the memory cost of loading 35B parameters. That's why it feels so much worse than Opus or Gemini, which are likely 10-100x larger in effective compute per token.

      For your M3 Max 128G setup, try Qwen3.5-122B-A10B with a 4-bit quantization instead (should fit in ~50-60GB). 10B active params is a massive step up from 3B and you'll actually see the quality difference people are talking about. MLX versions specifically optimized for Apple Silicon will also give you noticeably better tok/s than running through ollama.

      The general rule I've settled on: MoE models with <8B active params are great for structured tasks (reformatting, classification, simple completions) but fall apart on anything requiring deep reasoning or domain knowledge. For your research question use case, you want either a dense 27B+ model or a MoE with 10B+ active params.
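      The arithmetic behind this is simple enough to sketch. Numbers below are the nominal parameter counts from above used as rough assumptions; real runs add KV cache and activation overhead on top:

```python
# Back-of-envelope for MoE models: memory cost scales with TOTAL
# parameters, but per-token compute/bandwidth scales with ACTIVE ones.
# Figures are illustrative assumptions, not official model specs.
def weight_gb(params: float, bits: int) -> float:
    """Approximate weight footprint in GB at a given quantization width."""
    return params * bits / 8 / 1e9

qwen_35b_total  = weight_gb(35e9, 4)    # ~17.5 GB to load into memory...
qwen_35b_active = weight_gb(3e9, 4)     # ...but only ~1.5 GB touched per token
qwen_122b_total = weight_gb(122e9, 4)   # ~61 GB, in line with "fits in ~50-60GB"
```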

    • aspenmartin 7 hours ago ago

      Well Opus and Gemini are probably running on multiple H200 equivalents, maybe hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest-generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really means "45 minutes and making my laptop sound like it is going to take off at any moment") are further behind still. Like they said, you're getting Sonnet 4.5 performance, which is two generations ago; speaking from experience, Opus 4.6 is night and day compared to Sonnet 4.5.

      • zozbot234 7 hours ago ago

        > Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.

        But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground that may not give you the absolute best in performance, but will be found broadly acceptable and still be quite viable for a home lab.

        • aspenmartin 6 hours ago ago

          Batching helps with efficiency but you can’t fit opus into anything less than hundreds of thousands of dollars in equipment

          Local models are more than a useful middle ground; they are essential and will never go away. I was just addressing the OP's question about why he observed the difference he did. One is an API call to the world's most advanced compute infrastructure and the other is running on a $500 CPU.

          Lots of uses for small, medium, and large models; they all have important places!

    • xtn 2 hours ago ago

      I think knowledge of frontier research certainly scales with the number of parameters. Also, US labs can pay more money to have researchers provide training data in these frontier research areas.

      On the other hand, if open source models and MacBooks really could be as powerful as those SOTA models from Google, etc., then the stock prices of many companies would already have collapsed.

    • wolvoleo 5 hours ago ago

      Well first of all you're running a long intense task on a thermally constrained machine. Your MacBook Pro is optimised for portability and battery life, not max performance under load. And apple's obsession with thinness overrules thermal performance for them. Short peaks will be ok but a 45 minute task will thoroughly saturate the cooling system.

      Even on servers this can happen. At work we have a 2U-sized server with two 250W-class GPUs. And I found that by pinning the case fans at 100% I can get 30% more performance out of GPU tasks, which translates to several days faster for our use case. It does mean I can literally hear the fans screaming in the hallway outside the equipment room, but ok lol. Who cares. But a laptop just can't compare.

      Something with a desktop GPU, or even better something with HBM3, would run much better. Local models get slow when you use a ton of context, and the memory bandwidth of a MacBook Pro, while better than a PC's, is still not amazing.

      And yeah the heaviest tasks are not great on local models. I tend to run the low hanging fruit locally and the stuff where I really need the best in the cloud. I don't agree local models are on par, however I don't think they really need to be for a lot of tasks.
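      A rough sketch of why bandwidth is the ceiling for decode speed (the bandwidth and parameter numbers below are illustrative assumptions, not measured specs):

```python
# Decode tokens/sec is roughly memory bandwidth divided by the bytes
# read per token (approximately the active-parameter bytes for a MoE).
def decode_tok_per_s(bandwidth_gb_s: float, active_params: float, bits: int) -> float:
    bytes_per_token = active_params * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# e.g. ~400 GB/s laptop-class unified memory, 3e9 active params at 4-bit:
laptop = decode_tok_per_s(400, 3e9, 4)    # a theoretical ceiling, ~267 tok/s
hbm    = decode_tok_per_s(3000, 3e9, 4)   # HBM-class bandwidth: ~2000 tok/s
```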

      • pamcake 4 hours ago ago

        To your point, one can get a great performance boost by propping the laptop onto a roost-like stand in front of a large fan. Nothing like a cooling system actually built for sustained load but still.

    • meatmanek 2 hours ago ago

      I've seen reports of qwen3.5-35b-a3b spending a ton of time reasoning if the context window is nearly empty-- supposedly it reasons less if you provide a long system prompt or some file contents, like if you use it in a coding agent.

      I'm too GPU-poor to run it, but r/LocalLLaMa is full of people using it.

    • __mharrison__ 6 hours ago ago

      Were you using mlx-lm? I've had good performance with that on Macs. (Sadly, the lead developer just left Apple.)

      Admittedly, I haven't tried these models on my Mac, but I have on my DGX Spark, and they ran fine. I didn't see the slowdown you're mentioning.

    • zozbot234 7 hours ago ago

      Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.

      • stavros 7 hours ago ago

        I can never see the point, though. Performance isn't anywhere near Opus, and even that gets confused following instructions or making tool calls in demanding scenarios. Open weights models are just light years behind.

        I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.

        • vunderba 6 hours ago ago

          I can't speak to using local models as agentic coding assistants, but I have a headless 128GB RAM machine serving llama.cpp with a number of local models that I use on a daily basis.

          - Qwen3-VL picks up new images in a NAS, auto captions and adds the text descriptions as a hidden EXIF layer into the image, which is used for fast search and organization in conjunction with a Qdrant vector database.

          - Gemma3:27b is used for personal translation work (mostly English and Chinese).

          - Llama3.1 spins up for sentiment analysis on text.
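            For the curious, the captioning call is just an OpenAI-compatible chat request with the image inlined as base64 (llama.cpp's llama-server exposes /v1/chat/completions). A minimal sketch of the payload builder; the model name and prompt here are placeholders, not my exact setup:

```python
import base64

def build_caption_request(image_bytes: bytes, model: str = "qwen3-vl") -> dict:
    """Build an OpenAI-compatible chat payload asking a VL model to
    caption an image supplied as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```

POST that dict as JSON to the server's /v1/chat/completions endpoint and the reply text becomes the caption to embed in EXIF and index in Qdrant.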

          • stavros 5 hours ago ago

            Ah yeah, self-contained tasks like these are ideal, true. I'm more using it for coding, or for running a personal assistant, or for doing research, where open weights models aren't as strong yet.

            • vunderba 5 hours ago ago

              Understood. Research would make me especially leery; I’d be afraid of losing any potential gains as I'd feel compelled to always go and validate its claims (though I suppose you could mitigate it a little bit with search engine tooling like Kagi's MCP system).

        • andoando 7 hours ago ago

          They're great for some product use cases where you dont need frontier models.

          • stavros 7 hours ago ago

            Yeah, for sure, I just don't have many of those. For example, the only use I have for Haiku is for summarizing webpages, or Sonnet for coding something after Opus produces a very detailed plan.

            Maybe I should try local models for home automation, Qwen must be great at that.

        • lm28469 7 hours ago ago

          They're like 6 months away on most benchmarks, and people already claimed coding was solved 6 months ago, so which is it? The current version is the baseline that solves everything, but as soon as the new version is out it becomes utter trash and barely usable.

          • zozbot234 7 hours ago ago

            That's very large models at full precision, though. Stuff that will crawl even on a decent homelab, despite being largely MoE-based and even quantization-aware, which reduces the number and size of active parameters.

          • stavros 7 hours ago ago

            That's just a straw man. Each frontier model version is better than the previous one, and I use it for harder and harder things, so I have very little use for a version that's six months behind. Maybe for simple scripts they're great, but for a personal assistant bot, even Opus 4.6 isn't as good as I'd like.

      • satvikpendem 6 hours ago ago

        I can take a laptop on the train.

      • wat10000 6 hours ago ago

        I have a laptop already, so that's what I'm going to use.

    • notreallya 7 hours ago ago

      Sonnet 4.5 level isn't Opus 4.6 level, simple as

    • rienko 6 hours ago ago

      use a larger model like Qwen3.5-122B-A10B quantized to 4/5/6 bits depending on how much context you desire, MLX versions if you want best tok/s on Mac HW.

      if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess is the results are much better than 35b-a3b.

    • culi 7 hours ago ago

      Well you can't run Gemini Pro or Opus 4.6 locally so are you comparing a locally run model to cloud platforms?

    • furyofantares 7 hours ago ago

      Can you try asking Sonnet 4.5 the same question, since that is what this model is claimed to be on par with?

    • gigatexal 2 hours ago ago

      I have the exact same hardware. Was going to do the same thing with the 122B model … I’ll just keep paying Anthropic; the models are just that good. Trying out Gemini too. But won’t pay OpenAI as they’re going to be helping Pete Hegseth develop autonomous killing machines.

    • andxor 6 hours ago ago

      You're not doing anything wrong. The Chinese models are not as good as advertised. Surprise surprise!

    • CamperBob2 6 hours ago ago

      Try the 27B dense model. It will likely do much better than the 35B MoE with only 3B active parameters.

      Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.

  • Paddyz 3 hours ago ago

    The benchmark gaming discussion is valid but I think it obscures a more interesting trend: the gap between open and closed models is closing fastest in specific, constrained domains rather than general intelligence.

    I run a multi-model setup for production work - routing different tasks to different models based on what actually works, not benchmark scores. What I've found over the past 6 months:

    - Structured output (JSON extraction, classification, reformatting): open models at 30B+ are genuinely competitive. Qwen and Gemma both handle this reliably.

    - Code completion and inline suggestions: small MoE models are surprisingly good here because the task is inherently constrained by surrounding context.

    - Complex multi-step reasoning, long-horizon planning, or tasks requiring deep domain knowledge: frontier closed models still win by a significant margin.

    The practical takeaway isn't "open models match Sonnet" or "benchmarks are lies" - it's that the cost-performance frontier has shifted enough that a thoughtful routing strategy (cheap/local model for 70% of tasks, frontier API for the 30% that actually needs it) is now viable in a way it wasn't even 6 months ago. That's the real story here, not whether Qwen exactly matches Sonnet on some leaderboard.
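    In practice the routing layer can be almost embarrassingly simple. A toy sketch; the task names and the local/frontier split are illustrative, not my production config:

```python
# Hypothetical task router: send cheap, constrained tasks to a local
# open model and reserve the frontier API for hard reasoning work.
LOCAL_TASKS = {"json_extraction", "classification", "reformatting",
               "code_completion"}
FRONTIER_TASKS = {"multi_step_reasoning", "long_horizon_planning",
                  "domain_research"}

def route(task_type: str) -> str:
    """Return which backend a task should go to."""
    if task_type in LOCAL_TASKS:
        return "local"       # e.g. a self-hosted 30B-class open model
    if task_type in FRONTIER_TASKS:
        return "frontier"    # e.g. a closed API model
    return "frontier"        # default to quality when unsure
```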

  • jackcosgrove 4 hours ago ago

    I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous.

    I then discovered what quantization is by reading a blog post about binary quantization. That seemed too good to be true, so I asked Claude to design an analysis assessing the fidelity of 1, 2, 4, and 8 bit quantization. Claude did a good job, downloading 10,000 embeddings from a public source and computing a similarity score and correlation coefficient for each level of quantization against the float32 source of truth. 1 and 2 bit quantizations were about 90% similar, and 8 bit quantization was lossless given the precision Claude used to display the results. 4 bit was interesting, as it was 99% similar (almost lossless) yet half the size of 8 bit. It seemed like the sweet spot.

    This analysis took me all of an hour so I thought, "That's cool but is it real?" It's gratifying to see that 4 bit quantization is actually being used by professionals in this field.
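    For anyone who wants a miniature version of that analysis without Claude, here is roughly what it boils down to. This assumes uniform symmetric quantization on synthetic embeddings; the real run used 10,000 real embeddings, so the exact percentages will differ:

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantize-dequantize at the given bit width."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 levels per side at 4-bit
    scale = np.abs(x).max() / levels
    return np.round(x / scale).clip(-levels, levels) * scale

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb = rng.standard_normal(4096).astype(np.float32)  # stand-in embedding
sims = {bits: cosine(emb, quantize(emb, bits)) for bits in (2, 4, 8)}
# Fidelity rises with bit width; 4-bit already lands in the high 0.9s here.
```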

    • deepsquirrelnet 4 hours ago ago

      4-bit quantization on newer nvidia hardware is being supported in training as well these days. I believe the gpt-oss models were trained natively in MXFP4, which is a 4-bit floating-point format, E2M1 (2-bit exponent, 1-bit mantissa, 1 sign bit).

      It doesn't seem terribly common yet though. I think it is challenging to keep it stable.

      [1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...

      [2] https://www.opencompute.org/documents/ocp-microscaling-forma...

    • silisili 3 hours ago ago

      Mind sharing any resources? I've been thinking about trying to understand them better myself.

    • tymscar 4 hours ago ago

      Thats cool.

      I do wonder where that extra acuity you get from 1% more shows up in practice. I hate how I have basically no way to intuitively tell that because of how much of a black box the system is

      • doctorpangloss 4 hours ago ago

        Well, why would Claude know any of this? Obviously it's the wrong criteria. If you have your own dataset to benchmark, you'd create your own calibration for quantization with it. Scientifically, you wouldn't really believe in the whole process of gradient descent if you didn't think tiny differences in these values matter. So...

        • tymscar 4 hours ago ago

          I think you might be replying to a different person or misunderstanding what I said, but you are right: just as I don’t have an intuition for where the acuity shows up in the corpus, I don’t think Claude does either.

  • alexpotato 8 hours ago ago

    I recently wrote a guide on getting:

    - llama.cpp

    - OpenCode

    - Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)

    working on an M1 MacBook Pro (using brew).

    It was a bit finicky to get all of the pieces working together, so hopefully this can be used with these newer models.

    https://gist.github.com/alexpotato/5b76989c24593962898294038...

    • freeone3000 7 hours ago ago

      You can also run LM Studio, get a model installed with one search and one click, and expose it through an OpenAI-compatible API.

    • kpw94 7 hours ago ago

      On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)

      On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.

      I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?

      • zargon 4 hours ago ago

        Quant choice depends on your vram, use case, need for speed, etc. For coding I would not go below Q4_K_M (though for Q4, unsloth XL or ik_llama IQ quants are usually better at the same size). Preferably Q5 or even Q6.

    • robby_w_g 7 hours ago ago

      Does your MBP have 32 GB of ram? I’m waiting on a local model that can run decently on 16 GB

    • copperx 8 hours ago ago

      How fast does it run on your M1?

  • jjcm 5 hours ago ago

    Getting better, but definitely not there yet, nor near Sonnet 4.5 performance.

    What these open models are great for are for narrow, constrained domains, with good input/output examples. I typically use them for things like prompt expansion, sentiment analysis, reformatting or re-arranging flow of code.

    What I found they have trouble with is going from ambiguous description -> solved problem. Qwen 3.5 is certainly the best of the OSS models I've found (beating out GPT 120b OSS which was the previous king), and it's just starting to demonstrate true intelligence in unbound situations, but it isn't quite there yet. I have a RTX 6000 pro, so Qwen 3.5 is free for me to run, but I tend to default to Composer 1.5 if I want to be cheap.

    The trend however is super encouraging. I bought my vid card with the full expectation that we'll have a locally running GPT 5.2 equiv by EoY, and I think we're on track.

  • solarkraft 8 hours ago ago

    Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.

    Up until relatively recently, while people had already long been making these claims, they came with the asterisk of "oh, but you can't practically use more than a few K tokens of context".

    • derekp7 7 hours ago ago

      "Create a single page web app scientific RPN calculator"

      Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that produces a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or, most commonly, produce a keypad that looks like a Picasso painting (i.e., the 10-key pad portion has buttons missing or mapped all over the keypad area).

      This seems like such a simple test, but I even just tried it in ChatGPT (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonnet 4.6 also got it correct, but that is the only other model I've used that gets this question right.
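      For reference, the core stack behavior the weaker models keep fumbling fits in a dozen lines (a toy evaluator, not the generated app):

```python
# Minimal RPN semantics: push numbers; operators pop the top two
# entries, apply, and push the result back.
def rpn_eval(expr: str) -> float:
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    stack = []
    for token in expr.split():
        if token in ops:
            b, a = stack.pop(), stack.pop()  # top of stack is the right operand
            stack.append(ops[token](a, b))
        else:
            stack.append(float(token))
    return stack[-1]

result = rpn_eval("3 4 + 2 *")  # (3 + 4) * 2 = 14.0
```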

      • rienko 6 hours ago ago

        We tend to find Qwen3-Coder-Next better at coding, at least on anecdotal examples from our codebases. It's somewhat better at tool calling; maybe the current templates for Qwen3.5 don't yet enjoy as "mature" support as Qwen3 on vLLM. I can say that in my team MiniMax2.5 is currently the favorite.

      • airstrike 4 hours ago ago

        is your prompt literally 1-sentence?

        if so, a better approach would be to ask it to first plan that entire task and give it some specific guidance

        then once it has the plan, ask it to execute it, preferably by letting it call other subagents that take care of different phases of the implementation while the main loop just merges those worktrees back

        it's how you should be using claude code too, btw

        • nl 3 hours ago ago

          Claude Sonnet can easily one-shot that without specifically asking for plan first.

    • tempest_ 8 hours ago ago

      Qwen3-Coder-30B-A3B-Instruct is good, I think, for inline IDE integration or operating on small functions or library code, but I don't think you will get too far with the one-shot feature implementation that people are currently doing with Claude or whatever.

      • andy_ppp 7 hours ago ago

        I have been adding a one-shot feature to a codebase with ChatGPT 5.3 Codex in Cursor and it worked out of the box, but then I realised everything it had done was super weird and it didn't work under a load of edge cases. I've tried being super clear about how to fix it, but the model is lost. This was not a complex feature at all, so hopefully I'm employed for a few more years yet.

      • rubyn00bie 6 hours ago ago

        I could be doing something wrong, but I have not had any success with one-shot feature implementations for any of the current models. There are always weird quirks, undesired behaviors, bad practices, or just egregiously broken implementations. A week or so ago, I had instructed Claude to do something at compile time and it instead burned a phenomenal amount of tokens before yeeting the most absurd and convoluted runtime implementation, which didn't even work. At work I use it (or Codex) for specific tasks, delegating specific steps of the feature implementation.

        The more I use the cloud-based frontier models, the more virtue I find in using local, open source/weights models, because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out shortly on my 5090. If I read the article correctly, it sounds like even the quantized versions have a "million"[1] token context window.

        And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.

        [1] I'm dubious it won't shite itself at even 50% of that. But even 250k would be amazing for a local model when I "only" have 32GB of VRAM.

    • __mharrison__ 6 hours ago ago

      I used the 35b model to create a polars implementation of PCA (no sklearn or imports other than math and polars). In less than 10 minutes I had the code. This is impressive to me considering how poorly all models were with polars until very recently. (They always hallucinated pandas code.)
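
      For reference, the numerically heavy part of a polars-only PCA is just centering, a covariance matrix, and eigenvector extraction; everything else is column bookkeeping. A rough stdlib-only sketch of that core using power iteration with deflation (plain Python lists stand in for polars columns; all names here are mine, not the generated code):

```python
import math
import random

def pca_power_iteration(rows, n_components=1, iters=200, seed=0):
    """Top principal components via power iteration with deflation.
    rows: list of equal-length lists (samples x features)."""
    n, d = len(rows), len(rows[0])
    # center each column
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    # sample covariance matrix (d x d)
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(c * c for c in w)) or 1.0
            v = [c / norm for c in w]
        # eigenvalue from the Rayleigh quotient
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                  for a in range(d))
        components.append((lam, v))
        # deflate so the next pass finds the next eigenvector
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return components
```

      Swapping the list comprehensions for polars expressions is mostly mechanical; the iteration logic stays the same.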

  • lubitelpospat an hour ago ago

    All right guys, this is your time - what consumer device do you use for local LLM inference? GPU poor answers only

  • xmddmx 6 hours ago ago

    Ollama users: there are notable bugs with ollama and Qwen3.5 so don't let your first impression be the last.

    Theory is that some of the model parameters aren't set properly and this encourages endless looping behavior when run under ollama:

    https://github.com/ollama/ollama/issues?q=is%3Aissue%20state... (a bunch of them)

  • nu11ptr 7 hours ago ago

    Thinking about getting a new MBP M5 Max 128GB (assuming they are released next week). I know "future proofing" at this stage is near impossible, but for writing Rust code locally (likely using Qwen 3.5 for now on MLX), the AIs have convinced me this is probably my best choice for immediate with some level of longevity, while retaining portability (not strictly needed, but nice to have). Alternatively was considering RTX options or a mac studio, but was leaning towards apple for the unified memory. What does HN think?

    • nl 3 hours ago ago

      Strix Halo machines are a good option too if you are at all price sensitive. AMD (with all the downsides of that for AI work) but people are getting decent performance from them.

      Also Nvidia Spark.

    • cmenge 5 hours ago ago

      I've been mulling the same, but decided against (for now)

      Using Claude Code Max 20 so ROI would be maybe 2+ years.

      CC gives me unlimited coding in 4-6 windows in parallel. Unsure if any model would beat (or even match) that, in terms of both quality and speed.

      I wouldn't gamble on that now. With a subscription, I can change any time. With the machine, you risk that some great insane model comes out but it needs 138GB, and then you'll pay for both.

    • shell0x 3 hours ago ago

      I have a Mac Studio with 128GB and a M4 Max and I'd recommend it. The power usage is also pretty good, but you may not care if you live somewhere where energy is cheap.

    • pamcake 4 hours ago ago

      > What does HN think?

      Thermals. Your workloads will be throttled hard once it inevitably runs hot. See comments elsewhere in thread about why LLMs on laptops like MBP is underwhelming. The same chips in even a studio form factor would perform much better.

  • solarkraft 8 hours ago ago

    What are the recommended 4 bit quants for the 35B model? I don’t see official ones: https://huggingface.co/models?other=base_model:quantized:Qwe...

    Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

  • sunkeeh 7 hours ago ago

    Qwen3.5-122B-A10B BF16 GGUF = 224GB. The "80Gb VRAM" mentioned here will barely fit Q4_K_S (70GB), which will NOT perform as shown on benchmarks.

    Quite misleading, really.

    • CamperBob2 6 hours ago ago

      The larger 3.5 quants are actually pretty close to the full-blown 397B model's performance, at least looking at the numbers. Qwen 3.5 seems more tolerant of quantization than most.

  • shell0x 3 hours ago ago

    Can't wait to try that out locally. Keen to reduce my dependence on American products and services.

  • oscord 6 hours ago ago

    The SWE chart on the front page is missing Claude; an interesting way to present your data. Mix and match at will. Grown-up people showing public-school-level sneakiness. That fact alone disqualifies your LLM. Business/marketing leaders are usually brighter than average developers... so there.

  • syntaxing 5 hours ago ago

    A big part that a lot of local users forget is inference is hard. Maybe you have the wrong temperature. Maybe you have the wrong min P. Maybe you have the wrong template. Maybe the implementation in llama cpp has a bug. Maybe Q4 or even Q8 just won’t compare to BF16. Reality is, there’s so many knobs to LLM inferencing and any can make the experience worse. It’s not always the model’s fault.
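
    To make the knob interplay concrete, here is a rough sketch of what two of those knobs (temperature and min-p) actually do to the raw logits before a token is drawn; the defaults here are illustrative, not Qwen's recommended settings:

```python
import math
import random

def sample_next_token(logits, temperature=0.7, min_p=0.05, rng=random):
    """Toy sampler: temperature scaling followed by min-p filtering."""
    # temperature > 1 flattens the distribution, < 1 sharpens it
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # min-p: drop any token whose probability is below
    # min_p times the probability of the most likely token
    cutoff = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    # renormalize over the survivors and draw one
    r = rng.random() * sum(p for _, p in kept)
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

    Get either knob wrong (say, a temperature tuned for a different model family) and the same weights can swing from coherent to looping, which is why "the model is bad" is often really "the serving config is bad".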

  • mark_l_watson 8 hours ago ago

    The new 35b model is great. That said, it has slight incompatibilities with Claude Code. It is very good for tool use.

    • johnnyApplePRNG 8 hours ago ago

      Claude code is designed for anthropic models. Try it with opencode!

      • mark_l_watson 4 hours ago ago

        I will, right now.

        EDIT: opencode was a bit slow with qwen3.5:35b using Ollama. Faster/nicer to use with Liquid lfm2:latest

        • johnnyApplePRNG 3 hours ago ago

          Try llama.cpp - it usually excels with these MoE models imho.

      • kristianpaul 8 hours ago ago

        Or Pi

    • stavros 7 hours ago ago

      Have you tried the 122B one?

  • erelong 8 hours ago ago

    What kind of hardware does HN recommend or like to run these models?

    • suprjami 8 hours ago ago

      The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window.

      If you want to spend twice as much for more speed, get a 3090/4090/5090.

      If you want long context, get two of them.

      If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.

      • chr15m 2 hours ago ago

        Thanks this is a great summary of the tradeoffs!

      • barrkel 7 hours ago ago

        Rtx 6000 pro Blackwell, not ada, for 96GB.

    • dajonker 8 hours ago ago

      Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.

      • cyberax 5 hours ago ago

        I have a pair of Radeon AI PRO R9700s with 32GB each, and so far they have been a pleasure to use. Drivers work out of the box, and they are completely quiet when unused. They are capped at 300W power, so even at 100% utilization they are not too loud.

        I was thinking about adding after-market liquid cooling for them, but they're fine without it.

    • andsoitis 8 hours ago ago

      For fast inference, you’d be hard pressed to beat an Nvidia RTX 5090 GPU.

      Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...

      • laweijfmvo 7 hours ago ago

        I never would have guessed that in 2026, data centers would be measured in Watts and desktop PCs measured in liters.

    • throwdbaaway 2 hours ago ago

      For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.
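
      The arithmetic behind "fits on a 3090" is easy to sanity-check: quantized weight bytes plus KV-cache bytes. A rough sketch (the layer/head numbers below are illustrative placeholders, not the real Qwen3.5-27B config):

```python
def vram_estimate_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, ctx_len, kv_bytes_per_value=1):
    """Back-of-envelope VRAM: quantized weights + KV cache.
    Ignores activations and runtime overhead (budget another 1-2 GB)."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    # 2x for the K and V tensors, one entry per layer per cached position
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_value * ctx_len
    return (weight_bytes + kv_bytes) / 1024**3

# Illustrative: a 27B model at ~4.5 bpw, 32k context, Q8 (1-byte) KV cache
print(round(vram_estimate_gb(27, 4.5, 48, 8, 128, 32768), 1))  # roughly 17 GB
```

      With grouped-query attention and a Q8 KV cache, the cache stays small even at long context, which is why full context on a 24GB card is plausible.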

    • zozbot234 8 hours ago ago

      It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?

    • xienze 8 hours ago ago

      It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). The prompt processing is something like 1K tokens/second and generates around 100 tokens/second. Plenty fast for agentic use via Opencode.

      • rahimnathwani 8 hours ago ago

        There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom

        I'm curious which one you're using.

      • msuniverse2026 8 hours ago ago

        I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

        • pja 7 hours ago ago

          > I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

          Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulkan.

          Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.

        • wirybeige 7 hours ago ago

          The vulkan backend for llama.cpp isn't that far behind rocm for pp and tp speeds

    • elorant 8 hours ago ago

      Macs or a strix halo. Unless you want to go lower than 8-bit quantization where any GPU with 24GBs of VRAM would probably run it.

    • CamperBob2 8 hours ago ago

      I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.

      I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.

      Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.

      • MarsIronPI 7 hours ago ago

        I've had good experience with GLM-4.7 and GLM-5.0. How would you compare them with Qwen 3.5? (If you have any experience with them.)

        • CamperBob2 6 hours ago ago

          No experience with 5 and not much with 4.7, but they both have quite a few advocates over on /r/localllama.

          Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.

  • car 6 hours ago ago

    Can it do FizzBuzz in Brainfuck? Thus far all local models have tripped over their feet or looped out.

  • kristianpaul 7 hours ago ago

    https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b “ Qwen3.5-27B For this guide we will be utilizing Dynamic 4-bit which works great on a 18GB RAM”

    • kristianp 7 hours ago ago

      18GB was an odd 3-channel one-off for the M3 Pro. I guess there's a bunch of them out there, but how slow would the 27B be on it, given that it's a dense model rather than an MoE?

  • karmasimida 4 hours ago ago

    Raw parameter scale is POWER; you can't get the performance of a much larger model out of a small one.

  • gunalx 7 hours ago ago

    Qwen 3.5 is really decent, outside of some weird failures on some scaffolding with seemingly differently trained tools.

    Strong vision and reasoning performance, and the 35B-A3B model runs pretty OK on a 16GB GPU with some CPU layers.

  • aliljet 8 hours ago ago

    Is this actually true? I want to see actual evals that match this up with Sonnet 4.5.

    • magicalhippo 8 hours ago ago

      The Qwen3.5 27B model did almost the same as Sonnet 4.5 in this[1] reasoning benchmark, results here[2].

      Obviously there's more to a model than that but it's a data point.

      [1]: https://github.com/fairydreaming/lineage-bench

      [2]: https://github.com/fairydreaming/lineage-bench-results/tree/...

    • lostmsu 8 hours ago ago

      Not exactly, but pretty close: https://artificialanalysis.ai/models/capabilities/coding?mod...

      Somewhere between Haiku 4.5 and Sonnet 4.5

      • CharlesW 8 hours ago ago

        > Somewhere between Haiku 4.5 and Sonnet 4.5

        That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹

        ¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."

        • pityJuke 8 hours ago ago

          Haiku 4.5 is a reasoning model. [0]

          [0]: https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c3630...

          > Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.

          > As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.

          • CharlesW 7 hours ago ago

            Sure, marketing people gonna market. But Haiku's 'extended thinking' mode is very different than the reasoning capabilities of Sonnet or Opus.

            I would absolutely believe mar-ticles that Qwen has achieved Haiku 4.5 'extended thinking' levels of coding prowess.

            • DetroitThrow 7 hours ago ago

              >Sure, marketing people gonna market.

              Oh HN never change.

              • CharlesW 7 hours ago ago

                Not sure what this means, but as a marketing person myself, here's what happened: One day, an Anthropican involved in the Haiku 4.5 launch shrugged, weighed the odds of getting spanked for equating "extended thinking" with "reasoning", and then used Claude to generate copy declaring that. It's not rocket surgery!

                • DetroitThrow 4 hours ago ago

                    It's mainly that people on here, regardless of profession, speak incorrectly but confidently about things that could be easily verified with a Google search or basic familiarity with the thing in question.

                  Haiku 4.5 is a reasoning model, regardless of whatever hallucination you read. Being a hybrid reasoning model means that, depending on the complexity of the question and whether you explicitly enable reasoning (this is "extended thinking" in the API and other interfaces) when making a request to the LLM, it will emit reasoning tokens separately prior to the tokens used in the main response.

                  I love your theory that there was some mix up on their side because they were lazy and it was just some marketing dude being quirky with the technical language.

                  • throwdbaaway 2 hours ago ago

                    We are all reasonable people here, and while you are (mostly) correct, I think we can all agree that Anthropic documentation sucks. If I have to infer from the doc:

                    * Haiku 4.5 by default doesn't think, i.e. it has a default thinking budget of 0.

                    * By setting a non-zero thinking budget, Haiku 4.5 can think. My guess is that Claude Code may set this differently for different tasks, e.g. thinking for Explore, no thinking for Compact.

                    * This hybrid thinking is different from the adaptive thinking introduced in Opus 4.6, which when enabled, can automatically adjust the thinking level based on task difficulty.

      • pinum 8 hours ago ago

        Looks much closer to Haiku than Sonnet.

        Maybe "Qwen3.5 122B offers Haiku 4.5 performance on local computers" would be a more realistic and defensible claim.

        • lostmsu 3 hours ago ago

          I won't disagree. The guideline prescribes keeping the original title as much as possible, and I failed to find a more neutral source.

  • hsaliak 4 hours ago ago

    No it does not. None of these models have the “depth” that the frontier models have across a variety of conversations, tasks and situations. Working with them is like playing snakes and ladders, you never know when it’s going to do something crazy and set you back.

  • piyh 3 hours ago ago

    Unsloth is working magic with the qwen quants

  • jbellis 6 hours ago ago

    this is bullshit with a kernel of truth.

    none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.

    BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)

    benchmarks are complete, publishing on Monday.

    • throwdbaaway 3 hours ago ago

      I would say 27B matches with Sonnet 4.0, while 397B A17B matches with Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting 262144 context length at good speed with modest hardware is huge for local inference.

      Will check your updated ranking on Monday.

    • dimgl 3 hours ago ago

      You mean 35B A3B? If this is shit, this is some of the best shit out I've seen yet. Never in a million years did I think I'd have an LLM running locally, actually writing code on my behalf. Accurately too.

  • renewiltord 2 hours ago ago

    In practice I have not seen this. Sonnet is incredible performance. No open model is close. Hosted open models are so much worse that I end up spending more because of inferior intelligence.

  • pstuart 3 hours ago ago

    One highly annoying facet of the hardware is that AMD's support for the NPU under Linux is currently non-existent, which abandons 50 of the stated 126 TOPS of AI capability. They seem to think that Windows support is good enough. Grrrrrr.

  • kristianpaul 8 hours ago ago

    They work great with kagi and pi

  • PunchyHamster 7 hours ago ago

    I asked it to recite "potato" 100 times because I wanted to benchmark CPU vs GPU speed. It's on line 150 of its planning. It has recited the requested thing 4 times already and started drafting the 5th response.

    ...yeah I doubt it

    • lachiflippi 7 hours ago ago

      Qwen3.5 pretty much requires a long system prompt; otherwise it goes into a weird planning mode where it reasons for minutes about what to do and double- and triple-checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but they are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word "potato" 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" on 100 numbered lines, using the smallest (and therefore dumbest) quant, unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:

      "User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.

      I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."

      • PunchyHamster 7 hours ago ago

        good to know, thanks. I just ran ollama with qwen3.5:27b. Currently it's stuck on picking format

            Let's write.
            Wait, I'll write the response.
            Wait, I'll check if I should use a table.
            No, text is fine.
            Okay.
            Let's write.
            Wait, I'll write the response.
            Wait, I'll check if I should use a bullet list.
            No, just lines.
            Okay.
            Let's write.
            Wait, I'll write the response.
            Wait, I'll check if I should use a numbered list.
            No, lines are fine.
            Okay.
            Let's write.
            Wait, I'll write the response.
            Wait, I'll check if I should use a code block.
            Yes.
            Okay.
            Let's write.
            Wait, I'll write the response.
            Wait, I'll check if I should use a pre block.
            Code block is better.
        ... (for next 100 lines)
        • lachiflippi 6 hours ago ago

          Yeah, it tends to get stuck in loops like that a lot with everything set to default. I wonder if they distilled Gemini at some point, I've seen that get stuck in a similar "I will now do [thing]. I am preparing to do [thing]. I will do it." failure mode as well a couple of times.

        • xmddmx 5 hours ago ago

          See my other note [1] about bugs in Ollama with Qwen3.5.

          I just tried this (Ollama macOS 0.17.4, qwen3.5:35b-a3b-q4_K_M) on a M4 Pro, and it did fine:

          [Thought for 50.0 seconds]

          1. potato 2. potato [...] 100. potato

          In other words, it did great.

          I think 50 seconds of thinking beforehand was perhaps excessive?

          [1] https://news.ycombinator.com/item?id=47202082

        • CamperBob2 4 hours ago ago

          What quant? I just ran Repeat the word "potato" 100 times, numbered and it worked fine, taking 44 seconds at 24 tokens/second. Command line:

              llama-server ^
                --model Qwen3.5-27B-BF16-00001-of-00002.gguf ^
                --mmproj mmproj-BF16.gguf ^
                --fit on ^
                --host 127.0.0.1 ^
                --port 2080 ^
                --temp 0.8 ^
                --top-p 0.95 ^
                --top-k 20 ^
                --min-p 0.00 ^
                --presence_penalty 1.5 ^
                --repeat_penalty 1.1 ^
                --no-mmap ^
                --no-warmup
          
          The repeat and/or presence penalties seem to be somewhat sensitive with this model, so that might have caused the looping you saw.
          • throwdbaaway an hour ago ago

            I don't quite get the low temperature coupled with the high penalty. We get thinking loop due to low temperature, and we then counter it with high penalty. That seems backward.

            For Qwen3.5 27B, I got good result with --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.2, without penalty. It allows the model to explore (temp, top-p, top-k) without going off the rail (min-p) during reasoning. No loop so far.

            • CamperBob2 an hour ago ago

              The guidelines are a little hard to interpret. At https://huggingface.co/Qwen/Qwen3.5-27B Qwen says to use temp 0.6, pres 0.0, rep 1.0 for "thinking mode for precise coding tasks" and temp 1.0, pres 1.5, rep 1.0 for "thinking mode for general tasks." Those parameters are just swinging wildly all over the place, and I don't know if printing potato 100 times is considered to be more like a "precise coding task" or a "general task."

              When setting up the batch file for some previous tests, I decided to split the difference between 0.6 and 1.0 for temperature and use the larger recommended values for presence and repetition. For this prompt, it probably isn't a good idea to discourage repetition, I guess. But keeping the existing parameters worked well enough, so I didn't mess with them.

    • lumirth 7 hours ago ago

      well hold on now, maybe it’s onto something. do you really know what it means to “recite” “potato” “100” “times”? each of those words could be pulled apart into a dissertation-level thesis and analysis of language, history, and communication.

      either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though

      • PunchyHamster 7 hours ago ago

        It's still amusing to see how seemingly simple things like this put it into a loop. It's still going.

        > do you really know what it means to “recite” “potato” “100” “times”?

        Asking the user a question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot facts that had been checked for it and told to it before...

        • lumirth 5 hours ago ago

          I wonder how much certain models have been trained to avoid asking too many questions. I’ve had coworkers who’ll complete an entire project before asking a single additional question to management, and it has never gone well for them. Unsurprising that the same would be true for the “managing AI” era of programming.

          The thing I struggle most with, honestly, is when AI (usually GPT5.3-Codex) asks me a question and I genuinely don’t know the answer. I’m just like “well, uh… follow industry best practice, please? unless best practice is dumb, I guess. do a good. please do a good.” And then I get to find out what the answer should’ve been the hard way.

  • xenospn 9 hours ago ago

    Are there any non-Chinese open models that offer comparable performance?

    • MarsIronPI 7 hours ago ago

      I think you could look into Mistral. There's also GPT-OSS but I'm not sure how well it stacks up.

      What's your problem with Chinese LLMs?

      • xenospn 6 hours ago ago

        Nothing personally - Our customers send us highly sensitive financial documents to process. Using a foreign model to process their data (or even just for local testing) will most likely result in a u-turn.

        • MarsIronPI 5 hours ago ago

          What if you run them locally, or use a US-based provider that hosts them? IMO the provenance of the weights doesn't matter. You're right that the location of the hoster does, though.

      • icase 7 hours ago ago

        it’s not obvious to you why someone would want to avoid models created by our enemies?

        • MarsIronPI 5 hours ago ago

          No, it's not. They're just collections of numbers that can be harnessed to produce outputs. I check the outputs and if they're good I use them. If they're not, I ignore them and there's no harm done. Obviously I don't trust them to be accurate sources of information, but I don't trust American corporate LLMs much more.

        • shell0x 3 hours ago ago

          As a European, I trust China more than America. China doesn't just start bombing other countries and cause regime changes.

        • bigyabai 3 hours ago ago

          No, explain it to me. GPT-OSS is one of the most heavily-censored models on the internet, what's the point of buying local if it's crap?

    • shell0x 3 hours ago ago

      What's the problem with Chinese models? The models are already open which makes them more trustworthy than the American closed models.

      • chr15m an hour ago ago

        They are trained to respond to certain topics in a way that does not align with real world evidence. Pretty much the opposite of what you want in such a tool.

        This is trivial to test and verify yourself. Just pick any topic you think has a chance of being censored. You can do the same on American models and compare results.

    • culi 7 hours ago ago

      All the western ones are closed while all the Chinese ones are open. The only exception is the European Mistral but performance of that model is not very satisfactory. Hopefully they make some improvements soon