6 comments

  • survirtual 14 hours ago

    I have been running nearly the same experiments for the same reasons. It has been a lot of tweaking and patching because I am on a Strix Halo system with AMD gfx1151, but the LLM component is working nicely.

    The evaluation I use is having it one-shot a 3D scene using three.js. After everything, the output was comparable to Claude Sonnet (Opus actually does worse on this task, strangely).

    For my local setup, I have settled on the Qwen3.5 family after testing most of the usable local models. Here are the models I use, ranked by intelligence:

    1. Qwen3.5-122B Q4_K_M: ~25 t/s
    2. Qwen3.5-27B Q4_K_M: ~18 t/s
    3. Qwen3.5-35B Q4_K_M: ~50 t/s

    The 122B model is actually very, very smart. But I have found that token speed matters more, and 35B is smart enough. At 50 t/s I can get a lot more done, and I am going to build a mechanism for it to escalate to a smarter model if needed.

    GPT-OSS119B failed my evals.

    MistralSmall4 is too buggy to use (I believe it is too new; the templating is messed up, and agentic use has too many issues). That said, I evaluated it directly via copy and paste, and the results were not comparable to Qwen's. It is very, very fast, though.

    I am running a patched build of llama.cpp to get these results. A few changes are needed to increase prompt processing speed (about a 30% gain) and to be able to use ROCm. It took a lot of setup, but my flake on NixOS is stable now.
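For anyone who wants a starting point: a stock (unpatched) ROCm build of llama.cpp looks roughly like this. The flags follow the upstream HIP build instructions, and gfx1151 is my assumption for the Strix Halo target; my speed patches are separate and not shown here.

```shell
# Stock HIP/ROCm build of llama.cpp targeting Strix Halo (gfx1151).
# Flags per upstream build docs; prompt-processing patches not included.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```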

    Long story short, I can confirm a lot of what was shared in his blog.

    *This was written at 4:30am on my phone right after waking up; apologies for typos.

    • a96 7 hours ago

      Are you running NixOS on the Strix as a main OS? Would be interesting to read how that's working and what your config is like. Any containers involved or hand building outside of Nix?

      • survirtual 5 hours ago

        Yes, NixOS on the Strix Halo system.

        I am using nixos-unstable pkgs. The published rocmPackages were recently updated and now include kernels for gfx1151, which I was surprised to find out this morning. Before, you had to set a flag to use the older kernels because the native ones were not available.
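        The usual generic workaround when ROCm has no kernels for your arch is the HSA override environment variable, which loads kernels built for a nearby supported arch. I am not certain this is the exact flag meant above, so treat it as a sketch; the version value depends on your GPU.

```shell
# Generic ROCm workaround when no native kernels exist for your GPU arch:
# spoof a nearby supported arch so its kernels get loaded.
# 11.0.0 maps to gfx1100 (RDNA3); adjust for your hardware.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
./build/bin/llama-cli -m model.gguf -p "hello"   # example invocation
```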

        My flake modules with the ROCm config are a bit messy, but maybe I can find time to throw a repo up with them. They contain all the necessary packages, flags, boot options, llama.cpp patches, and some hacks to get PyTorch working smoothly with ROCm.

        What this means: no, I do not need containers anymore for replicating working configs. The flake configures the system with the appropriate libs. I build llama.cpp with patches for ROCm, and I can run ComfyUI for generative processes. I have generative image and video working as of today, and next I will get generative 3D modeling working. I'd like to have Trellis2 running this week.

        None of this would be possible for me without NixOS, as an aside. It keeps track of configuration for me, so I no longer have scattered shell scripts and unpredictable deps. I used to build Zed from source with scripts, for example. Now it is a module with patches. llama.cpp is the same. Very clean, and it requires no working memory: when something needs adjusting, I just go refresh myself with the module in one place.

        • a96 2 hours ago

          Sounds very neat. Thanks for the explanation. I'm not too familiar with Nix, though I've done a few installs. This sounds like a very interesting setup.

  • jononor 2 days ago

    Have been playing with Qwen3.5 35B. It runs nicely on an RTX 5060 Ti, though I would have liked a bit higher throughput (a 5080/5090 would do). It is seemingly close-but-not-quite-there for code generation / agentic coding. So I am actually quite hopeful that in a few years, using local LLM models will be quite feasible.

    • survirtual 14 hours ago

      An AMD Ryzen AI Max Pro 396 will get 50 t/s with Qwen3.5 35B.

      In addition, these local models are very, very, very sensitive to the chat template used. Make sure it is correct. I was using the wrong template: it would still answer, but it felt like it had a brain worm.

      The sampling parameters must also be set to what is recommended; otherwise the models go off the rails.
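      As a concrete sketch of what I mean, using standard llama.cpp server flags: the template file and sampling values below are placeholders — pull the real ones from the model card of whatever model you run.

```shell
# Serve with an explicit, known-good chat template instead of whatever
# the GGUF happens to embed, plus the model card's recommended sampling.
# Model filename, template path, and values here are examples only.
llama-server \
  -m Qwen3.5-35B-Q4_K_M.gguf \
  --jinja --chat-template-file qwen.jinja \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0
```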

      I get great results now, after messing with it for a while. I prefer the 35B model because I enjoy how fast the tokens appear at 50 t/s, but at around 20-25 t/s the 122B model is also completely usable. And that one is very smart.