6 comments

  • survirtual 14 hours ago

    I have been running nearly the same experiments for the same reasons. It has been a lot of tweaking and patching because I am on a Strix Halo system with AMD gfx1151, but the LLM component is working nicely.

    The evaluation I use is having it one-shot a 3D scene using three.js. After everything, the output was comparable to Claude Sonnet (Opus actually does worse on this task, strangely).

    For my local setup, I have settled on the Qwen3.5 family after testing most of the usable local models. Here are the models I use, ranked by intelligence:

    1. Qwen3.5-122B Q4_K_M: ~25 t/s
    2. Qwen3.5-27B Q4_K_M: ~18 t/s
    3. Qwen3.5-35B Q4_K_M: ~50 t/s

    The 122B model is actually very, very smart. But I have found that token speed matters more, and 35B is smart enough. At 50 t/s I can get a lot more done, and I am going to build a mechanism for it to escalate to a smarter model if needed.

    GPT-OSS119B failed my evals.

    MistralSmall4 is too buggy to use (I believe it is too new; the templating is messed up, and agentic use has too many issues). That said, I evaluated it directly via copy and paste, and the results were not comparable to Qwen's. It is very, very fast, though.

    I am running a patched build of llama.cpp to get these results. A few changes are needed to increase prompt processing speed (about a 30% gain) and to be able to use ROCm. It took a lot of setup, but my flake on NixOS is stable now.
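For anyone who wants a starting point: a stock (unpatched) ROCm build of llama.cpp looks roughly like this. The flags follow the upstream HIP build instructions, and gfx1151 is my assumption for the Strix Halo target; my speed patches are separate and not shown here.

```shell
# Stock HIP/ROCm build of llama.cpp targeting Strix Halo (gfx1151).
# Flags per upstream build docs; prompt-processing patches not included.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```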

    Long story short, I can confirm a lot of what was shared in his blog.

    *This was written at 4:30am on my phone right after waking up; apologies for typos.

    • a96 7 hours ago

      Are you running NixOS on the Strix as a main OS? Would be interesting to read how that's working and what your config is like. Any containers involved or hand building outside of Nix?

      • survirtual 5 hours ago

        Yes, NixOS on the Strix Halo system.

        I am using nixos-unstable pkgs. The published rocmPackages were recently updated and now include kernels for gfx1151, which I was surprised to find out this morning. Before, you had to set a flag to use the older kernels because the native ones were not available.
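        The usual generic workaround when ROCm has no kernels for your arch is the HSA override environment variable, which loads kernels built for a nearby supported arch. I am not certain this is the exact flag meant above, so treat it as a sketch; the version value depends on your GPU.

```shell
# Generic ROCm workaround when no native kernels exist for your GPU arch:
# spoof a nearby supported arch so its kernels get loaded.
# 11.0.0 maps to gfx1100 (RDNA3); adjust for your hardware.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
./build/bin/llama-cli -m model.gguf -p "hello"   # example invocation
```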

        My flake modules with the ROCm config are a bit messy, but maybe I can find time to throw a repo up with them. They contain all the necessary packages, flags, boot options, llama.cpp patches, and some hacks to get PyTorch working smoothly with ROCm.

        What this means: no, I do not need containers anymore for replicating working configs. The flake configures the system with the appropriate libs. I build llama.cpp with patches for ROCm, and I can run ComfyUI for generative processes. I have generative image and video working as of today, and next I will get generative 3D modeling working. I'd like to have Trellis2 running this week.

        None of this would be possible for me without NixOS, as an aside. It keeps track of configuration for me, so I no longer have scattered shell scripts and unpredictable deps. I used to build Zed from source with scripts, for example. Now it is a module with patches. llama.cpp is the same. Very clean, and it requires no working memory: when something needs adjusting, I just go refresh myself with the module in one place.

        • a96 2 hours ago

          Sounds very neat. Thanks for the explanation. I'm not too familiar with Nix, though I've done a few installs. This sounds like a very interesting setup.

  • jononor 2 days ago

    Have been playing with Qwen3.5 35B. It runs nicely on an RTX 5060 Ti, though I would have liked a bit higher throughput (a 5080/5090 would do). It is seemingly close-but-not-quite-there for code generation / agentic coding. So I am actually quite hopeful that in a few years, using local LLM models will be quite feasible.

    • survirtual 14 hours ago

      An AMD Ryzen AI Max Pro 396 will get 50 t/s with Qwen3.5 35B.

      In addition, these local models are very, very, very sensitive to the chat template used. Make sure it is correct. I was using the wrong template: it would still answer, but it felt like it had a brain worm.

      The sampling parameters must also be set to what is recommended; otherwise the models go off the rails.
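      As a concrete sketch of what I mean, using standard llama.cpp server flags: the template file and sampling values below are placeholders — pull the real ones from the model card of whatever model you run.

```shell
# Serve with an explicit, known-good chat template instead of whatever
# the GGUF happens to embed, plus the model card's recommended sampling.
# Model filename, template path, and values here are examples only.
llama-server \
  -m Qwen3.5-35B-Q4_K_M.gguf \
  --jinja --chat-template-file qwen.jinja \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0
```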

      I get great results now, after messing with it for a while. I prefer the 35B model because I enjoy how fast the tokens appear at 50 t/s, but at around 20-25 t/s the 122B model is also completely usable. And that one is very smart.