Apologies for the tangent, but isn't this like saying "sliced tomato featuring BLT sandwich"?
No. It's analyzing the CPU core, but it names the device under test because that may have performance implications: cooling and possibly manufacturer-configured power limits differ between devices.
I get what they're doing. I've never seen that phrasing before.
This is awesome. I'm going to have to spend some time digging through this.
I got one of these GB10s, but the ASUS variety. So far fairly happy with it. Most days I don't remember I'm on ARM.
It's pretty performant and snappy, about the same speed as my other mini PC, a Minisforum UM 790 Pro with a Ryzen 9 7940HS, but with double the number of cores and many times the RAM.
Have you tried running any local LLMs via llama.cpp? I am curious whether that much RAM is effectively usable as unified memory for larger models. I wonder if the memory bandwidth is sufficient to get decent performance on something like a 70B model, or if it bottlenecks.
You can run a large-ish MoE model like gpt-oss-120b at good speeds; it's snappy enough even with big context.
But large and dense at the same time is a bit slow.
Running a local LLM will cost a load of money for something much slower than the API providers, though.
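A rough way to see why "large and dense" is slow while MoE is not: during token generation, each new token has to stream the active weights from memory, so memory bandwidth divided by bytes read per token gives a ceiling on decode speed. The sketch below uses assumed figures, not measurements from this thread: 273 GB/s as the commonly quoted GB10 memory bandwidth, about 54 GB for a 72B Q5_K_M GGUF, and a guess of about 3 GB touched per token for a roughly 5B-active MoE at about 4 bits per weight. Real speeds land below these ceilings.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if every token must stream the active weights once."""
    return bandwidth_gb_s / gb_read_per_token


# All three numbers are assumptions for illustration, not benchmark results.
BANDWIDTH_GB_S = 273.0    # commonly quoted LPDDR5X bandwidth for GB10
DENSE_72B_Q5_GB = 54.0    # dense model: the whole file is read for every token
MOE_ACTIVE_GB = 3.0       # MoE: only the active experts are read (rough guess)

dense = decode_ceiling_tok_s(BANDWIDTH_GB_S, DENSE_72B_Q5_GB)
moe = decode_ceiling_tok_s(BANDWIDTH_GB_S, MOE_ACTIVE_GB)

print(f"dense 72B ceiling: {dense:.1f} tok/s")
print(f"MoE ceiling:       {moe:.1f} tok/s")
```

Under these assumptions the dense model tops out at single-digit tok/s no matter how fast the compute is, while the MoE ceiling is more than an order of magnitude higher.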
Makes sense regarding the MoE performance. I am not sure the cost argument holds up for high volume workloads though. If you are running batch jobs 24/7 the hardware pays for itself in a few months compared to API opex. It really just comes down to utilization.
Do you have specific t/s numbers for those dense models? I'm curious just how severe the memory bandwidth bottleneck gets in practice.
I'm not sure I agree on the cost aspect though. For high-volume production workloads the API bills scale linearly and can get painful fast. If you can amortize the hardware over a year and keep the data local for privacy, the math often works out in favor of self-hosting.
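The amortization argument can be written down as a small break-even calculation. This is only a sketch: the function names are made up for illustration, and every input in the example (token volume, API price, hardware price, running cost) is a hypothetical placeholder, not a quote from any provider or vendor.

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """API spend for one month at a flat per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def breakeven_months(hardware_usd: float, api_usd_per_month: float,
                     local_running_usd_per_month: float) -> float:
    """Months until avoided API spend has paid for the hardware."""
    saving = api_usd_per_month - local_running_usd_per_month
    if saving <= 0:
        return float("inf")  # self-hosting never pays off at this volume
    return hardware_usd / saving


# Hypothetical placeholder inputs; substitute your own numbers.
api = monthly_api_cost(tokens_per_month=2_000_000_000, usd_per_million_tokens=0.60)
months = breakeven_months(hardware_usd=4000, api_usd_per_month=api,
                          local_running_usd_per_month=40)
print(f"API: ${api:.0f}/month, break-even after {months:.1f} months")
```

The result is dominated by utilization: at low token volumes the saving goes to zero or negative and the break-even never arrives, and the calculation also assumes the local box can actually sustain the chosen volume.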
For Qwen2.5-72B-Instruct-Q5_K_M at 32k context, I fed it a 26k-token file (a truncated novel) and asked it to summarize. It processed the input at 224 tok/s and generated output at 3 tok/s. Not really good enough for interactive use without frustration. Not just from watching it reply, but also the long wait for it to actually read the book.
On the same hardware, with gpt-oss-120b at 128k context, I fed it a longer version of the input (a whole novel, 97k tok). It processed the input at 1650 tok/s and generated output at 27 tok/s. Just fast enough IMO.
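To put those rates in wall-clock terms, here is the arithmetic as a tiny script. The prompt sizes and tok/s figures are the ones quoted above; the 500-token reply length is an assumption for illustration only.

```python
def wait_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to process or generate `tokens` at a steady rate."""
    return tokens / tok_per_s


REPLY_TOKENS = 500  # assumed summary length, not from the thread

runs = {
    "Qwen2.5-72B Q5_K_M": {"prompt": 26_000, "prefill": 224, "decode": 3},
    "gpt-oss-120b": {"prompt": 97_000, "prefill": 1650, "decode": 27},
}

for name, r in runs.items():
    read = wait_seconds(r["prompt"], r["prefill"])
    reply = wait_seconds(REPLY_TOKENS, r["decode"])
    print(f"{name}: {read:.0f} s to read the prompt, {reply:.0f} s for a {REPLY_TOKENS}-token reply")
```

That works out to roughly two minutes of reading plus nearly three minutes of generation for the dense model, versus about a minute of reading and under 20 seconds of generation for the MoE model on a prompt almost four times as long.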
I bought it primarily so I could learn some of the toolchain for fine-tuning / training stuff, not so much for running inference, which it's only "ok" at.
If I were primarily interested in that, I would have probably bought one of the cheaper Strix Halo machines.
It's also just a decent non-Mac ARM64 workstation with large quantities of RAM, which in 2026 is a bit of a unicorn.
I would love to see a comparison between the A725 and X925 cores.
Not quite in the same depth, but there are some more general benchmarks across all cores and latencies here: https://github.com/geerlingguy/sbc-reviews/issues/92
Wow, this repo and the ai-benchmarks repo are the ones I wanted https://github.com/geerlingguy/ai-benchmarks/issues/34
Thank you for doing these. Earned a star and a watch from me on both! Minor sponsor donation as gratitude.
Would be sick to have an RSS feed for your data releases.
Will consider that at some point; a lot of the time is just spent getting the data, heh.
Note to myself: Cortex-X925 was originally called X5. The current generation, which would have been the X930, is now called C1-Ultra and is used in the MediaTek Dimensity 9500.