Dell's version of the DGX Spark fixes pain points

(jeffgeerling.com)

114 points | by thomasjb 10 hours ago

60 comments

  • mmaunder 3 hours ago

    For those of you wondering if this fits your use case vs the RTX 5090 the short answer is this:

    The desktop RTX 5090 has 1792 GB/s of memory bandwidth, partly thanks to its 512-bit bus, compared to the DGX Spark with a 256-bit bus and 273 GB/s of memory bandwidth.

    The RTX 5090 has 32G of VRAM vs the 128G of “VRAM” in the DGX Spark which is really unified memory.

    Also, the RTX 5090 has 21760 CUDA cores vs 6144 in the DGX Spark (about 3.5x as many). And with the much higher bandwidth in the 5090 you have a better shot at keeping them fed. So for embarrassingly parallel workloads the 5090 crushes the Spark.

    So if you need to fit big models into VRAM and don’t care too much about speed because you are, for example, building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer.

    If you need speed and 32G of VRAM is plenty, and you don’t care about modeling network interconnections in production, then the RTX 5090 is what you want.
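
    A rough sketch of the arithmetic behind this comparison, assuming token generation is memory-bound (each new token has to stream the active weights from memory once). The bandwidth figures are the ones quoted above; the 20 GB model size is a hypothetical example, and real throughput will land below these ceilings.

```python
# Back-of-the-envelope decode ceiling for a memory-bound model:
# tokens/s <= memory bandwidth / bytes of weights read per token.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream token generation, in tokens per second."""
    return bandwidth_gb_s / weights_gb

# Bandwidth figures quoted in the comment above.
RTX_5090_GB_S = 1792.0
DGX_SPARK_GB_S = 273.0

# Hypothetical example: a dense model whose quantized weights occupy 20 GB.
WEIGHTS_GB = 20.0

for name, bandwidth in [("RTX 5090", RTX_5090_GB_S), ("DGX Spark", DGX_SPARK_GB_S)]:
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most {ceiling:.1f} tok/s")

bandwidth_ratio = RTX_5090_GB_S / DGX_SPARK_GB_S
print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
```

    The ratio only applies when the model fits in the 5090's 32G; once it does not, the comparison changes entirely.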

    • chao- 31 minutes ago

      It's also worth noting that the 128GB of "VRAM" in the GB10 is even less straightforward than just being aware that the memory is shared with the CPU cores. There are a lot of details in memory performance that differ across both the different core types and the two core clusters:

      https://chipsandcheese.com/p/inside-nvidia-gb10s-memory-subs...

  • jasoneckert 9 hours ago

    I've got the Dell version of the DGX Spark as well, and was very impressed with the build quality overall. Like Jeff Geerling noted, the fans are super quiet. And since I don't keep it powered on continuously and mainly connect to it remotely, the LED is a nice quick check for power.

    But the nicest addition Dell made in my opinion is the retro 90's UNIX workstation-style wallpaper: https://jasoneckert.github.io/myblog/grace-blackwell/

  • Tepix 9 hours ago

    You can get two Strix Halo PCs with similar specs for that $4000 price. I just hope that prompt processing speeds will continue to improve, because Strix Halo is still quite slow in that regard.

    Then there is the networking. While Strix Halo systems come with two USB4 40 Gbit/s ports, it's difficult to

    a) connect more than 3 machines with two ports each

    b) get more than 23 Gbit/s or so per connection, if you're lucky. Latency will also be in the 0.2 ms range, which leaves room for improvement.

    Something like Apple's RDMA via Thunderbolt would be great to have on Strix Halo…
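
    To see why both numbers matter, here is a minimal sketch of a latency-plus-bandwidth cost model for a single message. The 23 Gbit/s and 0.2 ms figures come from the comment above; the payload sizes are hypothetical, and the model ignores protocol overhead.

```python
def transfer_ms(payload_bytes: int, link_gbit_s: float, latency_ms: float) -> float:
    """Latency plus serialization time for one message, in milliseconds."""
    serialization_ms = payload_bytes * 8 / (link_gbit_s * 1e9) * 1e3
    return latency_ms + serialization_ms

LINK_GBIT_S = 23.0   # achieved USB4 throughput quoted above
LATENCY_MS = 0.2     # latency quoted above

# Hypothetical payloads: a small per-step exchange and a bulk copy.
small = transfer_ms(16 * 1024, LINK_GBIT_S, LATENCY_MS)
large = transfer_ms(64 * 1024 * 1024, LINK_GBIT_S, LATENCY_MS)

print(f"16 KiB: {small:.4f} ms (almost all latency)")
print(f"64 MiB: {large:.2f} ms (almost all bandwidth)")
```

    Under these assumptions, small and frequent exchanges are dominated by the 0.2 ms latency rather than by link speed, which is where an RDMA-style path would help most.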

    • coder543 8 hours ago

      As you allude to, the prompt processing speeds are a killer improvement of the Spark which even 2 Strix Halo boxes would not match.

      Prompt processing is literally 3x to 4x higher on GPT-OSS-120B once you are a little bit into your context window, and it is similarly much faster for image generation or any other AI task.

      Plus the Nvidia ecosystem, as others have mentioned.

      One discussion with benchmarks: https://www.reddit.com/r/LocalLLaMA/comments/1oonomc/comment...

      If all you care about is token generation with a tiny context window, then they are very close, but that’s basically the only time. I studied this problem extensively before deciding what to buy, and I wish Strix Halo had been the better option.

      • EnPissant an hour ago

        Then again, I have an RTX 5090 + 96GB DDR5-6000 that crushes the Spark on prompt processing of gpt-oss-120b (something like 2-3x faster), while token generation is pretty close. The cost I paid was ~$3200 for the entire computer. With the currently inflated RAM prices, it would probably be closer to the Dell.

        So while I think the Strix Halo is a mostly useless machine for any kind of AI, and I think the Spark is actually useful, I don't think pure inference is a good use case for them.

        It probably only makes sense as a dev kit for larger cloud hardware.

      • plagiarist 3 hours ago

        Could I get your thoughts on the Asus GX10 vs. spending on GPU compute? It seems like one could get a lot of total VRAM with better memory bandwidth and make PCIe the bottleneck. Especially if you already have a motherboard with spare slots.

        I'm trying to better understand the trade-offs, or whether it depends on the workload.

        • coder543 3 hours ago

          It depends entirely on what you want to do, and how much you're willing to deal with a hardware setup that requires a lot of configuration. Buying several 3090s can be powerful. Buying one or two 5090s can be awesome, from what I've heard.

        • yowlingcat 2 hours ago

          Run a model at all, run a model fast, run a model cheap. Pick 2.

          With LLM workloads, you can run some of the larger local models (at all) and you can run them cheap on the unified 128G RAM machines (Strix Halo/Spark) - for example, gpt-oss-120b. At 4-bit quantization, given it's an MoE that's natively trained at MXFP4, it'll be pretty quick. Some of the other MoEs with few active parameters will also be quick. But things will get sluggish as the active parameters increase. The best way to run these models is with a multi-GPU rig so you get speed and VRAM density at once, but that's expensive.
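
          A small sketch of why an MoE can be large in memory yet quick per token. The parameter counts (roughly 117B total, 5.1B active for gpt-oss-120b) and the 4.25 bits per weight (4-bit values plus one 8-bit scale per 32 weights) are assumptions used for illustration; real checkpoints keep some tensors at higher precision.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Assumed figures for illustration, not measurements.
TOTAL_PARAMS_B = 117.0    # all experts must be resident in memory
ACTIVE_PARAMS_B = 5.1     # parameters actually read for each token
BITS_PER_WEIGHT = 4.25    # 4-bit values + one 8-bit scale per 32-weight block

resident_gb = weights_gb(TOTAL_PARAMS_B, BITS_PER_WEIGHT)
per_token_gb = weights_gb(ACTIVE_PARAMS_B, BITS_PER_WEIGHT)

fits_in_128gb = resident_gb < 128   # unified-memory box
fits_in_32gb = resident_gb < 32     # single 32G GPU

print(f"resident weights: {resident_gb:.1f} GB")
print(f"read per token:   {per_token_gb:.2f} GB")
print(f"fits in 128 GB: {fits_in_128gb}, fits in 32 GB: {fits_in_32gb}")
```

          The resident size decides whether the model runs at all; the per-token read size is what grows as active parameters increase and things get sluggish.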

          With other workloads such as image/video generation, the unified VRAM doesn't help as much and the operations themselves intrinsically run better on the beefier GPU cores, in part because many of the models are relatively small compared to LLMs (6B-20B active parameters) but generating from those parameters is definitely GPU compute intensive. So you get infinitely more from a 3090 (maybe even a slightly lesser card) than you do from a unified memory rig.

          If you're running a mixture of LLM and image/video generation workloads, there is no easy answer. Some folks on a budget opt for a unified memory machine with an eGPU to get the best of both worlds, but I hear drivers are an issue. Some folks use Mac Studios, which, while quite fast, force you into the Metal ecosystem rather than CUDA and aren't as pleasant for dev or as a user ecosystem. Some folks build a multi-CPU server rig with a ton of vanilla RAM (this used to be popular for folks who wanted to run DeepSeek before RAM prices spiked). Some folks buy older servers with VRAM-dense but dated cards (think Pascal, Volta, etc., or AMD MI50/100). There's no free lunch with any of these options, honestly.

          If you don't have a very clear sense of something you can buy that you won't regret, it's hard to go wrong using any of the cloud GPU hyperscalers (Runpod, Modal, Northflank, etc.) or something like Fal or Replicate where you can try out the open source models and pay per request. Sure, you'll spend a bit more on unit costs, but it'll force you to figure out whether you understand your workloads well enough that the pain of having them in the cloud stings enough to make you want to buy and own the metal -- if the answer is no, even if you could afford it, you'll often be happiest just using the right cloud service!

          Ask me how I figured out all of the above the hard way...

    • Aurornis 8 hours ago

      The primary advantage of the DGX box is that it gives you access to the nVidia ecosystem. You can develop against it almost like a mini version of the big servers you're targeting.

      It's not really intended to be a great value box for running LLMs at home. Jeff Geerling talks about this in the article.

      • saagarjha 15 minutes ago

        DGX Spark has a different compute capability, so no, you really aren’t.

      • cmrdporcupine 8 hours ago

        Exactly this. I'm not sure why people keep drumming the "a Mac or Strix Halo is faster/cheaper" drum. Different market.

        If I want to do hobby / amateur AI research, fine-tune models, or learn the tooling, I'm better off with the GB10 than with AMD's or Apple's systems.

        The Strix Halo machines look nice. I'd like one of those too. Especially if/when they ever get around to getting it into a compelling laptop.

        But I ordered the ASUS Ascent GX10 machine (since it was more easily available for me than the other versions of these) because I want to play around with fine-tuning open weight models, learning tooling, etc.

        That and I like the idea of having a (non-Apple) AArch64 Linux workstation at home.

        Now if the courier would just get their shit together and actually deliver the thing...

        • mapontosevenths 8 hours ago

          I have this device, and it's exactly as you say. This is a device for AI research and development. My buddy's Mac Ultra beats it squarely for inference workloads, but for real tinkering it can't be beat.

          I've used it to fine-tune 20+ models in the last couple of weeks. Neither a Mac nor a Strix Halo even tries to compete.

        • lostmsu 4 hours ago

          I got an ASUS ROG Flow Z13 128G with a Ryzen AI 395, and I am able to train nanoGPT with little effort. That's on Windows (I haven't tried Linux), where ROCm was released only recently.

          See https://news.ycombinator.com/item?id=46052535

          • cmrdporcupine 2 hours ago

            I had my finger over the buy button for various Strix Halo machines for weeks.

            I ended up going with the Asus GB10 because if the goal is to "learn me some AI tooling" I didn't want to have to add "learn me some only recently and shallowly supported-in-Linux AMD tooling" to the mix.

            I hate NVIDIA -- the company -- but in this case it comes down to pure self-interest in that I want to add some of this stuff to my employable skill set, and NVIDIA ships the machine with all the pieces I need right in the OS distribution.

            Plus I have a bias for ARM over x86.

            Long run I'm sure I'll end up with a Strix Halo type machine in my collection at some point.

            But I also expect those machines to not drop in price, and perhaps even go up, as right now the 128GB of RAM in them is worth the price of the whole machine.

    • benreesman 3 hours ago

      NVFP4 (and to a lesser extent, MXFP8) work, in general. In terms of usable FLOPS the DGX Spark and the GMKtec EVO-X2 both lose to the 5090, but with NCCL and OpenMPI set up the DGX is still the nicest way to dev for our SBSA future. Working on that too, harder problem.

  • kristianp 6 hours ago

    I know it's just a quick test, but llama 3.1 is getting a bit old. I would have liked to see a newer model that can fit, such as gpt-oss-120b (gpt-oss-120b-mxfp4.gguf), which is about 60 GB of weights (1).

    (1) https://github.com/ggml-org/llama.cpp/discussions/15396

    • geerlingguy 3 hours ago
      • coder543 2 hours ago

        Even though big, dense models aren't fashionable anymore, they are perfect for specdec, so it can be fun to see the speedup that is possible.

        I can get about 20 tokens per second on the DGX Spark using llama-3.3-70B with no loss in quality compared to the model you were benchmarking:

            llama-server \
                --model      llama-3.3-70b-instruct-ud-q4_k_xl.gguf \
                --model-draft llama-3.2-1b-instruct-ud-q8_k_xl.gguf \
                --ctx-size      80000 \
                --ctx-size-draft 4096 \
                --draft-min 1 \
                --draft-max 8 \
                --draft-p-min 0.65 \
                -ngl 999 \
                --flash-attn on \
                --parallel 1 \
                --no-mmap \
                --jinja \
                --temp 0.0 \
                -fit off
        
        Specdec works well for code, so the prompt I used was "Write a React TypeScript demo".

            prompt eval time = 313.70 ms / 40 tokens (7.84 ms per token, 127.51 tokens per second)
            eval time = 46278.35 ms / 913 tokens (50.69 ms per token, 19.73 tokens per second)
            total time = 46592.05 ms / 953 tokens
            draft acceptance rate = 0.87616 (757 accepted / 864 generated)
        
        The draft model cannot affect the quality of the output. A good draft model makes token generation faster, and a bad one would slow things down, but the quality will be the same as the main model either way.
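
        A toy sketch of why that holds in the greedy case (`--temp 0.0`, as in the command above). The "models" here are hypothetical stand-in functions, not llama.cpp internals: the draft proposes tokens, the target checks them, and only the target's own choice is ever appended.

```python
from typing import Callable, List

# A "model" here is just a function from a token context to the next token.
Model = Callable[[List[int]], int]

def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, draft_max: int = 8) -> List[int]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes up to draft_max tokens.
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(min(draft_max, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each position (one batched pass in a real engine).
        for tok in proposal:
            expected = target(out)
            out.append(expected)      # always the target's own choice
            if expected != tok:
                break                 # mismatch: discard the rest of the draft
    return out

# Toy stand-ins. The good draft calls the target only to fake high agreement.
def target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50

def good_draft(ctx: List[int]) -> int:
    return target(ctx) if len(ctx) % 7 else (target(ctx) + 1) % 50

def bad_draft(ctx: List[int]) -> int:
    return 0

prompt = [3, 1, 4]
reference = greedy_decode(target, prompt, 40)
same_with_good = speculative_decode(target, good_draft, prompt, 40) == reference
same_with_bad = speculative_decode(target, bad_draft, prompt, 40) == reference
print(same_with_good, same_with_bad)
```

        A bad draft only causes more rejected proposals, which costs time, not correctness. With sampling enabled, real implementations use an accept/reject scheme designed to preserve the target's output distribution; this sketch does not cover that case.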
      • kristianp 3 hours ago

        Thanks!

    • eurekin 5 hours ago

      Correct, most of r/LocalLlama has moved on to next-gen MoE models. DeepSeek introduced a few good optimizations that every new model seems to use now too. Llama 4 was generally seen as a fiasco and Meta hasn't made a release since.

      • fragmede 4 hours ago

        What are some of the models people are using? (Rather than naming the ones they aren't.)

        • eurekin 3 hours ago

          GLM 4.7 is new and promising. MiniMax M2.1 is good for agents. Of course there's the Qwen3 family; the VL versions are spectacular. NVIDIA Nemotron 3 Nano excels at long context and the unsloth variant has been extended to 1M tokens.

          I thought the last one was a toy, until I tried with a full 1.2 megabyte repomix project dump. It actually works quite well for general code comprehension across the whole codebase, CI scripts included.

          gpt-oss-120b is good too, although I've yet to try it out for coding specifically.

          • magicalhippo an hour ago

            Since I'm just a pleb with a 5090, I run GPT-OSS 20B a lot, since it fits comfortably in VRAM with max context size. I find it quite decent for a lot of things, especially after I set reasoning effort to high and disabled top-k and top-p and set min-p to something like 0.05.
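
            For anyone unfamiliar with min-p, here is a minimal sketch of the filtering rule as it is commonly described: keep only tokens whose probability is at least `min_p` times that of the most likely token, then renormalize. The token probabilities are made up for illustration.

```python
from typing import Dict

def min_p_filter(probs: Dict[str, float], min_p: float = 0.05) -> Dict[str, float]:
    """Drop tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Made-up next-token distribution.
probs = {"the": 0.60, "a": 0.30, "an": 0.08, "zebra": 0.02}

# Threshold is 0.05 * 0.60 = 0.03, so only "zebra" is removed.
filtered = min_p_filter(probs, min_p=0.05)
print(filtered)
```

            Because the cutoff scales with the top token's probability, it prunes aggressively when the model is confident and leniently when it is not, which is the usual argument for using it in place of fixed top-k or top-p limits.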

            For the Qwen3-VL, I recently read that someone got significantly better results by using F16 or even F32 versions of the vision model part, while using a Q4 or similar for the text model part. In llama.cpp you can specify these separately[1]. Since the vision model part is usually quite small in comparison, this isn't as rough as it sounds. Haven't had a chance to test that yet though.

            [1]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv... (using --mmproj AFAIK)

          • nightski an hour ago

            Does GLM 4.7 run well on the Spark? I thought I read it didn’t but it wasn’t clear.

  • alecco 8 hours ago

    IMHO the DGX Spark at $4,000 is a bad deal with only 273 GB/s of bandwidth and compute capacity between a 5070 and a 5070 Ti. And with PCIe 5.0 at 64 GB/s it's not such a big difference.

    And the 2x 200 Gbit/s QSFP... why would you stack a bunch of these? Does anybody actually use them in day-to-day work/research?

    I liked the idea until the final specs came out.

    • BadBadJellyBean 6 hours ago

      I think the selling point is the 128GB of unified system memory. With that you can run some interesting models. The 5090 maxes out at 32GB. And they cost about $3000 and more at the moment.

      • alecco 5 hours ago

        1. /r/localllama unanimously doesn't like the Spark for running models

        2. and for CUDA dev it's not worth the crazy price when you can dev on a cheap RTX and then rent a GH or GB server for a couple of days if you need to adjust compatibility and scaling.

        • mi_lk 43 minutes ago

          What’s GH and GB server?

          • saagarjha 14 minutes ago

            GH200/GB200, Nvidia’s server hardware

        • BadBadJellyBean 5 hours ago

          I am not on reddit. What are they saying?

          • mapontosevenths 4 hours ago

            It isn't for "running models." Inference workloads like that are faster on a Mac Studio, if that's the goal. Apple has faster memory.

            These devices are for AI R&D. If you need to build models or fine tune them locally they're great.

            That said, I run GPT-OSS 120B on mine and it's 'fine'. I spend some time waiting on it, but the fact that I can run such a large model locally at a "reasonable" speed is still kind of impressive to me.

            It's REALLY fast for diffusion as well. If you're into image/video generation it's kind of awesome. All that compute really shines for workloads that aren't memory speed bound.

            • lostmsu 3 hours ago

              With 5070 Ti performance that's a weird choice for R&D as well. You won't be able to train models that require anywhere near 100GB of VRAM due to slow processing, and a 5070 Ti is under $1k.

              • mapontosevenths 33 minutes ago

                Yeah, that's mostly fair, but it kind of misses the point. This is a professional tool for AI R&D. Not something that strives to be the cheapest possible option for the homelab. It's fine to use them in the lab, but that's not who they built it for.

                If I wanted to I could go on ebay, buy a bunch of parts, build my own system, install my own OS, compile a bunch of junk, tinker with config files for days, and then fire up an extra generator to cope with the 2-4x higher power requirements. For all that work I might save a couple of grand and will be able to actually do less with it. Or... I could just buy a GB10 device and turn it on.

                It comes preconfigured to run headless and use the NVIDIA ecosystem. Mine has literally never had a monitor attached to it. NVIDIA has guides and playbooks, preconfigured docker containers, and documentation to get me up and developing in minutes to hours instead of days or weeks. If it breaks I just factory reset it. On top of that it has the added benefit of 200GbE QSFP networking that would cost $1,500 on its own. If I decide I need more oomph and want a cluster I just buy another one and connect them, then copy/paste the instructions from NVIDIA.

                • saagarjha 12 minutes ago

                  You could also pay someone $5 an hour and they’ll give you a better machine for similar hassle.

  • kachapopopow 9 hours ago

    Dell fixing issues instead of creating new ones? That's a new one for me. Would still rather not deal with their firmware updaters though.

    • cjbgkagh 8 hours ago

      Give them a chance, I’m sure they’ll add new issues in one of their monthly BIOS updates.

      • kachapopopow 8 hours ago

        nothing beats perfectly good vendor firmware updates packaged in an obscenely complicated bash file that just extracts the tool and runs it while performing unnecessary and often broken validation that only runs on hardware that is part of their ecosystem (ex: dell nic on non dell chassis).

  • npalli 6 hours ago

    Seems you are paying the Dell tax of 15%. The same setup is $4K from NVIDIA and Lenovo, and $3K for 1TB at Asus.

    https://www.dell.com/en-us/shop/desktop-computers/dell-pro-m...

  • dagaci 7 hours ago

    A nice little AI review with a comparison of CPU, power draw, and networking. I would be interested in seeing a fine-tuning comparison too. I think pricing was missing also.

    • geerlingguy 5 hours ago

      I've been working on fine tuning testing, it's something I hope to set up for comparison against the Mac Studio and Framework Desktop clusters soon.

  • cat_plus_plus 5 hours ago

    I have a slightly cheaper similar box, the NVIDIA Thor Dev Kit. The point is exactly to avoid deploying code to servers that cost half a million dollars each. It's quite capable of running or training smart LLMs like Qwen3-Next-80B-A3B-Instruct-NVFP4. So long as you don't tear your hair out first figuring out peculiarities and fighting with bleeding edge nightly vLLM builds.

    • echion 4 hours ago

      > training smart LLMs like Qwen3-Next-80B-A3B-Instruct-NVFP4

      Sounds interesting; can you suggest any good discussions of this (on the web)?

  • postalrat 4 hours ago

    Spark's biggest pain point is the price. Does it fix that?

    • bigyabai an hour ago

      There's an entire line of Linux-supported Jetson products available for your perusal, in addition to all of the GTX and RTX cards that have native ARM64 support.

  • barelysapient 6 hours ago

    Great article but would be nice to see how larger models work.

  • nightski 4 hours ago

    It's a product without a purpose.

  • colordrops 8 hours ago

    I assume they didn't fix the memory bandwidth pain point though.

    • llm_nerd 8 hours ago

      The memory bandwidth limitation is baked into the GB10, and every vendor is going to be very similar there.

      I'm really curious to see how things shift when the M5 Ultra with "tensor" matmul functionality in the GPU cores rolls out. This should be a multiple-fold speedup for that platform.

      • storus 8 hours ago

        My guess is M5 Ultra will be like DGX Spark for token prefill and M3 Ultra for token generation, i.e. the best of both worlds, at FP4. Right now you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part; with M5U that should no longer be necessary. However, given the RAM price situation, I am wondering if M5U will ever get close to the price/performance of Spark + M3U we have right now.

        • echion 4 hours ago

          > you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part

          Are you doing this with vLLM, or some other model-running library/setup?

      • kristianp 3 hours ago

        The M3 ultra was released about 18 months after the original M3, so you could be waiting a while for the M5 Ultra.

    • cat_plus_plus 5 hours ago

      At least for transformers, it can be kind of fixed with MoE + NVFP4, which gives a small working set despite a large resident size.