Rust Threads on the GPU

(vectorware.com)

121 points | by PaulHoule 6 days ago

48 comments

  • nynx 2 days ago

    I don’t understand why this is a useful effort. It seems like a solution in search of a problem. It’s going to be incredibly easy to end up with hopelessly inefficient programs that need a full redesign in a normal GPU programming model to be useful.

    • LegNeato 2 days ago

      Founder here.

      1. Programming GPUs is a problem. The ratio of CPUs to CPU programmers versus GPUs to GPU programmers is massively out of whack. Not because GPU programming is less valuable or lucrative, but because GPUs are weird and the tools are weird.

      2. We are more interested in leveraging existing libraries than running existing binaries wholesale (mostly within a warp). But, running GPU-unaware code leaves a lot of space for the compiler to move stuff around and optimize things.

      3. The compiler changes are not our product, the GPU apps we are building with them are. So it is in our interest to make the apps very fast.

      Anyway, skepticism is understandable and we are well aware code wins arguments.

      • electronsoup a day ago

        > the GPU apps we are building with them are

        I can't help but get the feeling you have a use-case end-goal in mind that's opaque to many of us who are GPU-ignorant.

        It could be helpful if there were an example of the type of application that would be nicer to express through your abstractions.

        (I think what you've shown so far is super cool btw)

        • LegNeato a day ago

          Agreed, and thank you.

      • ghighi7878 2 days ago

        Good point about gpu threads being equivalent to warps.

      • jzombie 2 days ago

        Do you foresee this being faster than SIMD for things like cosine similarity? Apologies if I missed that context somewhere.

        • LegNeato 2 days ago

          It depends. At VectorWare we are a bit of an extreme case in that we are inverting the relationship and making the GPU the main loop that calls out to the CPU sparingly. So in that model, yes. If your code is run in a more traditional model (CPU driving and using the GPU as a coprocessor), probably not. Going across the bus dominates most workloads. That being said, the traditional wisdom is becoming less relevant as integrated memory is popping up everywhere and tech like GPUDirect exists with the right datacenter hardware.

          These are the details we intend to insulate people from so they can just write code and have it run fast. There is a reason why abstractions were invented on the CPU and we think we are at that point for the GPU.

          (for the datacenter folks I know hardware topology has a HUGE impact that software cannot overcome on its own in many situations)

      • amelius 2 days ago

        > The ratio of CPUs to CPU programmers and GPUs to GPU programmers is massively out of whack.

        These days I just ask an LLM to write my optimized GPU routines.

        • 2 days ago
          [deleted]
      • shmerl 2 days ago

        > because GPUs are weird and the tools are weird.

        Why is it also that the terminology is so all over the place? Subgroups, wavefronts, warps, etc. all refer to the same concept. That doesn't help.

        • adrian_b 2 days ago

          This is the fault of NVIDIA, who, instead of using the terms that had been used for decades in computer science before them for things like vector lanes, processor threads, processor cores etc., have invented a new jargon by replacing each old word with a new word, in order to obfuscate how their GPUs really work.

          Unfortunately, ATI/AMD has slavishly imitated many things initiated by NVIDIA, so soon afterwards they created their own jargon by replacing every word used by NVIDIA with a different word, also different from the traditional word, enhancing the confusion. The worst is that the NVIDIA jargon and the AMD jargon sometimes reuse traditional terms while giving them different meanings, e.g. an NVIDIA thread is not what a "thread" normally means.

          Later standards, like OpenCL, have attempted to make a compromise between the GPU vendor jargons, instead of going back to a more traditional terminology, so they have only increased the number of possible confusions.

          So to be able to understand GPUs, you must create a dictionary with word equivalences: traditional => NVIDIA => ATI/AMD (e.g. IBM 1964 task = Vyssotsky 1966 thread => NVIDIA warp => AMD wavefront).

        • MindSpunk 2 days ago

          All the names for waves come from different hardware and software vendors adopting names for the same or similar concept.

          - Wavefront: AMD, comes from their hardware naming

          - Warp: Nvidia, comes from their hardware naming for largely the same concept

          Both of these were implementation detail until Microsoft and Khronos enshrined them in the shader programming model independent of the hardware implementation so you get

          - Subgroup: Khronos' name for the abstract model that maps to the hardware

          - Wave: Microsoft's name for the same

          They all describe mostly the same thing so they all get used and you get the naming mess. Doesn't help that you'll have the API spec use wave/subgroup, but the vendor profilers will use warp/wavefront in the names of their hardware counters.

          • raphlinus 2 days ago

            You can add to this the Apple terminology, which is simdgroup. This reinforces your point – vendors have a tendency to invent their own terminology rather than use something standard.

            • amelius 2 days ago

              Rule #1 in not getting involved in any patent lawsuit: don't use the same terminology as your competitors.

            • coffeeaddict1 a day ago

              I have to give it to Apple though in this case. Waves or warps are ridiculously uninformative, while simdgroups at least convey some useful information.

      • seivan a day ago

        [dead]

    • zozbot234 2 days ago

      It looks like they're trying to map the entire "normal GPU programming model" to Rust code, including potentially things like GPU "threads" (to SIMD lanes + masked/predicated execution to account for divergence) and the execution model where a single GPU shader is launched in multiple instances with varying x, y and z indexes. In this context, it makes sense to map the GPU "warp" to a Rust thread since GPU lanes, even with partially independent program counters, still execute in lockstep much like CPU SIMD/SPMD or vector code.
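
      The masked/predicated execution mentioned above can be sketched on the CPU roughly as follows (purely illustrative; the function name and the branchless-select framing are made up, not any vendor's API):

      ```rust
      // Hypothetical CPU-side sketch of masked/predicated execution:
      // instead of branching per lane, both sides of the branch are
      // computed for every lane and a per-lane mask selects the result,
      // keeping all lanes in lockstep.
      fn predicated_step(xs: &[i32]) -> Vec<i32> {
          xs.iter()
              .map(|&x| {
                  let mask = x >= 0;   // per-lane predicate
                  let if_true = x * 2; // "then" branch, always computed
                  let if_false = -x;   // "else" branch, always computed
                  // hardware would do a select here, not a jump
                  if mask { if_true } else { if_false }
              })
              .collect()
      }

      fn main() {
          let out = predicated_step(&[3, -4, 0, -1]);
          assert_eq!(out, vec![6, 4, 0, 1]);
          println!("{:?}", out);
      }
      ```

      The point of the pattern is that divergence costs the work of both branches, but control flow stays uniform across the warp.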

    • rl3 2 days ago

      I think they've taken the integration difficulty into account.

      Besides, full redesign isn't so expensive these days (depending).

      >It seems like a solution in search of a problem.

      Agreed, but it'll be interesting to see how it plays out.

  • kevmo314 2 days ago

    Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow, in fact they're quite a bit faster than any single GPU thread. If code is written in a GPU unaware way it's not going to take advantage of the reasons for being on the GPU in the first place.

    • lmeyerov 2 days ago

      We have this issue in GFQL right now. We wrote the first OSS GPU Cypher query language impl, where we make a query plan of GPU-friendly collective operations... But today the steps are coordinated via Python, which has high constant overheads.

      We are looking to shed some of the Python <-> C++ <-> GPU overheads by pushing macro steps out of Python and into C++. However, it'd probably be way better to skip all the CPU <-> GPU back-and-forth by coordinating the task queue on the GPU to begin with. It's 2026, so ideally we can use modern tools and type safety for this.

      Note: I looked at the company's GitHub and didn't see any relevant OSS, which changes the calculus for a team like ours. Sustainable infra is hard!

    • fooker a day ago

      > It's not like CPUs are slow, in fact they're quite a bit faster than any single GPU thread.

      This was overwhelmingly true ten years ago, not so much now.

      Modern GPU threads run at about 3 GHz; CPUs are still slightly faster in theory, but the larger amounts of fast local memory make GPU threads pretty competitive in practice.

      • kevmo314 19 hours ago

        Are you writing this from the future? The latest-gen Nvidia GPUs sit at around 2-2.5 GHz and the latest-gen AMD CPUs sit at 4-5 GHz.

        That matches my personal experience too: naive CUDA code that doesn’t take advantage of parallelism runs at roughly half the speed of the same code on a CPU.

    • pjmlp a day ago

      Additionally there is still too much performance left on the table by not properly using CPU vector units.

      • fooker a day ago

        SIMD performance in modern Intel and AMD CPUs is so bad that it is useless outside very specific circumstances.

        This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.

        The shared resources often involve floating-point registers and compute, so it's a double whammy.

        • pjmlp a day ago

          Yet it is still faster than doing nothing, or than calling into the GPU, on workloads where bus traffic takes the majority of execution time.

          • fooker a day ago

            The comparison is often just plain old linear code.

            For example, one simd instruction vs multiple arithmetic instructions.

              x1 += y1
              x2 += y2
              x3 += y3
              x4 += y4
              
            We have fifty years of CPU design optimizing for this. More often than not, you'll find this works better than vector instructions in practice.

            The concept behind vector instructions is great, and it starts to work out for larger widths like 512 bits. But it's extremely tricky to take advantage of that much SIMD with a compiler or manually.
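
            For what it's worth, the two forms being compared could be sketched in Rust like this (illustrative only; the names are made up, and whether the loop form actually compiles to vector instructions depends on the target and compiler flags):

            ```rust
            // Four independent scalar adds: decades of out-of-order CPU
            // design handle these extremely well.
            fn add_unrolled(x: &mut [f32; 4], y: &[f32; 4]) {
                x[0] += y[0];
                x[1] += y[1];
                x[2] += y[2];
                x[3] += y[3];
            }

            // The same work as a loop; with the right target features the
            // compiler may turn this into a single SIMD instruction.
            fn add_loop(x: &mut [f32], y: &[f32]) {
                for (a, b) in x.iter_mut().zip(y) {
                    *a += *b;
                }
            }

            fn main() {
                let mut a = [1.0f32, 2.0, 3.0, 4.0];
                let mut b = a;
                let y = [10.0f32, 20.0, 30.0, 40.0];
                add_unrolled(&mut a, &y);
                add_loop(&mut b, &y);
                assert_eq!(a, b);
                assert_eq!(a, [11.0, 22.0, 33.0, 44.0]);
            }
            ```

            Both compute the same result; the argument above is about which one the hardware actually executes faster in a mixed workload.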

            • nian_yu 11 hours ago

              Hello, where can I read more about this? It's the first time I've heard that SIMD has drawbacks, and I'm interested in hearing more differing opinions.

            • pjmlp a day ago

              Yet there are gains from doing e.g. string searches with SIMD, which you naturally aren't going to do in CUDA.

              • fooker a day ago

                For sure, it makes sense for nice, well-defined problems that execute in isolation.

                Think of the situation where the string search is running on a system that has hyper threading and a bunch of cores, and a normal amount of memory bandwidth.

                It'll be faster, but at the same time make everything else worse if you overuse vector instructions.

                (also cherry on top: some modern CPUs automagically lower the clock when they encounter vector instructions!!!)

    • imtringued 2 days ago

      I've seen this objection pop up every single time and I still don't get it.

      GPUs run 32, 64, or even 128 vector lanes at once. If you have a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence, etc., how is it supposed to be slower?

      Consider the following:

      You have a hyperoptimized matrix multiplication kernel and you also have your inference engine code that previously ran on the CPU. You now port the critical inference engine code to directly run on the GPU, thereby implementing paged attention, prefix caching, avoiding data transfers, context switches, etc. You still call into your optimized GPU kernels.

      Where is the magical slowdown supposed to come from? The mega kernel researchers are moving more and more code to the GPU and they got more performance out of it.

      Is it really that hard to understand that the CUDA style programming model is inherently inflexible and limiting? I think the fundamental problem here is that Nvidia marketing gave an incredibly misleading perception of how the hardware actually works. GPUs don't have thousands of cores like CUDA Core marketing suggests. They have a hundred "barrel CPU"-like cores.

      The RTX 5090 is advertised to have 21760 CUDA cores. This is a meaningless number in practice, since "CUDA cores" are purely a software concept that doesn't exist in hardware. The vector processing units are not cores. The RTX 5090 actually has 170 streaming multiprocessors, each with its own instruction pointer that you can target independently, just like a CPU. The key restriction is that if you want maximum performance you need to take advantage of all 128 lanes, and you also need enough thread copies, differing only in the subset of data they process, so that the GPU can switch between them while it is working on multi-cycle instructions (memory loads and the like). That's it.

      Here is what you can do: you can take a bunch of streaming processors, let's say 8, and use them to run your management code on the GPU side without having to transfer data back to the CPU. When you want to do heavy lifting you are in luck, because you still have 162 streaming processors left to do whatever you want. You proceed to call into cuDNN and get great performance.
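
      The "enough thread copies that only differ in the data they process" model described above could be sketched on the CPU like this (a toy analogy using std::thread, with made-up WARPS/LANES numbers; not the actual compiler output of any GPU toolchain):

      ```rust
      use std::thread;

      // Each spawned OS thread stands in for one warp; its body is a
      // uniform loop over 128 "lanes" that a vector unit could execute
      // in lockstep. The only thing that differs between "warps" is
      // which slice of the data they touch.
      const LANES: usize = 128;
      const WARPS: usize = 8;

      fn run_warps() -> u64 {
          let handles: Vec<_> = (0..WARPS)
              .map(|warp_id| {
                  thread::spawn(move || {
                      // Every lane in this "warp" runs the same code path.
                      let mut acc = 0u64;
                      for lane in 0..LANES {
                          acc += (warp_id * LANES + lane) as u64;
                      }
                      acc
                  })
              })
              .collect();
          handles.into_iter().map(|h| h.join().unwrap()).sum()
      }

      fn main() {
          // Sum of 0..1024 = 1024 * 1023 / 2 = 523776.
          assert_eq!(run_warps(), 523_776);
          println!("total = {}", run_warps());
      }
      ```

      On a real GPU the scheduler interleaves many such warp copies to hide memory latency; here the OS scheduler plays that role very loosely.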

      • winwang an hour ago

        Each SM should have 4 independent SMSPs (32 lanes each), no? Effectively a "4-core" task-parallel system per SM.

      • Bimos 2 days ago

        > a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

        But the library is using a warp as a single thread

      • kevmo314 a day ago

        > a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

        Sure, if you have that then of course it would be fast. But that’s not what this library is proposing.

      • monideas 2 days ago

        I really appreciate the way you've explained this. Are there any resources you recommend to reach your level of understanding?

  • cbHXBY1D a day ago

    I've been using Rust's Burn library recently and have avoided writing kernels in CubeCL because of its lack of documentation and my lack of experience. I would love to see some collaboration here.

  • gpm 2 days ago

    Is this proprietary, or something I can play around with? I can't find a repo.

    • LegNeato 2 days ago

      It is not, we just haven't yet upstreamed everything.

  • Talderigi a day ago

    If you map Rust threads to warps, aren’t we basically turning the GPU into a very expensive CPU?

    • zozbot234 a day ago

      This blog post doesn't address how GPU "threads" can be mapped to Rust SIMD/SPMD "lanes" yet, though it hints at that. I assume that this is planned to be a topic for a future blog post.

      I'd like to understand how the overall number of "warps" to be launched on the GPU is determined. Is it fixed at shader launch, or can warps be created and destroyed on demand? If it's fixed, these are more like CPU-side "virtual processors" (in OS terminology) than true OS "threads".

    • hgomersall a day ago

      It makes sense when the inner operations are vectorisable, as in the example.

  • 20k 2 days ago

    This programming model seems like the wrong one, and I think it's based on some faulty assumptions.

    >Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur

    This is extremely problematic - being able to write divergent code between lanes is good. Virtually all high performance GPGPU code I've ever written contains divergent code paths!

    >The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution

    This is where I think it falls apart a bit, and we need to dig into GPU architecture to find out why. A lot of people think that GPUs are a bunch of executing threads that are grouped into warps executing in lockstep. This is an overly restrictive model of how they work that misses a lot of the reality.

    GPUs are a collection of threads that are broken up into local work groups. These share L2 cache, which can be used for fast intra-work-group communication. Work groups are split up into subgroups - which map to warps - that can communicate extra fast.

    This is the first problem with this model: it neglects the local work group execution unit. To get adequate performance, you have to set this value much higher than the size of a warp, at least 64 for a 32-sized warp. In general though, 128-256 is a better size. Different warps in a local work group make true independent progress, so if you take this into account in Rust, it's a bad time and you'll run into races. To get good performance and cache management, these warps need to be executing the same code. Trying to have a task per warp is a really bad move for performance.

    >Each warp has its own program counter, its own register file, and can execute independently from other warps

    The second problem is: it used to be true that all threads in a warp would execute in lockstep, and strictly have on/off masks for thread divergence, but this is strictly no longer true for modern GPUs, the above is just wrong. On a modern GPU, each *thread* has its own program counter and callstack, and can independently make forward progress. Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, it's just something you have to manage - and hardware architectures are rapidly improving here.

    Say we have two warps, both running the same code, where half of each warp splits at a divergence point. Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level. But notice that to get this hardware acceleration, we need to actually use the GPU programming model to its fullest

    The key mistake is to assume that the current warp model is always going to stick rigidly to being strictly wide SIMD units with a funny programming model, but we already ditched that concept a while back on GPUs, around the Pascal era. As time goes on this model will only increasingly diverge from how GPUs actually work under the hood, which seems like an error. Right now, even with just the local work group problems, I'd guess you're leaving ~50% of your performance on the table, which seems like a bit of a problem when the entire reason to use a GPU is performance!

    • david-gpu 2 days ago

      > Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level

      Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

      > Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

      I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.

      • 20k a day ago

        >I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.

        It's a bit like saying that writing code at all is bad, though. Divergence isn't desirable, but neither is running any code at all - sometimes you need it to solve a problem.

        Not supporting divergence at all is a huge mistake IMO. It isn't good, but sometimes it's necessary.

        >Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

        https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

        My understanding is that this is fully transparent to the programmer; it's just more advanced scheduling for threads. SER is something different entirely.

        Nvidia are a bit vague here, so you have to go digging into patents if you want more information on how it works

    • imtringued 2 days ago

      >The second problem is: it used to be true that all threads in a warp would execute in lockstep, and strictly have on/off masks for thread divergence, but this is strictly no longer true for modern GPUs, the above is just wrong. On a modern GPU, each thread has its own program counter and callstack, and can independently make forward progress. Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

      I haven't found any evidence of the individual program counter thing being true beyond one niche application: running mutexes for a single vector lane, which is not a performance optimization at all. In fact, you are serializing the performance in the worst way possible.

      From a hardware design perspective it is completely impractical to implement independent instruction pointers other than maybe as a performance counter. Each instruction pointer requires its own read port on the instruction memory and adding 32, 64 or 128 read ports to SRAM is prohibitively expensive, but even if you had those ports, divergence would still lead to some lanes finishing earlier than others.

      What you're probably referring to is a scheduler trick that Nvidia has implemented where they split a streaming processor thread with divergence into two masked streaming processor threads without divergence. This doesn't fundamentally change anything about divergence being bad, you will still get worse performance than if you had figured out a way to avoid divergence. The read port limitations still apply.
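
      The "split one divergent warp into two masked passes" behavior described above can be sketched as (a toy CPU model; the predicate and branch bodies are made up for illustration, and no real driver exposes it this way):

      ```rust
      // Rough sketch of how lockstep hardware handles a divergent branch:
      // the warp runs the "then" side with one active mask, then the
      // "else" side with the complementary mask, serializing the two
      // paths. Inactive lanes sit idle during each pass.
      fn run_divergent_warp(inputs: &[i32]) -> Vec<i32> {
          let mask: Vec<bool> = inputs.iter().map(|&x| x % 2 == 0).collect();
          let mut out = vec![0; inputs.len()];

          // Pass 1: only lanes where the predicate is true execute.
          for (i, &x) in inputs.iter().enumerate() {
              if mask[i] {
                  out[i] = x / 2;
              }
          }
          // Pass 2: the remaining lanes execute; the others sit idle.
          for (i, &x) in inputs.iter().enumerate() {
              if !mask[i] {
                  out[i] = 3 * x + 1;
              }
          }
          out
      }

      fn main() {
          let out = run_divergent_warp(&[4, 5, 6, 7]);
          assert_eq!(out, vec![2, 16, 3, 22]);
      }
      ```

      Either way the warp pays for both paths; the argument here is only about whether splitting the masks across scheduler slots changes that fundamental cost.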

      • 20k a day ago

        Threads have individual program counters according to Nvidia, and have had them for nearly 10 years.

        https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

        > the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity

        Divergence isn't good, but sometimes it's necessary - not supporting it in a programming model is a mistake. There are some problems you simply can't solve without it, and in some cases you absolutely will get better performance by using divergence.

        People often avoid divergence by writing an algorithm that does effectively what Pascal and earlier GPUs did, which is unconditionally doing all the work on every thread. That will give worse performance than just having a branch, because of the better hardware scheduling these days.