Why CUDA translation wont unlock AMD

(eliovp.com)

85 points | by JonChesterfield 8 days ago ago

83 comments

  • lvl155 a day ago ago

    Let’s just say what it is: devs are too constrained to jump ship right now. It’s a massive land grab, and you are not going to spend time tinkering with CUDA alternatives when even a six-month delay can basically kill your company/organization. Google and Apple are two companies with enough resources to do it. Google isn’t, because they’re keeping it proprietary to their cloud. Apple still has its head stuck in the sand, barely capable of fixing Siri.

  • mandevil a day ago ago

    Yeah, ROCm-focused code will always beat generic code compiled down. But this is a really difficult game to win.

    For example, Deepseek R-1 was released optimized for running on Nvidia HW, and needed some adaptation to run as well on ROCm. This happened for exactly the same reasons that ROCm-focused code beats generic code compiled down to ROCm. Basically, the Deepseek team, for their own purposes, built R-1 to fit Nvidia's way of doing things (because Nvidia is market-dominant). Once they released it, someone like Elio or AMD had to do the work of adapting the code to run best on ROCm.

    More established players who aren't out-of-left-field surprises like Deepseek, e.g. Meta with its Llama series, mostly coordinate with AMD ahead of release day, but I suspect that AMD still has to pay for the engineering work itself while Meta does the work to make it run on Nvidia themselves. This simple fact, that every researcher makes their stuff work on CUDA themselves while AMD or someone like Elio has to do the work to make it just as performant on ROCm, is what keeps people in the CUDA universe.

    • latchkey a day ago ago

      Kimi is the latest model that isn't running correctly on AMD. Apparently close to Deepseek in design, but different enough that it just doesn't work.

      It isn't just the model, it is the engine to run it. From what I understand this model works with sglang, but not with vLLM.

      • suprjami a day ago ago

        This is normal. An inference engine needs support for a model's particular implementation of the transformer architecture. This has been true for almost every model release since we got local weights.

        Really good model providers send a launch-day patch to llama.cpp and vllm to make sure people can run their model instantly.

        • latchkey a day ago ago

          It isn't about whether it's normal or not. It's that those patches are done for Nvidia, but not AMD, and that it takes time and energy to vet them and merge them into those projects. Kimi has been out for 3 months now and it still doesn't run out of the box on vLLM on AMD, but it works just fine with Nvidia.

  • jfalcon 10 hours ago ago

    CUDA isn't all that and a bag of chips. It's just the Facebook/Twitter of the data science and, by extension, LLM space. There are tensor processors and other ASICs for specific compute functions that can give Nvidia a challenge, but every gamer knows there has always been a performance difference between Nvidia and AMD/ATI.

    Ok, point made Nvidia. Kudos.

    ATI had their moment in the sun before ASICs ate their cryptocurrency lunch. So both still had/have relevance outside gaming. But I see Intel is starting to take the GPU space seriously, and they shouldn't be ruled out.

    And as mentioned elsewhere in the comments, there is Vulkan. There is also the idea of the virtualized GPU, since the bottleneck isn't the CPU anymore... it's now the GPU. As I mentioned, there are tensor processors, and with Moore's Law thresholds coming back again at 1-nanometer manufacturing, there is going to be a point where we hit a wall with current chips and we will have a change in technology - again.

    So while Nvidia is living the life - unless they have a crystal ball for where tensors are going that they can steer CUDA towards, there is a "co-processor" future coming up, and with it the next step towards NPUs will be taken. This is where Apple is aligning itself because, after all, they had the money and just said "Nope, we'll license this round out..."

    AMD isn't out yet. They, along with Intel and others, just need to figure out where the next bottlenecks are and build those toll bridges.

  • buggyworld a day ago ago

    This reminds me of the database wire protocol debates. PostgreSQL-compatible databases (like Aurora, Neon, Supabase) achieve compatibility by speaking the Postgres wire protocol, but the truly successful ones don't just translate—they rebuild core components to leverage their own architecture (Aurora's storage layer, Neon's branching, etc.).

    The article frames this as "CUDA translation bad, AMD-native good" but misses the strategic value of compatibility layers: they lower switching costs and expand the addressable market. NVIDIA's moat isn't just technical—it's the ecosystem inertia. A translation layer that gets 80% of NVIDIA performance might be enough to get developers to try AMD, at which point AMD-native optimization becomes worth the investment.

    The article is essentially a product pitch for Paiton disguised as technical analysis. The real question isn't "should AMD hardware pretend to be CUDA?" but rather "what's the minimum viable compatibility needed to overcome ecosystem lock-in?" PostgreSQL didn't win by being incompatible—it won by being good AND having a clear migration path from proprietary databases.

    • throwaway31131 12 hours ago ago

      I don't think this was the point of the post at all.

      Their bottom line summed it up perfectly.

      "We’re not saying “never use CUDA-on-AMD compilers or CUDA-to-HIP translators”. We’re saying don’t judge AMD based on them."

    • KetoManx64 a day ago ago

      Which LLM did you use to write this?

  • fulafel a day ago ago

    Vulkan Compute is catching up with HIP (or whatever the compatibility stuff is called now), which seems like a welcome break from CUDA - in the linked benchmarks it even beats the ROCm/HIP backend in some cases on AMD: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan

  • apitman a day ago ago

    Our open source library is currently hard locked into CUDA due to nvCOMP for gzip decompression (bioinformatics files). What I wouldn't give for an open source implementation, especially if it targeted WebGPU.

  • martinald a day ago ago

    Perhaps I'm misunderstanding the market dynamics, but isn't AMD's real opportunity inference rather than research?

    Training etc. still happens on NVDA, but can't inference be done on vLLM et al. with a true ROCm backend with little effort?

  • kj4ips a day ago ago

    I agree pretty strongly. A translation layer like this is making an intentional trade: giving up performance and HW alignment in exchange for less lead time and effort than a proper port would take.

  • manjose2018 a day ago ago

    https://geohot.github.io//blog/jekyll/update/2025/03/08/AMD-...

    https://tinygrad.org/ is the only viable alternative to CUDA that I have seen pop up in the past few years.

    • djsjajah a day ago ago

      I can't tell if you are making a joke or not.

      They are not even remotely equivalent. tinygrad is a toy.

      If you are serious, I would be interested to hear how you see tinygrad replacing CUDA. I could see a tinygrad zealot arguing that it is going to replace torch, but CUDA??

      Have you looked into AMD support in torch? I would wager that, like for like, a torch/amd implementation of a model is going to run rings around a tinygrad/amd implementation.

    • erichocean a day ago ago

      Both Mojo and ThunderKittens/HipKittens are viable on AMD.

    • bigyabai a day ago ago

      Viable how? "Feasible" might be a better word here, I haven't heard many (any?) war-stories about a TinyBox in production but maybe I'm OOTL.

  • outside1234 a day ago ago

    Are the hyperscalers really using CUDA? This is what really matters. We know Google isn't. Are AWS and Azure for their hosting of OpenAI models et al?

    • wmf a day ago ago

      All Nvidia GPUs, which are probably >70% of the market, use CUDA.

    • bigyabai a day ago ago

      > We know Google isn't.

      Google isn't internally, so far as we know. Google's hyperscaler products have long offered CUDA options, since the demand isn't limited to AI/tensor applications that cannibalize TPU's value prop: https://cloud.google.com/nvidia

  • jmward01 a day ago ago

    Right now we need diversity in the ecosystem. AMD is finally getting mature, and hopefully that will lead to them truly getting a second, strong opinion into the ecosystem. The friction this article talks about is needed to push new ideas.

  • doctorpangloss a day ago ago

    All they have to do is release air cooled 96GB GDDR7 PCIe5 boards with 4x Infinity Link, and charge $1,900 for it.

  • latchkey a day ago ago

    A bit of background. This is directed towards Spectral Compute (Michael) and https://scale-lang.com/. I know both of these guys personally and consider them both good friends, so you have to understand a bit of the background in order to really dive into this.

    My take on it is fairly well summed up at the bottom of Elio's post. In essence, Elio is taking the view of "we would never use scale-lang for llms because we have a product that is native AMD" and Michael is taking the view of "there is a ton of CUDA code out there that isn't just AI and we can help move those people over to AMD... oh and by the way, we actually do know what we are doing, and we think we have a good chance at making this perform."

    At the end of the day, both companies (my friends) are trying to make AMD a viable solution in a world dominated by an ever growing monopoly. Stepping back a bit and looking at the larger picture, I feel this is fantastic and want to support both of them in their efforts.

    • Eliovp 19 hours ago ago

      Just to clarify: this post was not written against Spectral Compute. Their recent investment news was the trigger for us to finally write it, yes, but the idea has been on our minds for a long time.

      We actually think solutions like theirs are good for the ecosystem: they make it easier for people to at least try AMD without throwing away their CUDA code.

      Our point is simply this: if you want top-end performance (big LLMs, specific floating point support, serious throughput/latency), translation alone is not enough. At that point you have to focus on hardware-specific tuning: CDNA kernel shapes, MFMA GEMMs, ROCm-specific attention/TP, KV-cache, etc.

      That’s the layer we work on: we don’t replace people’s engines, we just push the AMD hardware as hard as it can go.
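
      To make that concrete, here is a minimal HIP sketch, purely illustrative and not taken from Paiton or the article (the helper names are made up for illustration), of one small hardware detail a straight CUDA port tends to carry over: warp width. CUDA kernels routinely hardcode 32-lane warps, while CDNA GPUs execute 64-wide wavefronts, so a mechanically translated reduction can end up covering only half of each wavefront.

          #include <hip/hip_runtime.h>
          #include <cstdio>
          #include <vector>

          // Reduction ported as-is from CUDA: hardcodes 32 lanes, so on a 64-wide
          // CDNA wavefront lane 0 only accumulates the first 32 lanes' values.
          __device__ float warp_sum_ported(float v) {
              for (int offset = 16; offset > 0; offset >>= 1)
                  v += __shfl_down(v, offset);
              return v;
          }

          // Wavefront-aware version: starts at warpSize / 2 (32 on CDNA), so one
          // wavefront reduces 64 values in the same number of shuffle steps.
          __device__ float wave_sum_native(float v) {
              for (int offset = warpSize / 2; offset > 0; offset >>= 1)
                  v += __shfl_down(v, offset);
              return v;
          }

          __global__ void reduce(const float* in, float* out, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              float v = (i < n) ? in[i] : 0.0f;
              // Swapping in warp_sum_ported here would silently drop half of each
              // wavefront's contribution on CDNA hardware.
              v = wave_sum_native(v);
              if ((threadIdx.x % warpSize) == 0)
                  atomicAdd(out, v);  // one atomic per wavefront
          }

          int main() {
              const int n = 1 << 20;
              std::vector<float> h(n, 1.0f);
              float *d_in, *d_out, zero = 0.0f, result = 0.0f;
              hipMalloc(&d_in, n * sizeof(float));  // error checks omitted for brevity
              hipMalloc(&d_out, sizeof(float));
              hipMemcpy(d_in, h.data(), n * sizeof(float), hipMemcpyHostToDevice);
              hipMemcpy(d_out, &zero, sizeof(float), hipMemcpyHostToDevice);
              reduce<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
              hipMemcpy(&result, d_out, sizeof(float), hipMemcpyDeviceToHost);
              printf("sum = %.0f (expected %d)\n", result, n);
              hipFree(d_in);
              hipFree(d_out);
              return 0;
          }

      The same kind of mismatch shows up at a much larger scale in MFMA GEMM tiling, LDS usage and attention kernels, which is presumably why translation alone tends to leave performance on the table.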

  • measurablefunc a day ago ago

    [flagged]

    • jsheard a day ago ago

      The article is literally about how rote translation of CUDA code to AMD hardware will always give sub-par performance. Even if you wrangled an AI into doing the grunt work for you, porting heavily-NV-tuned code to not-NV-hardware would still be a losing strategy.

      • measurablefunc a day ago ago

        The point of AI is that it is not a rote translation & 1:1 mapping.

        • jsheard a day ago ago

          > Take the ROCm specification, take your CUDA codebase, let one of the agentic AIs translate it all into ROCm

          ...sounds like asking for a 1:1 mapping to me. If you meant asking the AI to transmute the code from NV-optimal to AMD-optimal as it goes along, you could certainly try doing that, but the idea is nothing more than AI fanfic until someone shows it actually working.

          • measurablefunc a day ago ago

            Now that I have clarified the point about AI optimizing the code from CUDA to fit AMD's runtime, what is your contention about the possibility of such a translation?

            • bigyabai a day ago ago

              There is an old programmer's joke about writing abstractions and expecting zero-cost.

              • measurablefunc a day ago ago

                How does that apply in this case? The whole point is that the agentic AI/AGI skips all the abstractions & writes optimized low-level code for each GPU vendor from a high-level specification. There are no abstractions other than whatever specifications GPU vendors provide for their hardware which are fed into the agentic AI/AGI to do the necessary work of creating low-level & optimized code for specific tasks.

                • a day ago ago
                  [deleted]
    • cbarrick a day ago ago

      Has this been done successfully at scale?

      There's a lot of handwaving in this "just use AI" approach. You have to figure out a way to guarantee correctness.

      • measurablefunc a day ago ago

        There are tons of test suites so if the tests pass then that provides a reasonable guarantee of correctness. Although it would be nice if there was also proof of correctness for the compilation from CUDA to AMD.

    • bee_rider a day ago ago

      The AI is too busy making Ghibli profile pictures or whatever the thing is now.

      We asked it to make a plan for how to fix the situation, but it got stuck.

      “Ok, I’m helping the people build an AI to translate NVIDIA codes to AMD”

      “I don’t have enough resources”

      “Simple, I’ll just use AMD chips to run an AI code translator, they are under-utilized. I’ll make a step by step process to do so”

      “Step 1: get code kernels for the AMD chips”

      And so on.

      • measurablefunc a day ago ago

        The real question is whether it will be as unprofitable to do this type of automated runtime translation from one GPU vendor to another as it is to generate Mario clips & Ghibli images.

    • j16sdiz a day ago ago

      The same as "Why not just outsource it to <some country>?"

      AI ain't magic.

      You need more effort to manage, test and validate that.

      • measurablefunc a day ago ago

        [flagged]

        • j16sdiz a day ago ago

          I am not saying this is impossible, but I am downvoting this because this is _not an interesting discussion_.

          The whole point of having an online discussion forum is to exchange and create new ideas. What you are advocating is essentially "maybe we can stop generating new ideas because we don't have to; we should just sit and wait"... Well, yes, no, maybe, but this is not what I expect to get from here.

          • measurablefunc a day ago ago

            You can do whatever you want & I didn't ask you to participate in my thread so unless you are going to address the actual points I'm making instead of telling me it is not interesting then we don't have anything to discuss further.

        • j16sdiz a day ago ago

          So, your strategy for solving this is to convert it into another, harder problem (AGI). Now it is somebody else's (the AI researchers') problem.

          This is outsourcing the task to AI researchers.

          • measurablefunc a day ago ago

            They keep promising that this kind of capability is right around the corner & they keep showing how awesome they are at passing math exams, so why is this a more difficult problem than solving problems in abstract algebra & scheme theory on Humanity's Last Exam or whatever the latest & greatest benchmark for mathematical capabilities is?

            • Daedren 20 hours ago ago

              They all have to make promises and have to dream big to keep the AI bubble from popping.

              • measurablefunc 6 hours ago ago

                I agree which is why it's a bit odd that so many people still think that Sam Altman & Elon Musk are honest technologists instead of unscrupulous grifters.

        • nutjob2 a day ago ago

          > Isn't AGI around the corner?

          There isn't even a concrete definition of intelligence, let alone AGI, so no it's not.

          That's just mindless hype at this point.

    • colonCapitalDee a day ago ago

      No. This is far beyond the capabilities of current AI, and will remain so for the foreseeable future. You could let your model of choice churn on this for months, and you will not get anywhere. It will be able to reach a somewhat working solution quickly, but it will soon reach a point where for every issue it fixes, it introduces one or more issues or regressions. LLMs are simply not capable of scaffolding complexity like a human, and lack the clarity and rigorousness of thought required to execute an *extremely* ambitious project like performant CUDA to ROCm translation.

      • impossiblefork 20 hours ago ago

        I don't think it really is, especially not if it's turned into a system, with multiple prompts, verification, etc.

        Humans have problems with IMO problems, and this kind of kernel translation is a problem that is easier for humans, where there's probably more data, and one where the system can get feedback by simply running the code and measuring memory use, runtime, etc.

        It'd be a system and no one has developed it, but I think it can be done with present LLMs as a core mechanism. They just need to be trained with RL on this specific problem.

        Anyone with a good LLM, from Google to Mistral could probably do this, but it'd be a project.

      • measurablefunc a day ago ago

        [flagged]

        • colonCapitalDee a day ago ago

          Well that's your problem. Here's a tip: just because someone says something doesn't mean you have to listen to them

        • bigyabai a day ago ago

          This explains everything.

    • imtringued 20 hours ago ago

      The AI needs a mental model of the hardware for that to work.

      • measurablefunc 10 hours ago ago

        Algorithms do not have mental models of anything.

    • Blackthorn a day ago ago

      I don't know why you're being downvoted, because even if you're Not Even Wrong, this is exactly the sort of thing that people trying to sell AI have endlessly presented as something AI will absolutely do for us.

      • measurablefunc a day ago ago

        [flagged]

        • bigyabai a day ago ago

          It's hard to catch on to a deliberately dishonest pretense. You could clone 10,000 John Carmacks to do the job for you, and Nvidia would still be a $5 trillion business next time you wake up.

          • measurablefunc a day ago ago

            [flagged]

            • bigyabai a day ago ago

              I'm not talking to them. I am responding to you - your sardonic piss-take is against HN guidelines and written in bad faith.

              • measurablefunc a day ago ago

                [flagged]

                • bigyabai a day ago ago

                  Sure, and thieves probably recommend that the cops move on & refrain from following where they're headed.

                  Be honest and you won't have to fend-off accusations of bad-faith. I'm inclined to agree with your overall point of AI being overhyped, but you've gutted your own logic so hard in the process that your stance is unrecognizable. You've developed a meaningfully ambiguous stance to an elaborate and deeply incorrect series of arguments.

                  • measurablefunc a day ago ago

                    [flagged]

                    • bigyabai 9 hours ago ago

                      I didn't even read the first iteration of your profile. If your stance can't be substantiated without hidden subtext, you're not making a good point.

                      Your future comments are definitely going to be flagged unless you switch to a good-faith writing style.

                      • measurablefunc 6 hours ago ago

                        Doesn't bother me either way but you can keep trying to pathologize instead of actually making substantive points to address anything I have actually clearly laid out.

    • bigyabai a day ago ago

      Because it doesn't work like that. TFA is an explanation of how GPU architecture dictates the featureset that is feasibly attainable at runtime. Throwing more software at the problem would not enable direct competition with CUDA.

      • measurablefunc a day ago ago

        I am assuming that is all part of the specification that the agentic AI is working with & since AGI is right around the corner I think this is a simple enough problem that can be solved with AI.

  • pixelpoet a day ago ago

    The actual article title says "won't"; "wont" is a word meaning habit or proclivity.

    • InvisGhost a day ago ago

      In situations like this, I try to focus on whether the other person understood what was being communicated rather than splitting hairs. In this case, I don't think anyone would be confused.

      • philipallstar 13 hours ago ago

        Probably best to just fix the spelling.

        • Eliovp 10 hours ago ago

          That's what you get when you don't use AI to write an article :p