Apple Silicon GPU Support in Mojo

(forum.modular.com)

153 points | by mpweiher 2 days ago

68 comments

  • lordofgibbons 2 days ago

    How many people here actually write custom CUDA/Triton kernels? An extremely small handful of people write them (and they're all on one Discord server), and their work then gets integrated into PyTorch, Megatron-LM, vLLM, SGLang, etc. The rest of us in the ML/AI ecosystem have absolutely no incentive to migrate off of Python due to network effects, even though I think it's a terrible language for maintainable production systems.

    If Mojo focuses on systems software (and gets rid of exceptions - Chris, please <3), it will be a serious competitor to Rust and Go. It has all the performance and safety of Rust with a significantly easier learning curve.

    • chrislattner 2 days ago

      Mojo doesn't have C++-like exceptions, but it does support throwing. The codegen approach is basically like Go's (where you conceptually return a bool + error), but with Python-style syntax that makes it way more ergonomic than Go.
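
      For a concrete feel, a minimal sketch of that surface syntax (illustrative only; check the current docs for exact details):

      ```mojo
      # A function that can throw declares `raises`; conceptually this lowers
      # to a Go-style error return rather than C++-style stack unwinding.
      fn divide(a: Int, b: Int) raises -> Int:
          if b == 0:
              raise Error("division by zero")
          return a // b

      fn main() raises:
          try:
              print(divide(10, 0))
          except e:
              print("caught:", e)
      ```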

      We have a public roadmap and are charging hard on improving the language; check out https://docs.modular.com/mojo/roadmap/ to learn more.

      -Chris

    • achierius 2 days ago

      Plenty of people do, many more than are in that server -- I asked some of my former coworkers and none knew about it, but we all spent a whole lot of hours tuning CUDA kernels together :). You have one perspective on this sector, but it's not the only one!

      Some example motivations:

      - Strange synchronization/coherency requirements

      - Working with new hardware / new strategies that Nvidia&co haven't fine-tuned yet

      - Just wanting to squeeze out some extra performance

    • latemedium 2 days ago

      I think part of the reason why just a few people write custom CUDA/Triton kernels is that it's really hard to do well. Languages like Mojo aim to make that much easier, and so hopefully more people will be able to write them (and do other interesting things with GPUs that are too technically challenging right now).

      • dogma1138 a day ago

        The only question is whether there is any benefit to writing your own kernels in something like Mojo versus skipping that part altogether and using the primitives that frameworks like Torch provide, which are backed by already highly optimized kernels, especially when it comes to performance.

    • pzo a day ago

      If Mojo is a superset of Python the same way TypeScript is a superset of JavaScript, I would really be willing to use Mojo, even if just for the option of mobile platform support. Right now Python is trying to support iOS and Android, but it will take years before people manage to compile even some of the ecosystem.

    • adastra22 2 days ago

      Probably tens of thousands of people. You do know that CUDA is used for more than just AI/ML?

      • pjmlp 2 days ago

        I guess given all the hype, people tend to forget what GPGPU is used for; it is like the common meme of "why CUDA when there is PyTorch?"

        • dogma1138 a day ago

          Non-ML GPGPU uses also have frameworks/libraries that provide primitives built on top of already highly optimized kernels.

          • adastra22 a day ago

            Only for some use cases. Many of us don't.

            • dogma1138 a day ago

              Outside of novel research, what field doesn't have this? Not challenging your assertion, just curious.

              • adastra22 a day ago

                Novel research. That in itself is a large “field.” Lots of people in academia and industry exploring new use cases and developing new libraries.

    • huevosabio 2 days ago

      Which Discord server? I want in!

  • behnamoh 2 days ago

    Signing up to try a programming language (Mojo) is as bad as logging in to your terminal before using it (Warp).

    • timmyd 2 days ago

      Co-founder here. There isn't any signup - that was 2+ years ago, and we've been iterating a lot with the community and listening to feedback - which has been wonderful. Go freely and install with pip, uv, Pixi, etc. -> https://docs.modular.com/mojo/manual/install

  • rubymamis 2 days ago

    I'm very excited for Mojo - more about the programming language itself than all the ML stuff.

  • Archit3ch 2 days ago

    Using this in Julia since 2022. :D

    • bahmboo 2 days ago

      I would be truly interested if you could expand on this. I know I can do my own research, but I'm starting down the path of what could be called performance Python or something similar, and real-world stories help.

      • Archit3ch 2 days ago

        My use case is realtime audio processing (VST plugins).

        Metal.jl can be used to write GPU kernels in Julia to target an Apple Silicon GPU. Or you can use KernelAbstractions.jl to write once in a high-level CUDA-like language to target NVIDIA/AMD/Apple/Intel GPUs. For best performance, you'll want to take advantage of vendor-specific hardware, like Tensor Cores in CUDA or Unified Memory on Mac.

        You also get an ever-expanding set of Julia GPU libraries. In my experience, these are more focused on the numerical side rather than ML.

        If you want to compile an executable for an end user, that functionality was added in Julia 1.12, which hasn't been released yet. Early tests with the release candidate suggest that it works, but I would advise waiting to get a better developer experience.

        • larme 2 days ago

          I'm very interested in this field (realtime audio + GPU programming). How do you deal with the latency? Do you send single or multiple vectors/buffers to the GPU?

          Also, since samples in one channel need to be processed sequentially, does that mean mono audio processing won't benefit much from GPU programming? Or maybe you are dealing with spectral signal processing?

          • Archit3ch a day ago

            Yes, I process per-buffer, same as on CPU.

            You need to find parallelism somewhere to make it worth it. This can be multiple independent channels/voices, one large simulation, one high quality simulation, a large neural network, solving PDEs, voxel simulation (https://www.youtube.com/watch?v=1bS7sHyfi58), additive synthesis, a multitude of FFTs...

            • larme a day ago

              Thanks for the answers!

  • timmg 2 days ago

    I (vaguely) think the Mojo guys' goal makes a lot of sense. And I understand why they thought Python was the way to start.

    But I just think Python is not the right language to try to turn into this super-optimized parallel processing system they are trying to build.

    But their target market is Python programmers, I guess. So I'm not sure what a better option would be.

    It would be interesting for them to develop their own language and make it all work. But "yet another programming language" is a tough sell.

    • cactusfrog 2 days ago

      What language do you think they should have based Mojo off of? I think Python syntax is great for tensor manipulation.

      • fluidcruft a day ago

        I wouldn't mind a Python flavor with a syntax for tensors/matrices that is a bit less bolted on in parts vs Matlab. You get used to Python and numpy's quirks, but it is a bit jarring at first.

        Octave has a very nice syntax (it extends Matlab's syntax to provide the good parts of numpy broadcasting). I assume Julia uses something very similar to that. I have wanted to work with Julia, but it's so frustrating to have to build so much of the non-interesting stuff that just exists in Python. And back when I looked into it, there didn't seem to be an easy way to just plug Julia into Python things and incrementally move over. Like you couldn't swap out the numerics and keep the matplotlib things you already had. You had to go learn Julia's ways of plotting and doing everything. It would have been nice if there were an incremental approach.

        One thing I am on the fence about is indexing with '()' vs '[]'. In Matlab, both function calls and indexing use '()', which is Fortran style (the ambiguity lets you swap functions for matrices to reduce memory use, but that's all possible with '[]' in Python), and that can sometimes be nice. Anyway, with something like Mojo you're wanting to work directly with indices again, and I haven't done that in a long time.

        Ultimately I don't think anyone would care if Mojo and Python just play nicely together with minimal friction. (Think: "hey, run this Mojo code on these numpy blobs.") If I can build GUIs, interact with the OS, parse files, and interact with the web in Python to prep data while simultaneously crunching in Mojo, that seems wonderful.

        I just hate that Julia requires immediately learning all the dumb crap that doesn't matter to me. Although it seems like LLMs are very good at the dumb crap, so some sort of LLM translation for the dumb crap could be another option.

        In summary: all Mojo actually needs is to be better than numba- and cython-type things, with performance that at least matches C++, Fortran, and the GPU libraries. Once that happens, things like a Mojo version of pandas will be developed (and will replace things like Polars).

    • golly_ned 2 days ago

      The syntax is based on Python, but the runtime is not. So there is nothing inconsistent about the contrast between the Python language and Mojo's use as a super-optimized parallel processing system.

    • pjmlp 2 days ago

      This is attempt number two; it was already tried before with Swift for TensorFlow.

      Guess why it wasn't a success, or why Julia is having adoption issues among the same community.

      Or why, although Zig has basically Modula-2's type system, it is being hyped more than Modula-2 ever was since 1978 (Modula-2 is even part of GCC nowadays).

      Syntax and familiarity matters.

      • a96 a day ago

        I think the only Zig hype I'm seeing is about its compiler and compatibility. Those might well be the same two reasons why you never hear about modula-2.

        • pjmlp a day ago

          I am older than Modula-2, so I have heard a lot; many of the folks hyping Zig still think the world started with UNIX.

    • ziofill 2 days ago

      Exactly, the idea of not having to learn yet another new language is very compelling.

    • mempko 2 days ago

      Except by all accounts they succeeded. I believe they have the fastest matmul on Nvidia chips in the industry.

  • GeekyBear 2 days ago

    I'm interested to see how this shakes out now that they are well past the proof-of-concept stage and have something that can replace CUDA on Nvidia hardware without nerfing performance, in addition to other significant upsides.

    Just the notion of replacing the parts of LLVM that force it to remain single threaded would be a major sea change for developer productivity.

  • lqstuart 2 days ago

    I like Chris Lattner but the ship sailed for a deep learning DSL in like 2012. Mojo is never going to be anything but a vanity project.

    • growthwtf 2 days ago

      Nah. There's huge alpha here, as one might say. I feel like this comment could age even more poorly than the infamous dropbox comment.

      Even with JAX, PyTorch, HF Transformers, whatever you want to throw at it -- the DX for cross-platform GPU programming that is compatible with large language models' requirements specifically is extremely bad.

      I think this may end up being the most important thing that Lattner has worked on in his life (and yes, I am aware of his other projects!).

      • lqstuart 2 days ago

        Comments like this view the ML ecosystem in a vacuum. New ML models are almost never written - all LLMs, for example, are basically GPT-2 with extremely marginal differences - and the algorithms themselves are the least of the problem in the field. The 30% improvements you get from kernels and compiler tricks are absolutely peanuts compared to the 500%+ improvements you get from upgrading hardware, adding load balancing and routing, KV and prefix caching, optimized collective ops, etc. On top of that, the difficulty of even just migrating Torch to the C++11 ABI to access fp8 optimizations is nigh insurmountable in large companies.

        I say the ship sailed in 2012 because that was around when it was decided to build TensorFlow around legacy data infrastructure at Google rather than developing something new, and the rest of the industry was hamstrung by that decision (along with the baffling declarative syntax of TensorFlow, and the requirement to use Blaze to build it, which precluded meaningful development outside of Google).

        The industry was so desperate to get away from it that they collectively decided that downloading a single giant library with every model definition under the sun baked into it was the de facto solution to loading Torch models for serving, and today I would bet you that easily 90% of deep learning models in production revolve around either TensorRT, or a model being plucked from Huggingface’s giant library.

        The decision to halfass machine learning was made a long time ago. A tool like Mojo might work at a place like Apple that works in a vacuum (and is lightyears behind the curve in ML as a result), but it just doesn’t work on Earth.

        If there’s anyone that can do it, it’s Lattner, but I don’t think it can be done, because there’s no appetite for it nor is the talent out there. It’s enough of a struggle to get big boy ML engineers at Mag 7 companies to even use Python instead of letting Copilot write them a 500 line bash script. The quality of slop in libraries like sglang and verl is a testament to the futility of trying to reintroduce high quality software back into deep learning.

        • chrislattner 2 days ago

          Thank you for the kind words! Are you saying that AI model innovation stopped at GPT-2 and everyone has performance and GPU utilization figured out?

          Are you talking about NVIDIA Hopper, or any of the rest of the accelerators people care about these days? :) We're talking about a lot more performance and TCO at stake than with traditional CPU compilers.

          • lqstuart 2 days ago

            I’m saying actual algorithmic (as in not data) model innovation has never been a significant part of the revenue generation in the field. You get your random forest, or ResNet, or BERT, or MaskRCNN, or GPT-2-with-One-Weird-Trick, and then you spend four hours trying to figure out how to preprocess your data.

            On the flipside, far from figuring out GPU efficiency, most people with huge jobs are network bottlenecked. And that’s where the problem arises: solutions for collective comms optimization tend to explode in complexity because, among other reasons, you now have to package entire orchestrators in your library somehow, which may fight with the orchestrators that actually launch the job.

            Doing my best to keep it concise, but Hopper is a good case study. I want to use Megatron! Suddenly you need FP8, which means the CXX11 ABI, which means recompiling Torch along with all those nifty toys like flash attention, flashinfer, vllm, whatever. Ray, jsonschema, Kafka, and a dozen other things also need to match the same glibc and libstdc++ versions. So using that as an example, suddenly my company needs C++ CI/CD pipelines, dependency management, etc. when we didn't before. And I just spent three commas on these GPUs. And most likely, I haven't made a dime on my LLMs, or autonomous vehicles, or weird cyborg slavebots.

            So what all that boils down to is just that there’s a ton of inertia against moving to something new and better. And in this field in particular, it’s a very ugly, half-assed, messy inertia. It’s one thing to replace well-designed, well-maintained Java infra with Golang or something, but it’s quite another to try to replace some pile of shit deep learning library that your customers had to build a pile of shit on top of just to make it work, and all the while fifty college kids are working 16 hours a day to add even more in the next dev release, which will of course be wholly backwards and forwards incompatible.

            But I really hope I’m wrong :)

            • growthwtf a day ago

              Lattner's comment aside (which I'm fanboying a little bit at), I do tend to agree with your pessimism/realism for what it's worth. It's gonna be a long long time before that whole mess you're describing is sorted out, but I'm confident that over the next decade we will do it. There's just too much money to be made by fixing it at this point.

              I don't think it's gonna happen instantly, but it will happen, and Mojo/Modular are really the only language platform I see taking a coherent approach to it right now.

              • lqstuart 5 hours ago

                I tend to agree with you, but I hoped the field would start collectively figuring out how to be big boys with CI/CD and dependency management back in 2017 - I thought Google's awkward source release of BERT was going to be the low point, and we'd switch to Torch and be saved. Instead, it's gotten so much worse. And the kind of work that the Python core team has been putting into package and dependency management is nothing short of heroic, and it still falls short, because PyTorch extends the Python runtime itself, and now there's torch.compile intercepting Py_FrameEval and NVIDIA releasing Python CUDA bindings.

                It’s just such a massive, uphill, ugly moving target to try to run down. And I sit here thinking the same as many of these comments—on the one hand, I can’t imagine we’re still using Python 3 in 2035? 2050?? But on the other hand I can’t envision a path leading out of the mess making money, or at least continue pretending they’ll start to soon.

        • wolvesechoes 2 days ago

          And comments like this forget that there is more to AI and ML than just LLMs or even NNs.

    • epistasis 2 days ago

      Pytorch didn't even start until 2016, taking a lot of market share from Tensorflow.

      I don't know if this is a language that will catch on, but I guarantee there will be another deep learning focused language that catches on in the future.

    • pjmlp 2 days ago

      Now that Nvidia has finally gotten serious with Python tooling and JIT compilers for CUDA, I also see it becoming harder, and those tools I can use natively on Windows, instead of having to be in WSL land.

    • atty 2 days ago

      To be fair, Triton is in active use, and this should be even more ergonomic for Python users than Triton. I don't think it's a sure thing, but I wouldn't say it has zero chance either.

    • golly_ned 2 days ago

      Tritonlang itself is a deep learning DSL.

    • erichocean a day ago

      You could have said the same about MLX on Apple Silicon, yet here we are.

    • rvz 2 days ago

      > I like Chris Lattner but the ship sailed for a deep learning DSL in like 2012.

      Nope. There's certainly room for another alternative that's more performant and portable than the rest, without the hacks needed to get there.

      Maybe you caught the wrong ship, but Mojo is a speedboat.

      > Mojo is never going to be anything but a vanity project.

      Will come back in 10 years and we'll see if your comment needs to be studied like the famous Dropbox one.

      • adastra22 2 days ago

        Any actual reasoning for that claim?

  • roansh 2 days ago

    Apologies for a noob (and off-topic) question, but what stops Apple from competing with Nvidia?

  • rvz 2 days ago

    We need a Pythonic language that is compatible with the Python ecosystem, designed for machine learning use cases, compiles directly to an executable with direct, specialized access to the low-level GPU cores, and is as fast as Rust.

    The closest to that is Mojo, which borrows many of Rust's ideas and has built-in type safety, with the aim of being compatible with the existing Python ecosystem, which is great.
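
    For a rough flavor of that Pythonic-but-typed combination, a small illustrative sketch (not authoritative; exact syntax may differ across current Mojo releases):

    ```mojo
    # `fn` functions have statically checked argument and return types,
    # while the surface syntax stays close to Python.
    fn dot(a: List[Float64], b: List[Float64]) -> Float64:
        var n = min(len(a), len(b))
        var acc: Float64 = 0.0
        for i in range(n):
            acc += a[i] * b[i]
        return acc

    def main():
        print(dot(List[Float64](1.0, 2.0), List[Float64](3.0, 4.0)))
    ```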

    I've never heard a sound argument against Mojo and continue to see the weakest arguments that go along the lines of:

    "I don't want to learn another language"

    "It will never take off because we don't need another deep learning DSL"

    "It's bad that a single company owns the language just like Google and Golang, Microsoft and C# and Apple and Swift".

    Well, I prefer tools that are extremely fast, save time, and make lots of money, instead of spinning up hundreds of costly VMs as the solution. If Mojo excels in performance and reduces cost, then I'm all for that; even better if it achieves Python compatibility.

    • krzat 2 days ago

      In an alternative reality, Chris invented Mojo at Apple (instead of Swift).

      If one language was used for iOS apps and gpu programming, with some compatibility with python, it would be pretty neat.

    • Archit3ch 2 days ago

      The argument against Mojo is that it replaces CUDA (that you get for free with the hardware) with something that you need to license.

      By itself, that's not so bad. Plenty of "buy, don't build" choices out there.

      However, every other would-be Mojo user also knows that. And they don't want to build on top of an ecosystem that's not fully open.

      Why don't Mathematica/MATLAB have pytorch-style DL ecosystems? Because nobody in their right mind would contribute for free to a platform owned by Wolfram Research or Mathworks.

      I'm hopeful that Modular can navigate this by opening up their stack.

      • yowlingcat 2 days ago

        I really want to like Mojo, but you nailed what gives me pause. Not to take the anecdotal example of Polars too far, but I get the sense that the current gravity in Python, for net-new stuff that needs to be written outside Python (obviously a ton of highly performant numpy/scipy/pytorch ecosystem stuff aside), is for it to be written in Rust when necessary.

        Not an expert, but though I wouldn't be surprised if Mojo ends up being a better language than Rust for the use case we're discussing, I'm not confident it will ever catch up to Rust in ecosystem and escape velocity as a sane general purpose compiled systems language. It really does feel like Rust has replaced C++ for net new buildouts that would've previously needed its power.

      • GeekyBear 2 days ago

        > The argument against Mojo is that it replaces CUDA (that you get for free with the hardware) with something that you need to license.

        You realize that CUDA isn't open source or planned to be open source in the future, right?

        Meanwhile parts of Mojo are already open source with the rest expected to be opened up next year.

        • deagle50 2 days ago

          parent said free, not open source. I want Mojo to succeed, but I'm also doubtful of the business model.

          • GeekyBear 2 days ago

            Do you get a functional version of CUDA free with AMD's much more reasonably priced hardware?

            Mojo is planned to be both free and open source by the end of next year and it's not vendor locked to extremely expensive hardware.

            • pjmlp 2 days ago

              To take full advantage of Mojo you will need Modular's ecosystem, and they need to pay the VCs back somehow.

              Also, as of today, anything CUDA works out of the box on Windows; Mojo might eventually work outside WSL, some day.

              • GeekyBear a day ago

                Commercial use of Mojo on Nvidia hardware is already free today.

                There is no disadvantage vs CUDA.

                • pjmlp a day ago

                  A language without an ecosystem isn't that interesting.

    • timeon 16 hours ago

      > "It's bad that a single company owns the language just like Google and Golang, Microsoft and C# and Apple and Swift".

      I do not think that is the same as being VC-backed. Google/Microsoft/Apple need those languages for their ecosystem/infrastructure. The danger there is "just" vendor lock-in. With a VC-backed language, there is also the possibility of enshittification.