29 comments

  • fhdsgbbcaA 2 days ago

    Looks like LLM inference will follow the same path as Bitcoin: CPU -> GPU -> FPGA -> ASIC.

    • hackernudes 2 days ago

      I really doubt it. Bitcoin mining is quite fixed, just massive amounts of SHA256. On the other hand, ASICs for accelerating matrix/tensor math are already around. LLM architecture is far from fixed and currently being figured out. I don't see an ASIC any time soon unless someone REALLY wants to put a specific model on a phone or something.

      • YetAnotherNick 2 days ago

        Google's TPU is an ASIC and performs competitively. Tesla and Meta are also building something similar, AFAIK.

        Although I doubt you could get a lot better, as GPUs already have half their die area reserved for matrix multiplication.

        • danielmarkbruce 2 days ago

          It depends on your precise definition of ASIC. The FPGA design here would be analogous to an "MSIC", where the M stands for model.

          Building a chip for a specific model is clearly different from building something like a TPU.

          Maybe we'll start seeing MSICs soon.

          • YetAnotherNick 2 days ago

            LLMs and many other models spend 99% of their FLOPs in matrix multiplication, and the TPU initially supported just a single operation: matrix multiply. Even if an MSIC were 100x better than a GPU at the other operations, it would only be about 1% faster overall.
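
            A quick back-of-the-envelope check of that claim, as a minimal Amdahl's-law sketch in Python (the 99%/1% split and the 100x figure are the hypothetical numbers from the comment above, not measurements):

                # Amdahl's law: overall speedup when only the non-matmul part gets faster.
                matmul_fraction = 0.99    # share of runtime spent in matrix multiplication
                other_fraction = 1.0 - matmul_fraction
                other_speedup = 100.0     # assume the non-matmul ops get 100x faster

                overall = 1.0 / (matmul_fraction + other_fraction / other_speedup)
                print(f"overall speedup: {overall:.4f}x")  # ~1.01x, i.e. about 1% faster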

            • danielmarkbruce 2 days ago

              You can still optimize the various layers of memory for a specific model, make it all 8-bit or 4-bit or whatever you want, maybe burn in a specific activation function, all kinds of stuff.

              No chance you'd only get 1% speedup on a chip designed for a specific model.
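
              For illustration, a minimal Python sketch of one such optimization, symmetric int8 weight quantization (a generic technique, not a description of any particular chip):

                  import numpy as np

                  def quantize_int8(w):
                      """Symmetric per-tensor int8 quantization: int8 weights plus one float scale."""
                      scale = np.abs(w).max() / 127.0
                      q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
                      return q, scale

                  def dequantize(q, scale):
                      return q.astype(np.float32) * scale

                  rng = np.random.default_rng(0)
                  w = rng.standard_normal((4, 4)).astype(np.float32)
                  q, s = quantize_int8(w)
                  print("max abs error:", np.abs(w - dequantize(q, s)).max())  # small relative to the scale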

      • pzo 2 days ago

        Apple has the Neural Engine and it really speeds up many CoreML models. If most operators are implemented on the NPU, inference is significantly faster than on the GPU of my MacBook M2 Max (and it has a similar NPU to the one in, e.g., the iPhone 13). Those ASIC NPUs just implement many of the typical low-level operators used in most ML models.
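
        As a rough illustration of how that selection works, a Python sketch using the coremltools package (the model path is a placeholder, and which operators actually land on the Neural Engine depends on the model):

            import coremltools as ct  # assumes coremltools on a macOS host

            MODEL_PATH = "MyModel.mlpackage"  # hypothetical compiled CoreML model

            # compute_units controls which hardware CoreML may dispatch to:
            # CPU_AND_NE prefers the Neural Engine, CPU_AND_GPU forces the GPU, ALL lets CoreML choose.
            model_ne = ct.models.MLModel(MODEL_PATH, compute_units=ct.ComputeUnit.CPU_AND_NE)
            model_gpu = ct.models.MLModel(MODEL_PATH, compute_units=ct.ComputeUnit.CPU_AND_GPU)

            # Timing model_ne.predict(...) against model_gpu.predict(...) shows whether a
            # given model's operators actually benefit from the NPU.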

      • imtringued 2 days ago

        99% of the time is spent on matrix-matrix or matrix-vector multiplication. Activation functions, softmax, RoPE, etc. cost basically nothing in comparison.

        Most NPUs are programmable, because the bottleneck is data SRAM and memory bandwidth instead of instruction SRAM.

        For classic matrix-matrix multiplication, the SRAM bottleneck is the number of matrix outputs you can store in SRAM: N rows and M columns give you N x M accumulator outputs. The dot products can be split into separate steps without losing the N x M scaling, so the SRAM consumed by the row and column vectors becomes insignificant in the limit.
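
        A minimal numpy sketch of that tiling argument (tile sizes are illustrative): only the N x M accumulator tile has to stay resident, while thin slices of the inputs are streamed through.

            import numpy as np

            def blocked_matmul(A, B, tile_n=4, tile_m=4, tile_k=8):
                """C = A @ B computed tile by tile. Only a tile_n x tile_m accumulator
                block plus thin slices of A and B need to be resident at any one time."""
                N, K = A.shape
                K2, M = B.shape
                assert K == K2
                C = np.zeros((N, M), dtype=A.dtype)
                for i in range(0, N, tile_n):
                    for j in range(0, M, tile_m):
                        acc = np.zeros((min(tile_n, N - i), min(tile_m, M - j)), dtype=A.dtype)
                        # The K dimension is split into steps; the accumulator keeps the N x M scaling.
                        for k in range(0, K, tile_k):
                            acc += A[i:i+tile_n, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_m]
                        C[i:i+tile_n, j:j+tile_m] = acc
                return C

            A = np.random.rand(16, 32).astype(np.float32)
            B = np.random.rand(32, 24).astype(np.float32)
            assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-4)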

        For the MLP layers in the unbatched case, the bottleneck is the memory bandwidth needed to load the model parameters. The problem is therefore how fast your DDR, GDDR, or HBM memory and your NoC/system bus let you transfer data to the NPU.
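
        A rough sketch of that bandwidth bound with illustrative numbers (a hypothetical 7B-parameter model with 8-bit weights; the bandwidths are ballpark figures, not the specs of any particular part):

            # Unbatched decoding streams (almost) all weights from memory for every token,
            # so tokens/s is bounded above by bandwidth / model_bytes.
            params = 7e9           # illustrative 7B-parameter model
            bytes_per_param = 1    # 8-bit weights
            model_bytes = params * bytes_per_param

            bandwidth_gb_s = {"LPDDR5": 100, "GDDR6": 500, "HBM3": 3000}  # ballpark figures

            for name, bw in bandwidth_gb_s.items():
                print(f"{name:7s}: ~{bw * 1e9 / model_bytes:6.1f} tokens/s upper bound")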

        Having a programmable processor control the matrix multiplication unit costs you silicon area for the instruction SRAM. For matrix-vector multiplication, the memory bottleneck is so large that it doesn't matter what architecture you use; even CPUs are fast enough. So there is no demand for getting rid of the comparatively cheap instruction SRAM.

        "but what about the area taken up by the processor itself?"

        Hahaha. Nice joke.

        Wait... you were serious? The area taken up by an in-order VLIW/TTA processor is so insignificant that I jammed it into the routing gap between two SRAM blocks. Sure, the matrix multiplication unit might take up some space, but decoding instructions is such an insignificant cost that anyone opposing programmability must have completely different goals and priorities than LLMs or machine learning.

    • winwang 2 days ago

      As far as I understand, the main issue for LLM inference is memory bandwidth and capacity. Tensor cores are already an ASIC for matmul, and they idle half the time waiting on memory.
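
      That can be made concrete with a roofline-style check (a Python sketch with illustrative, roughly H100-class numbers rather than exact specs):

          # Roofline-style check: is unbatched LLM decode compute-bound or memory-bound?
          peak_flops = 1.0e15        # ~1 PFLOP/s of tensor-core math (illustrative)
          peak_bandwidth = 3.0e12    # ~3 TB/s of HBM bandwidth (illustrative)

          # Matrix-vector decode with 8-bit weights: ~2 FLOPs (mul + add) per weight byte.
          arithmetic_intensity = 2.0                      # FLOP per byte moved
          machine_balance = peak_flops / peak_bandwidth   # FLOP/byte needed to stay busy

          utilization = arithmetic_intensity / machine_balance
          print(f"machine balance: {machine_balance:.0f} FLOP/byte")
          print(f"tensor-core utilization during decode: {utilization:.1%}")  # well under 1%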

    • evanjrowley 2 days ago

      You forgot to place "vertically-integrated unobtanium" after ASIC.

      • namibj 2 days ago

        Soooo.... TPUv4?

        • evanjrowley a day ago

          Yes, but the kinds that aren't on the market.

    • bee_rider 2 days ago

      LLM inference is a small task built into some other program you are running, right? Like an office suite with some sentence suggestion feature, probably a good use for an LLM, would be… mostly office suite, with a little LLM inference sprinkled in.

      So, the “ASIC” here is probably the CPU with, like, slightly better vector extensions. AVX1024-FP16 or something, haha.

      • p1esk 2 days ago

        "would be… mostly office suite, with a little LLM inference sprinkled in."

        No, it would be LLM inference with a little bit of an office suite sprinkled in.

  • bitdeep 2 days ago

    Not sure if you guys know: Groq is already doing this with their ASIC chips. So... they already passed the FPGA phase and are in the ASIC phase.

    The problem is that their costs seem to be 1x to 2x what they are charging.

    • latchkey 2 days ago

      Probably more than 2x...

      "Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.

      Since they didn’t use HBM, you need to stitch enough cards together to get the memory to hold your model. It takes a lot of 256MB cards to get to 64GB, and there isn’t a good way to try the tech out since a single rack really can’t serve an LLM."

      https://news.ycombinator.com/item?id=39966620
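
      The scaling problem in that quote is easy to sanity-check (a sketch using the figures from the quote itself; Groq's actual per-chip SRAM and pricing may differ):

          # Card-count arithmetic using the numbers from the quoted comment.
          sram_per_card_gb = 256 / 1024   # "256MB cards", expressed in GB
          model_memory_gb = 64            # memory needed to hold the model
          cards_needed = model_memory_gb / sram_per_card_gb
          print(f"cards needed: {cards_needed:.0f}")  # ~256 cards just to hold the model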

    • faangguyindia 2 days ago

      Groq is unpredictable: while it might be fast for some requests, it's super slow or fails entirely on others.

      The fastest commercial model is Google's Gemini Flash (predictable speed).

    • qwertox 2 days ago

      The way I see it, one day we'll be buying small LLM cartridges.

  • jsheard 2 days ago

    Is there any particular reason you'd want to use an FPGA for this? Unless your problem space is highly dynamic (e.g. prototyping) or you're making products in vanishingly low quantities for a price-insensitive market (e.g. military), an ASIC is always going to be better.

    There doesn't seem to be much flux in the low level architectures used for inferencing at this point, so may as well commit to an ASIC, as is already happening with Apple, Qualcomm, etc building NPUs into their SOCs.

    • PaulHoule 2 days ago

      (1) Academics can build an FPGA design but not an ASIC; (2) an FPGA is a first step toward an ASIC.

    • wongarsu 2 days ago

      This specific project looks like a case of "we have this platform for automotive and industrial use, running Llama on the dual-core ARM CPU is slow but there's an FPGA right next to it". That's all the justification you really need for a university project.

      Not sure how useful this is for anyone who isn't already locked into this specific architecture. But it might be a useful benchmark or jumping-off point for more useful FPGA-based accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
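
      For context, a minimal Python sketch of why 1.58-bit (ternary) weights suit FPGAs: a ternary-weight matrix-vector product reduces to additions and subtractions, so no multipliers are needed (a generic illustration, not this project's design):

          import numpy as np

          def ternary_matvec(W_ternary, x):
              """y = W @ x where W has entries in {-1, 0, +1} (the "1.58-bit" format).
              Each output is just a sum/difference of selected inputs, so hardware
              needs adders rather than multipliers."""
              pos = (W_ternary == 1).astype(x.dtype)
              neg = (W_ternary == -1).astype(x.dtype)
              return pos @ x - neg @ x

          rng = np.random.default_rng(0)
          W = rng.integers(-1, 2, size=(8, 16))           # ternary weights
          x = rng.standard_normal(16).astype(np.float32)
          assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)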

    • israrkhan 2 days ago

      You can open-source your FPGA designs for wider collaboration with the community. Also, an FPGA is the starting step for making any modern digital chip.

    • someguydave 2 days ago

      Gotta prototype the thing somewhere. If it turns out that the LLM algorithms become pretty mature, I suspect accelerators of all kinds will be baked into silicon, especially for inference.

      • jsheard 2 days ago

        That's the thing though, we're already there. Every new consumer ARM and x86 ASIC is shipping with some kind of NPU, the time for tentatively testing the waters with FPGAs was a few years ago before this stuff came to market.

        • PaulHoule 2 days ago

          But the NPU might be poorly suited to your model or workload, or just poorly designed.

    • danielmarkbruce 2 days ago

      Model architecture changes fast. Maybe it will slow down.

  • KeplerBoy 2 days ago

    Four times as efficient as on the SoC's low-end ARM cores, so presumably many times less efficient than on modern GPUs?

    Not that I was expecting GPU-like efficiency from a fairly small-scale FPGA project. Nvidia engineers have spent thousands of man-years making sure that stuff works well on GPUs.

  • rldjbpin 2 days ago

    As of now there are way too many parallel developments across abstraction layers, hardware and software, to really have the best combo just yet. Even this example targets an older architecture, because certain things just move slower than others.

    But when things plateau, this approach, and then ASICs, will probably be the most efficient way forward for "stable" versions of AI models at inference time.