On Ubuntu 24.04 (and Debian Unstable¹), the OS-provided packages should be able to get llama.cpp running on ROCm on just about any discrete AMD GPU from Vega onwards²³⁴. No docker or HSA_OVERRIDE_GFX_VERSION required. The performance might not be ideal in every case⁵, but I've tested a wide variety of cards:
# install dependencies
sudo apt -y update
sudo apt -y upgrade
sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential
# ensure you have permissions by adding yourself to the video and render groups
sudo usermod -aG video,render $USER
# log out and then log back in to apply the group changes
# you can run `rocminfo` and look for your GPU in the output to check everything is working thus far
# download a model, build llama.cpp, and run it
wget "https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true" -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b3267
HIPCXX=clang-17 cmake -H. -Bbuild -DGGML_HIPBLAS=ON -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" -DCMAKE_BUILD_TYPE=Release
make -j16 -C build
build/bin/llama-cli -ngl 32 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf --prompt "Once upon a time"
I'd suggest RDNA 3, MI200 and MI300 users should probably use the AMD-provided ROCm packages for improved performance. Users that need PyTorch should also use the AMD-provided ROCm packages, as PyTorch has some dependencies that are not available from the system packages. Still, you can't beat the ease of installation or the compatibility with older hardware provided by the OS packages.
¹ https://lists.debian.org/debian-ai/2024/07/msg00002.html
² Not including MI300, because it was released too close to the Ubuntu 24.04 launch.
³ Pre-Vega architectures might work, but have known bugs for some applications.
⁴ Vega and RDNA 2 APUs might work with Linux 6.10+ installed. I'm in the process of testing that.
⁵ The version of rocBLAS that comes with Ubuntu 24.04 is a bit old and therefore lacks some optimizations for RDNA 3. It's also missing some MI200 optimizations.
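A quick way to confirm the group step above took effect before building anything. This is only a sketch: the function name `has_gpu_groups` is made up for illustration, and it checks nothing beyond membership in the `video` and `render` groups named in the guide.

```shell
# Hypothetical helper: succeeds if the space-separated group list in $1
# contains both "video" and "render" (the groups added with usermod above).
has_gpu_groups() {
    case " $1 " in
        *" video "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" render "*) return 0 ;;
        *) return 1 ;;
    esac
}

# `id -nG` prints the group names active in the current login session.
if has_gpu_groups "$(id -nG)"; then
    echo "video and render groups are active"
else
    echo "groups not active yet: log out and log back in"
fi
```

If it reports the groups as inactive right after running usermod, that is expected until the next login.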
My understanding is that ROCm contains all included kernels for each supported architecture, so it would have (made up):
-- matrix multiply 2048x2048 for Navi 31,
-- same for Navi 32,
-- same for Navi 33,
-- same for Navi 21,
-- same for Navi 22,
-- same for Navi 23,
-- same for Navi 24, etc.
-- matrix multiply 4096x4096 for Navi 31,
-- ...
Correct. Although, you wouldn't find Navi 22, 23 or 24 in the list because those particular architectures are not supported. Instead, you'd see Vega 10, Vega 20, Arcturus, Aldebaran, Aqua Vanjaram and sometimes Polaris.
We're working on a few different strategies to reduce the binary size. It will get worse before it gets better, but I think you can expect significant improvements in the future. There are lots of ways to slim the libraries down.
I'm all for having more open source projects, but I do not see how it can be useful in this ecosystem, especially for people with newer AMD GPUs (not supported in this project) which are already supported in most popular projects?
we have vllm in certain production instances, and it is a pain on most non-nvidia architectures. After a bit of digging around we realized that most of it is just a wrapper on top of pytorch function calls. If we can do without the batch processing that vllm supports, we are fine, and that is what we did here.
driver mismatch issues, we mostly use publicly available instances, so the drivers change as the instances change, according to their base image. Not saying it won't work, but it was more painful to figure out vllm, than to write a simple inference script and do it ourselves.
For inference, if you have a supported card (or probably architecture if you are on Linux and can use HSA_OVERRIDE_GFX_VERSION), then you can probably run anything with (upstream) PyTorch and transformers. Also, compiling llama.cpp has been pretty trouble-free for me for at least a year.
(If you are on Windows, there is usually a win-hip binary of llama.cpp in the project's releases or if things totally refuse to work, you can use the Vulkan build as a (less performant) fallback).
Having more options can't hurt, but ROCm 5.4.2 is almost 2 years old, and things have come a long way since then, so I'm curious about this being published freshly today, in October 2024.
BTW, I recently went through and updated my compatibility doc (focused on RDNA3) w/ ROCm 6.2 for those interested. A lot has changed just in the past few months (upstream bitsandbytes, upstream xformers, and Triton-based Flash Attention): https://llm-tracker.info/howto/AMD-GPUs
i also have been playing with inference on the amd 7900xtx, and i agree. there are no hoops to jump through these days. just make sure to install the rocm version of torch (if using a1111 or similar, don't trust requirements.txt), as shown clearly on the pytorch homepage. obsidian is a similar story. hip is straightforward, at least on arch and ubuntu (fedora still requires some twiddling, though). i didn't realize xformers is also functional! that's good news.
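To make the "don't trust requirements.txt" advice concrete, here is a rough sketch of one way to do it: filter the torch packages out of a project's requirements file, install those separately from a ROCm wheel index, then install the rest. The helper name `strip_torch_pins` is made up, and the index URL in the comment is an assumption; copy the exact command from the selector on the pytorch homepage for your ROCm version.

```shell
# Hypothetical helper: print the requirements file given as $1 with any
# torch, torchvision or torchaudio lines removed, so a later install of the
# remaining packages does not pull in a non-ROCm build of torch.
strip_torch_pins() {
    grep -vE '^[[:space:]]*(torch|torchvision|torchaudio)([^A-Za-z0-9_.-]|$)' "$1" || true
}

# Intended usage (not run here; the index URL is an assumption):
#   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
#   strip_torch_pins requirements.txt > requirements.notorch.txt
#   pip install -r requirements.notorch.txt
```

Packages whose names merely start with "torch" (torchmetrics, for example) are left in place, since only the three exact names are filtered.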
It would be great if you included a section on running with Docker on Linux. The only one that worked out of the box was Ollama, and it had an example. https://github.com/ollama/ollama/blob/main/docs/docker.md
has a docker image but no examples to run it https://github.com/ggerganov/llama.cpp/blob/master/docs/dock...
has a docker image but no examples to run it https://github.com/LostRuins/koboldcpp?tab=readme-ov-file#do...
docker image was broken for me on 7800xt running rhel9 https://github.com/Atinoda/text-generation-webui-docker
good feedback thanks, would you be able to open an issue
this repo? https://github.com/AUGMXNT/llm-tracker.info-vault/issues
I think fazkan was confused about which repo you were talking about. For the llm-tracker doc, that's something I maintain. It's based on stuff I test but if you want to submit a PR or issue w/ info in a way that I can verify then I'm happy to add a Docker section.
haha, I was a bit confused, but I was referring to this one https://github.com/slashml/amd_inference. But the comment applies to other repos as well, do open issues in them, helps the maintainers prioritize features.
related: https://www.nonbios.ai/post/deploying-large-405b-models-in-f...
tldr: uses the latest rocm 6.2 to run full precision inference for llama 405b on a single node 8 x MI300x AMD GPU
How mature do you think the ROCm 6.2 / AMD stack is compared to Nvidia?
this uses vllm?
Yes.
The rise of generated slop ml libraries is staggering.
This library is 50% print statements. And where it does branch, it doesn't even need to.
Defines two environment variables and sets two flags on torch.
I also had to go to therapy to cure myself of the misunderstanding that data scientists and machine learning folks are software engineers, and expecting the same work product from those disparate audiences only raises your blood pressure
Expectation management is a huge part of any team/organization, I think
They can be the same or different, given how you define them. People throw these words around with little thought, especially ones superficial to or outside the field.
I wouldn't disparage an entire field for lack of a clear definition in the buzzwords people use to refer to it.
As suggested by my original comment, I read the source of a lot of the open source projects around ML.
mdaniel is absolutely correct. They are not software engineers.
I'm not vocal about the naïve stuff, poor design, sloppy formatting, bad english. I am vocal about projects that have no place in the ecosystem.
I thought you were being overly harsh until I looked at the repo. You're not kidding, there's very little to it.
While I see where you are coming from, these are the types of comments that keep people from sharing their code, contributing to OSS or continuing to program in general.
It seems to use an old, 2 year old version of ROCm (5.4.2) which I'm doubtful would support my RX 7900 XTX. I personally found it easiest to just use the latest `rocm/pytorch` image and run what I need from there
The RX 7900 XTX (gfx1100) was first enabled in the math libraries (e.g. rocBLAS) for ROCm 5.4, but I don't think the AI libraries (e.g. MIOpen) had it enabled until ROCm 5.5. I believe the performance improved significantly in later releases, as well.
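If you are unsure which gfx target your card reports (gfx1100 for the RX 7900 XTX, as mentioned above), something like this pulls the names out of `rocminfo` output. It is only a sketch: `list_gfx_targets` is a made-up name, and it simply collects every string that looks like a gfx target rather than parsing the output properly.

```shell
# Hypothetical helper: read rocminfo-style text on stdin and print each
# distinct gfx target name (for example gfx1100) once, sorted.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Only query the hardware if rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | list_gfx_targets
fi
```

An empty result usually means the runtime does not see a GPU at all, which is worth sorting out before worrying about ROCm versions.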
I was able to install (AMD provided) ROCm and Ollama on Ubuntu 22.04.5 with an RX 7900 XTX with no real problems to speak of, and I can execute LLMs using Ollama on ROCm just fine. Take that FWIW.
are there AMD cards with more than 24GB VRAM on the market right now at consumer friendly prices?
The Radeon Pro W6800, W7800 or W7900 would be the standard answer. A hacker-spirited alternative would be to purchase a used MI50, MI60 or MI100 and 3d print a fan adapter. There are versions of all of those cards with 32GB of VRAM and they can be found on ebay for between 350 USD and 1200 USD. Plus twenty bucks for a fan adapter and a fan.
Those old gfx906 or gfx908 cards are more competitive for fp64 than for low-precision AI workloads, but they have the memory and the price is right. I'm not sure I would recommend the hacker approach to the average user, but it is what I've done for some of the continuous integration servers I host for the Debian project.
Amazon prices:
$3,600 - 61 TFLOPS - AMD Radeon Pro W7900
$4,200 - 38.7 TFLOPS - NVidia RTX A6000 48GB Ampere
$7,200 - 91.1 TFLOPS - NVidia RTX A6000 48GB Ada
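For a rough dollars-per-TFLOPS comparison of the three cards listed, the arithmetic can be done with awk. This is just a sketch over the numbers quoted above; the list does not say which precision the TFLOPS figures refer to, so treat the ratios as indicative only. The function name and the pipe-separated input format are made up for the example.

```shell
# Hypothetical helper: read "name|price_usd|tflops" lines on stdin and
# print the price per TFLOPS for each line, rounded to two decimals.
price_per_tflops() {
    LC_ALL=C awk -F'|' '{ printf "%s: %.2f USD per TFLOPS\n", $1, $2 / $3 }'
}

# Figures copied from the Amazon price list above.
price_per_tflops <<'EOF'
AMD Radeon Pro W7900|3600|61
NVidia RTX A6000 48GB Ampere|4200|38.7
NVidia RTX A6000 48GB Ada|7200|91.1
EOF
```

That works out to roughly 59, 109 and 79 USD per TFLOPS respectively.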
It sort of depends on how you define "consumer friendly prices". AFAIK, in the $1000 - "slightly over or under $1000" range, 24GB is all you can get. But there are Radeon Pro boards with 32GB or 48GB of RAM for various prices between around $2000 to about $3500. So not "cheap" but possibly within reach for a serious hobbyist who doesn't mind spending a little bit more.
It has been like 8 months since I got a Ryzen 8700G with NPU just for the purpose of inferencing NN, and so far the only acceleration I'm getting is through vulkan on the iGPU, not the NPU (I'm using Linux only). On the bright side, with 64GB of RAM I had no issues with trying models over 32GB. Kudos to llama.cpp for supporting a vulkan backend!
You should have ROCm/HIP support on the iGPU as well, be sure to compile llama.cpp w/ the LLAMA_HIP_UMA=1 flag. If you take a look at https://github.com/amd/RyzenAI-SW you can see there's a fair amount of software to play with on the NPU now, but Phoenix is only 16 TOPS, so I've never bothered testing it.
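A sketch of what the UMA build might look like with cmake. Note the assumptions: the comment above gives the option as LLAMA_HIP_UMA, while the b3267 build command elsewhere in this thread uses the GGML_ prefix (GGML_HIPBLAS), so the prefix depends on which llama.cpp checkout you have. The helper below only prints a configure line for either spelling; check the option names in your tree's CMakeLists.txt before relying on it.

```shell
# Hypothetical helper: print (not run) a cmake configure line for a HIP
# build with the UMA option turned on.
# $1 is the option prefix: "GGML" for trees like b3267, "LLAMA" for older ones.
hip_uma_configure_cmd() {
    prefix="${1:-GGML}"
    printf 'cmake -S . -B build -D%s_HIPBLAS=ON -D%s_HIP_UMA=ON -DCMAKE_BUILD_TYPE=Release\n' "$prefix" "$prefix"
}

hip_uma_configure_cmd GGML
```

For an APU you would also want to add your own gfx target via CMAKE_HIP_ARCHITECTURES, as in the longer build command elsewhere in the thread.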
So, this is all I needed to add to NixOS workstation:
I almost tried to install AMD rocm a while ago after discovering the simplicity of llamafile.
I don't understand how 36 GB can be justified for what amounts to a GPU driver.
So no doubt modern software is ridiculously bloated, but ROCm isn't just a GPU driver. It includes all sorts of tools and libraries as well.
By comparison, if you go and download the CUDA toolkit as a single file, you get a download file that's over 4GB, so quite a bit larger than the download size you quoted. I haven't checked how much that expands to (it seems the ROCm install has a lot of redundancy given how well it compresses), but the point is, you get something that seems insanely large either way.
I suspected that, but any binaries being that large just seems wrong. I mean, the whole thing is 35 times larger than my entire OS install.
Do you know what is included in ROCm that could be so big? Does it include training datasets or something?
Here's the big files in my /opt/rocm/lib which is most of it:
The biggest one, just to pick on one, is hipblaslt, which is "a library that provides general matrix-matrix operations. It has a flexible API that extends functionalities beyond a traditional BLAS library, such as adding flexibility to matrix data layouts, input types, compute types, and algorithmic implementations and heuristics." https://github.com/ROCm/hipBLASLt
These are mostly GPU kernels that by themselves aren't so big, but they exist for every single operation x every single supported graphics architecture, eg:
Ok so like four of those files literally just do matrix multiplications
"just"
Ok some of them do tensor contractions too my bad
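For anyone who wants to see which files dominate their own install, a listing of the largest files can be produced along these lines. This is a sketch, not the command used for the listing discussed above; `largest_files` is a made-up name, and running `wc -c` once per file is slow but portable.

```shell
# Hypothetical helper: print the N largest regular files under a directory
# as "<bytes> <path>", biggest first. $1 is the directory, $2 is N (default 10).
largest_files() {
    dir="$1"
    count="${2:-10}"
    find "$dir" -type f -exec wc -c {} \; | sort -rn | head -n "$count"
}

# Intended usage (path taken from the comment above):
#   largest_files /opt/rocm/lib 10
```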
You can look us up at https://github.com/zml/zml, we fix that.
Wait, looking at that link I don't see how it avoids downloading CUDA or ROCM. Do you use MLIR to compile to GPU without using the vendor provided tooling at all?
We do use ROCm and CUDA. Only we sandbox it with the model and download only the needed parts which are about 1/10th of the size.
GPU drivers are complete OSes that run on the GPUs now.
It's not just you; AMD manages to completely shit-up the Linux kernel with their drivers: https://www.phoronix.com/news/AMD-5-Million-Lines
> Of course, much of that is auto-generated header files... A large portion of it with AMD continuing to introduce new auto-generated header files with each new generation/version of a given block. These verbose header files has been AMD's alternative to creating exhaustive public documentation on their GPUs that they were once known for.
There have been talks about moving those headers to a separate repo and only including the needed headers upstream[1]
[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/3636
OpenBSD, too.
This seems to be some AI generated wrapper around a wrapper of a wrapper.
> # Other AMD-specific optimizations can be added here
> # For example, you might want to set specific flags or use AMD-optimized libraries
What are we doing here, then?
its just a big requirements file, and a dockerfile :) the rest are mostly helper scripts.
What's the best bang-for-your-buck AMD GPU these days? I just bought 2 used 3090s for $750ish refurb'd on eBay. Curious what others are using for running LLMs locally.
I bought an MI100 recently for $650. 32GB of HBM2 and it performs around 0-5% faster than a 3090 on the default flash attention 2 benchmarks. Performance on actual applications can be mixed though, as many are not well optimised for CDNA's matrix cores - even where work has been done for RDNA, which is not that often, it doesn't necessarily carry over. It's also frustrating when efforts to improve performance get turned back by maintainers: llama.cpp closing a PR for flash attention on AMD because the requisite (header-only) lib is supposedly adding an unneeded dependency (https://github.com/ggerganov/llama.cpp/pull/7011).
There's also a few tricks/updates I'd like to try which may improve performance, e.g. hipblaslt support being added next rocm release - of course these are "maybes".
To give you a rough idea of practical performance, default SDXL with xformers is around 4.5-5it/s (between 3090 and 4090 from my understanding), and exllamav2 with qwen 72B at 3bpw is around 7t/s (slower than a 3090, though a 3090 has to use a lower precision to fit).
As others have pointed out, I can't really see what this project offers for AMD users over existing options like llama.cpp, exllamav2, mlc-ai, etc. Most projects work relatively easily these days.
Personal experience: It's not even worth it. AMD (i)GPU breaks with every pytorch, ROCm, xformers, or ollama update. You'll sleep more comfortably at night.
When dealing with ROCM, it's critical that once you have a working configuration, you freeze everything in place (except your application). Docker is one way to achieve this if your host machine is subject to kernel or package updates
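One way to make the "freeze everything" advice stick with Docker is to reference the base image by digest rather than by a moving tag like `latest`. The check below is a sketch with a made-up function name; it only tests whether an image reference ends in a sha256 digest.

```shell
# Hypothetical helper: succeeds if the image reference in $1 is pinned by
# digest (ends in @sha256:<64 hex chars>), fails for tag-only references.
is_pinned_by_digest() {
    printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# To find the digest of an image you already pulled and know to work:
#   docker image inspect --format '{{index .RepoDigests 0}}' rocm/pytorch:latest
# then use the printed name@sha256:... reference in your Dockerfile's FROM line.
```

This pins the userspace only; the kernel driver still comes from the host, so host kernel updates can still change behavior.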
I don't really have any problem with ROCm these days, although I only use system packages. It used to be quite wonky though, and I've totally given up on custom ROCm installs.
Same here on my 7900 XTX. Used to be terrible, now it's (seemingly) fine (for now).
That's our observation too, which is why we wrote the scripts ourselves. That way we can at least control the dependencies.
It's not the experience I have. I've been using ollama for 6 months on mine and never had any issues with ROCm breaking.
I got my radeon pro vii for €300 new. Was not a bad deal IMO especially since it comes with HBM2 and has the same memory bandwidth as the 4090 (1TB/s). It's got only 16GB though.
Probably the 7900xtx. $1k for 24GB of RAM.
That's about the same price as a 3090 and it's also 24GB. Are they faster at inference?
it is not, at least in llama.cpp/llamafile
https://benchmarks.andromeda.computer/compare
According to that benchmark a 7900xtx is equal to a 2080ti in performance.
I doubt it, but the 3090 is a four year old card which means it might have a lot of mileage from the previous owner. A lot of them are from mining rigs.
People use "Docker-based" all the time but what they mean is that they ship $SOFTWARE in a Docker image.
"Docker-based" reads, to me, as if you were doing Inference on AMD cards with Docker somehow, which doesn't make sense.
You can do inference from a Docker container, just as you'd do it with NVidia. OpenAI runs a K8s cluster doing this. I have personally only worked with NVidia, but the docs are present for AMD too.
Like anything AI and AMD, you need the right card(s) and rocm version along with sheer dumb luck to get it working. AMD has Docker images with rocm support, so you could merge your app in with that as the base layer. Just pass through the GPU to the container and you should get it working.
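As a sketch of what "pass through the GPU" usually looks like for ROCm containers: the two device nodes `/dev/kfd` and `/dev/dri` are handed to the container and the `video` group is added. The helper below only prints the command so it can be inspected first; its name is made up, and some hosts may also need the `render` group or extra security options, which is an assumption to verify against AMD's container docs.

```shell
# Hypothetical helper: print (not run) a docker command that exposes the
# AMD GPU device nodes to a container. $1 is the image; any remaining
# arguments are the command to run inside it.
rocm_docker_cmd() {
    image="$1"
    shift
    printf 'docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video %s' "$image"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

rocm_docker_cmd rocm/pytorch:latest rocminfo
```

Running the printed command and seeing your GPU in the `rocminfo` output is a reasonable smoke test before layering an app on top.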
It might just be the software in a Docker image, but it removes a variable I would otherwise have to worry about during deployment. It literally is inference on AMD with Docker, if that's what you meant.
Docker became part of the standard toolkit for ML because deploying Python that links to underlying system libraries is a gong show unless you ship that layer too.
Even Docker doesn't guarantee reproducible results, due to sensitivity to host GPU drivers and to ML frontends/integrations bringing their own "helpful" newbie-friendly all-in-one dependency checks and updater services.
you can mount a specific device to docker. If you read the script, we are mounting GPUs
https://github.com/slashml/amd_inference/blob/main/run-docke...
Hi, we (ZML), fix that: https://github.com/zml/zml
Works out of the box on our MI300x. Fantastic work steeve!
https://x.com/HotAisle/status/1842245896085356949
This is pretty cool. Is there a document that shows which AMD drivers are supported out of the box?
We are in line with ROCm 6.2 support. We actually just opened a PR to bump to 6.2.2: https://github.com/zml/zml/pull/39
Why doesn’t it make sense? You can talk to devices from a Docker container - you just have to attach it.
Yeah, they're using docker to wrap up the software packages, which is what Docker is used for. I don't understand why that confuses you or what you think Docker is otherwise used for.
I'm pretty comfortable with Docker/cgroups/namespaces, I have quite a deep understanding of it. But I read "Docker-based inference" like you literally took Docker code to... do inference? The wording in my opinion doesn't make much sense. It's like saying, I don't know, "Flatpak-based inference" or "SSD-based inference". Semantics.
If you're interested in how much the AMD graphics cards cost compared to the NVidia ones, I have https://gpuquicklist.com/ which gives you a quick table view of lowest prices available on Amazon that I can find. </ selfpromotion>
Does it work with an APU? I just put 64GB in my system and gonna drop in a 5700G. Will that be enough? SFF inference if so.
I'm able to run Ollama and llama.cpp on my Ryzen 4600G APU following this guide: https://agieverywhere.com/apuguide/AMDAPU/APU_Linux
Your APU should be similar, just faster.
There are some magic environment variables you want to set to get ROCm to work with this technically unsupported APU: HSA_OVERRIDE_GFX_VERSION=9.0.0 HSA_ENABLE_SDMA=0
Performance is not great, but slightly better than running inference on the CPU, with the bonus that your CPU is essentially free for other tasks even while running LLMs.
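A small sketch of applying the two variables quoted above from Python rather than a shell profile. The values are the ones given for the 4600G; the helper name `rocm_env` is made up for illustration, and whether the same override suits another APU is something you would have to test.

```python
import os

# Values quoted in the comment above for the Ryzen 4600G APU.
APU_ROCM_OVERRIDES = {
    "HSA_OVERRIDE_GFX_VERSION": "9.0.0",
    "HSA_ENABLE_SDMA": "0",
}


def rocm_env(base=None, overrides=APU_ROCM_OVERRIDES):
    """Return a copy of the environment with the ROCm overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


env = rocm_env()
# Hand the result to whatever process loads ROCm, for example:
#   subprocess.run(["ollama", "serve"], env=env, check=True)
print({name: env[name] for name in APU_ROCM_OVERRIDES})
```

The variables are read when the ROCm runtime starts, so they need to be in place before the process launches (or before the first import of a ROCm-backed library in the same process).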
The integrated GPU of the 5700G uses an old architecture from 2017, this one: https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Pretty sure it does not support ROCm.
BTW if you just want to play with a local LLM, you can try my old port of Mistral: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... Unlike CUDA or ROCm, my port is based on the Direct3D 11 GPU API and runs on all GPUs regardless of the brand.
@Const-me according to this it should work, https://github.com/ROCm/ROCm/issues/2216
haven't tested it, but it should according to https://github.com/ROCm/ROCm/issues/2216
You just need to update the version check here
https://github.com/slashml/amd_inference/blob/4b9ec069c4b2ac...
Feel free to open an issue with the requirements, and we will test it.
Is anyone using the new HX370-based laptops for any LLM work? I mean, the ipex-llm libraries for Intel's new Lunar Lake already support Llama 3.2 (https://www.intel.com/content/www/us/en/developer/articles/t...), but AMD's new Zen5 chips don't seem to be very active here.
Why ROCm 5.4, and not the latest (6.2)?
https://github.com/slashml/amd_inference/blob/main/Dockerfil...
Also looks like the Docker image provided by this project doesn't successfully build: https://github.com/slashml/amd_inference/issues/2
Tested with the old ROCm; created a PR with the latest.
Anyone else with an Intel Arc card idle waiting for some support?
I'm all for having more open source projects, but I do not see how this can be useful in this ecosystem, especially for people with newer AMD GPUs (not supported in this project), which are already supported in most popular projects.
Just something that we found helpful. Support for new architectures is just a package update. This is more of a cookie cutter.
Sad that RDNA2 cards aren't supported. Not even that old!
Isn't this just a wrapper for huggingface-transformers?
Yes, but it handles all the dependencies for AMD architecture. So technically it's just a requirements file :). Author of the repo above.
Why would you use this over vLLM?
We have vLLM in certain production instances; it is a pain for most non-NVIDIA architectures. After a bit of digging around, we realized that most of it is just a wrapper on top of PyTorch function calls. If we can do without the batch processing that vLLM supports, we are good, and that is what we did here.
Batching is how you get ~350 tokens/sec on Qwen 14b on vLLM (7900XTX). By running 15 requests at once.
Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?
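On the batching figure, a quick back-of-the-envelope split, assuming the ~350 tokens/sec is the aggregate across all 15 requests and that they share it evenly (both are my reading of the comment, not measured numbers):

```python
def per_request_rate(aggregate_tokens_per_sec, concurrent_requests):
    """Average generation rate each request sees if the batch shares throughput evenly."""
    return aggregate_tokens_per_sec / concurrent_requests


aggregate = 350  # tokens/sec quoted for Qwen 14b on a 7900XTX
batch = 15       # concurrent requests quoted above

single = per_request_rate(aggregate, batch)
print(f"{single:.1f} tokens/sec per request")  # 23.3
```

So each individual request still streams at roughly 23 tokens/sec; batching raises total throughput rather than making any one response faster.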
Driver mismatch issues. We mostly use publicly available instances, so the drivers change as the instances change, according to their base image. Not saying it won't work, but it was more painful to figure out vLLM than to write a simple inference script and do it ourselves.
Will this work on older cards such as RX 570? Does anyone know?
Does it work with GGUF files?
how about they follow up 7900 XTX with a card that actually has some VRAM
They prefer you pay $3,600 for AMD Radeon Pro W7900, 48GB VRAM.
... which also has a much lower power cap
Not that much lower, 295W vs 355W, and for LLM inference VRAM bandwidth is the main bottleneck. But the price is ridiculous.
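To make the bandwidth point concrete, here is a rough ceiling estimate, assuming single-stream decoding has to read every weight once per generated token. The bandwidth figures are the published spec numbers as I recall them (check the vendor spec sheets), and the 7 GB model size is a hypothetical stand-in for a ~14B model at ~4 bits per weight.

```python
def max_tokens_per_sec(bandwidth_gb_s, weights_gb):
    """Rough ceiling: each generated token streams all weights from VRAM once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 7.0  # hypothetical: ~14B parameters at ~4 bits each

# Assumed memory bandwidth in GB/s; verify against the spec sheets.
cards = {"7900 XTX": 960.0, "W7900": 864.0}

for name, bandwidth in cards.items():
    ceiling = max_tokens_per_sec(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.0f} tokens/sec per stream")
```

Real numbers land well below this ceiling, but the ratio between the two cards follows bandwidth rather than the power cap, which is the point being made above.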
Are we supposed to use AMD GPUs for this to work? Or does it work on any GPU?
> This project provides a Docker-based inference engine for running Large Language Models (LLMs) on AMD GPUs.
First sentence of the README in the repo. Was it somehow unclear?