Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs

(dstack.ai)

10 points | by latchkey 12 hours ago

3 comments

  • cheptsov 10 hours ago

    TL;DR

    - We explore how the inference performance of Llama 3.1 405B varies on 8x AMD MI300X GPUs across vLLM and TGI backends in different use cases.

    - TGI is highly efficient at handling medium to high workloads. In our tests on 8x AMD MI300X GPUs, medium workloads are defined as 2 to 4 requests per second (RPS). In these cases, TGI delivers faster time to first token (TTFT) and higher throughput.

    - Conversely, vLLM handles lower RPS well but struggles to scale, making it less suitable for more demanding workloads.

    - TGI's edge comes from its continuous batching algorithm, which dynamically adjusts the batch composition during generation to keep GPU utilization high (see the sketch below).
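
    To make that concrete, below is a minimal Python sketch of the idea behind continuous batching. It is a toy scheduler written purely for illustration (the Request class, slot limit, and loop are my own simplifications, not TGI's code): finished requests leave the batch immediately and queued requests take their slots at the next decode step, so slots are not held idle while the longest request in a static batch finishes.

      # Toy continuous-batching scheduler (illustrative only, not TGI's implementation).
      from collections import deque
      from dataclasses import dataclass


      @dataclass
      class Request:
          id: int
          tokens_left: int  # number of tokens still to generate


      def continuous_batching(incoming: deque, max_batch_size: int = 4) -> None:
          running = []
          step = 0
          while incoming or running:
              # Fill any free slots from the queue at every decode step.
              while incoming and len(running) < max_batch_size:
                  running.append(incoming.popleft())

              # One decode step: every running request emits one token.
              for req in running:
                  req.tokens_left -= 1

              # Finished requests leave immediately, freeing slots for new ones.
              finished = [r for r in running if r.tokens_left == 0]
              running = [r for r in running if r.tokens_left > 0]

              step += 1
              if finished:
                  print(f"step {step}: finished {[r.id for r in finished]}, "
                        f"running {[r.id for r in running]}")


      if __name__ == "__main__":
          # Requests with different output lengths; short ones finish early and
          # their slots are reused without waiting for the longest request.
          queue = deque(Request(i, n) for i, n in enumerate([3, 8, 2, 6, 4, 5]))
          continuous_batching(queue)

    With a static batch, the three requests that start alongside the 8-token one would hold their slots until it finishes; here the short requests exit after a few steps and the queued requests are admitted mid-run.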

    If you have feedback or want to help improve the benchmark, please let me know.

    • Muhtasham 7 hours ago

      Thanks for the detailed analysis, would be curious to see an FP8 comparison too, given vLLM has some custom kernels

  • 12 hours ago
    [deleted]