Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs

(dstack.ai)

10 points | by latchkey 12 hours ago

3 comments

  • cheptsov 10 hours ago

    TL;DR

    - We explore how the inference performance of Llama 3.1 405B varies on 8x AMD MI300X GPUs across vLLM and TGI backends in different use cases.

    - TGI is highly efficient at handling medium to high workloads. In our tests on 8x AMD MI300X GPUs, medium workloads are defined as 2 to 4 requests per second (RPS). In these cases, TGI delivers faster time to first token (TTFT) and higher throughput.

    - Conversely, vLLM handles lower RPS well but struggles to scale, making it less suitable for more demanding workloads.

    - TGI's edge comes from its continuous batching algorithm, which dynamically adjusts the batch composition during generation to keep GPU utilization high (see the sketch below).
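
    To make that concrete, below is a minimal Python sketch of the idea behind continuous batching. It is a toy scheduler written purely for illustration (the Request class, slot limit, and loop are my own simplifications, not TGI's code): finished requests leave the batch immediately and queued requests take their slots at the next decode step, so slots are not held idle while the longest request in a static batch finishes.

      # Toy continuous-batching scheduler (illustrative only, not TGI's implementation).
      from collections import deque
      from dataclasses import dataclass


      @dataclass
      class Request:
          id: int
          tokens_left: int  # number of tokens still to generate


      def continuous_batching(incoming: deque, max_batch_size: int = 4) -> None:
          running = []
          step = 0
          while incoming or running:
              # Fill any free slots from the queue at every decode step.
              while incoming and len(running) < max_batch_size:
                  running.append(incoming.popleft())

              # One decode step: every running request emits one token.
              for req in running:
                  req.tokens_left -= 1

              # Finished requests leave immediately, freeing slots for new ones.
              finished = [r for r in running if r.tokens_left == 0]
              running = [r for r in running if r.tokens_left > 0]

              step += 1
              if finished:
                  print(f"step {step}: finished {[r.id for r in finished]}, "
                        f"running {[r.id for r in running]}")


      if __name__ == "__main__":
          # Requests with different output lengths; short ones finish early and
          # their slots are reused without waiting for the longest request.
          queue = deque(Request(i, n) for i, n in enumerate([3, 8, 2, 6, 4, 5]))
          continuous_batching(queue)

    With a static batch, the three requests that start alongside the 8-token one would hold their slots until it finishes; here the short requests exit after a few steps and the queued requests are admitted mid-run.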

    If you have feedback or want to help improve the benchmark, please let me know.

    • Muhtasham 7 hours ago

      Thanks for the detailed analysis, would be curious to see an FP8 comparison too, given vLLM has some custom kernels

  • 12 hours ago
    [deleted]