Llama 405B 506 tokens/second on an H200

(developer.nvidia.com)

18 points | by moondistance 14 hours ago

5 comments

  • EgoIncarnate 12 hours ago

    not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

    • FanaHOVA 12 hours ago

      The title on HN is wrong. The article says GPUs, plural; it's referring to one of their 8xH200 boxes.
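
For context on the comparison this sub-thread quotes: tensor parallelism shards each weight matrix across all eight GPUs, while pipeline parallelism assigns whole layers to each GPU. Below is a minimal single-process sketch with numpy arrays standing in for GPU shards; the sizes and function names are illustrative, not NVIDIA's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    n_shards, d = 8, 64                      # eight "GPUs", hidden size 64
    x = rng.standard_normal(d)
    layers = [rng.standard_normal((d, d)) for _ in range(4)]

    def tensor_parallel_matmul(x, W, n):
        # Split W column-wise: every shard participates in every layer,
        # and the output slices are concatenated (an all-gather on real hardware).
        shards = np.split(W, n, axis=1)
        return np.concatenate([x @ s for s in shards])

    def pipeline_forward(x, layers):
        # Each stage owns whole layers; activations flow stage to stage
        # (point-to-point sends on real hardware). Here: one layer per stage.
        for W in layers:
            x = x @ W
        return x

    y_tp = x
    for W in layers:
        y_tp = tensor_parallel_matmul(y_tp, W, n_shards)
    assert np.allclose(y_tp, pipeline_forward(x, layers))

Both paths compute the same result; what differs is when and what the eight GPUs communicate, which is the trade-off the article's table measures.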

  • moondistance 14 hours ago

    Significant further optimizations. FP8!
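
On the FP8 point: 8-bit floating-point weights halve memory traffic relative to FP16, which is a large part of such throughput gains. A hedged sketch of per-tensor FP8 (e4m3) quantization using the ml_dtypes software float8 type; the max-based scaling recipe here is an illustrative assumption, not the article's exact method.

    import numpy as np
    from ml_dtypes import float8_e4m3fn

    w_fp16 = np.random.default_rng(0).standard_normal(4096).astype(np.float16)
    scale = float(np.abs(w_fp16).max()) / 448.0   # 448 = largest finite e4m3fn value
    w_fp8 = (w_fp16 / np.float16(scale)).astype(float8_e4m3fn)

    # Dequantize to see what the 8-bit format costs in accuracy.
    w_back = w_fp8.astype(np.float16) * np.float16(scale)
    print("bytes:", w_fp16.nbytes, "->", w_fp8.nbytes)        # 8192 -> 4096
    print("max abs error:", float(np.abs(w_fp16 - w_back).max()))

This only simulates the storage and rounding side; on H200-class hardware the serving stack runs the matmuls natively in FP8 on the tensor cores.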

  • 7e 12 hours ago

    And this is why nobody submits MLPerf against NVIDIA.