Llama 405B at 506 tokens/second on an H200

(developer.nvidia.com)

21 points | by moondistance 9 months ago

5 comments

  • EgoIncarnate 9 months ago

    not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

    • FanaHOVA 9 months ago

      The title on HN is wrong: the article says GPUs, and it's referring to one of their 8xH200 boxes.
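
For context on the distinction the quoted sentence is drawing, here is a minimal NumPy sketch, not the article's implementation: tensor parallelism splits a single layer's weight matrix across all eight GPUs (simulated below as plain array shards), while pipeline parallelism gives each GPU whole layers and passes activations along. All names and shapes are illustrative.

    import numpy as np

    NUM_GPUS = 8  # the eight-GPU H200 node from the article, simulated as shards

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 512))  # activations for a small batch
    layers = [rng.standard_normal((512, 512)) for _ in range(NUM_GPUS)]

    def tensor_parallel_layer(x, w, num_shards):
        # Tensor parallelism: split ONE layer's weight matrix column-wise
        # across all shards; every shard works on every token, and the
        # partial outputs are concatenated (an all-gather on real hardware).
        shards = np.split(w, num_shards, axis=1)
        return np.concatenate([x @ s for s in shards], axis=1)

    def pipeline_parallel_forward(x, layers):
        # Pipeline parallelism: each shard owns WHOLE layers; activations
        # flow shard to shard, so a single request visits the GPUs in turn.
        for w in layers:  # each weight matrix lives on a different "GPU"
            x = x @ w
        return x

    tp_out = tensor_parallel_layer(x, layers[0], NUM_GPUS)
    assert np.allclose(tp_out, x @ layers[0])  # same math, different split
    pp_out = pipeline_parallel_forward(x, layers)

The trade-off the article's table is measuring: tensor parallelism communicates on every layer but keeps all GPUs busy per token, while pipeline parallelism communicates only between stages but can leave stages idle.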

  • 7e 9 months ago

    And this is why nobody submits MLPerf results against NVIDIA.

  • moondistance 9 months ago

    There are significant further optimizations available. FP8!
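
For readers unfamiliar with what FP8 buys here: the sketch below is a rough illustration of per-tensor FP8 E4M3 "fake quantization" (scale into the format's range, round to its 4-bit significand, scale back), not the article's actual kernels. It ignores subnormals and E4M3's lack of infinities, and assumes a nonzero tensor; all names are illustrative.

    import numpy as np

    E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
    MANT_BITS = 3     # explicit mantissa bits in E4M3

    def fake_quant_fp8_e4m3(x):
        # Scale the tensor into the E4M3 dynamic range, round to a 4-bit
        # significand (1 implicit + 3 explicit bits), then scale back.
        # Real FP8 inference stores the scaled 8-bit values plus the scale
        # factor; this "fake quant" only exposes the precision loss.
        scale = E4M3_MAX / np.max(np.abs(x))
        y = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
        mant, exp = np.frexp(y)  # y == mant * 2**exp, 0.5 <= |mant| < 1
        step = 2.0 ** (MANT_BITS + 1)
        mant = np.round(mant * step) / step
        return np.ldexp(mant, exp) / scale

    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024).astype(np.float32)
    w8 = fake_quant_fp8_e4m3(w)
    print("max abs rounding error:", np.max(np.abs(w - w8)))

Halving the bytes per weight roughly halves memory-bandwidth pressure, which is why FP8 is the obvious next lever on top of the parallelism tuning in the article.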