18 points | by moondistance 14 hours ago
5 comments
not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"
Title on HN is wrong. The article says GPUs and it's referring to one of their 8xH200 boxes.
Significant further optimizations. FP8!
And this is why nobody submits MLPerf against NVIDIA.
It's weird. I looked up whether AMD has any benchmarks on the 405B for the MI300x, and came across this one -- https://dstack.ai/blog/amd-mi300x-inference-benchmark/#token...
From my understanding, it can get up to around 2500 tokens/s? Both are 8x units (H200 and MI300x).