Decoupling Compute and Memory for Async GPUs

7 points | by yiyingzhang 7 hours ago

2 comments

bobbyzhu2008 7 hours ago
67% less kernel code is the more interesting number here — Hopper's async capabilities have been underutilized largely because the programming model is painful. Curious how it handles cases where compute and memory phases aren't cleanly separable.

jhap 5 hours ago
This seems like a better version of CUDA, for Hopper GPUs?