1 comment

  • vkaufmann 2 hours ago

    I went all in and wrote a custom FP4 GEMM kernel on top of CUTLASS 3.8. Along the way I discovered FP4 doesn’t actually help training - no backward pass. But what came out of it is something I haven’t seen anywhere else for consumer Blackwell: a standalone FP4 GEMM library with a pre-quantized weight cache that hits 85-129 TFLOPS on the Spark.

    Weights are quantized once at model load; only activations are quantized on the fly, per call. Integrated into a full transformer (GPT-OSS-4.2B, 24 layers, 288 GEMM calls per forward pass), it runs 1.3-2.3x faster than BF16 at inference-relevant batch sizes with 4x memory savings. Tested on both the 4.2B and 20B models - the 20B model's weights drop from 43.4 GB to 4.0 GB in FP4 (10.8x compression). No dependency on vLLM, TRT-LLM, or sglang - just a library you can call from any Python code.
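    To make the split concrete, here is a minimal NumPy sketch of that pattern: round-to-nearest quantization onto the FP4 E2M1 value grid, with weights quantized once at construction and activations quantized per call. The names (`quantize_fp4`, `FP4Linear`) and the per-tensor scaling scheme are illustrative assumptions, not the library's actual API, and the 4-bit values are kept in dequantized form rather than packed codes.

    ```python
    import numpy as np

    # Representable magnitudes of the FP4 E2M1 format (the 4-bit float
    # element format used by Blackwell's hardware FP4 path).
    E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_fp4(x):
        """Per-tensor scale plus round-to-nearest onto the E2M1 grid."""
        scale = np.abs(x).max() / E2M1_GRID[-1]
        if scale == 0.0:
            scale = 1.0
        mags = np.abs(x) / scale
        idx = np.abs(mags[..., None] - E2M1_GRID).argmin(axis=-1)
        # A real kernel would store packed 4-bit codes; for readability this
        # sketch keeps the dequantized grid values instead.
        q = np.sign(x) * E2M1_GRID[idx]
        return q, scale

    class FP4Linear:
        """Hypothetical layer: weights quantized once, activations per call."""
        def __init__(self, weight):
            self.wq, self.wscale = quantize_fp4(weight)  # pre-quantized weight cache

        def __call__(self, x):
            xq, xscale = quantize_fp4(x)  # on-the-fly activation quantization
            # Matmul on quantized values, then apply both per-tensor scales.
            return (xq @ self.wq.T) * (xscale * self.wscale)
    ```

    The point of the split is that the expensive weight quantization is paid once at model load, while the cheap per-call work is limited to the (much smaller) activation tensors.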

    Full source is open at https://github.com/VincentKaufmann/fp4-cuda-kernel: a custom FP4 GEMM kernel for DGX Spark / RTX 50 Series (SM120/SM121), hitting 143 TFLOPS peak, 5-9x faster than BF16, built on CUTLASS 3.8.

    Why This Library Exists

    No existing path gives you hardware FP4 on SM121 as a standalone library.

    Find the complete post here: https://forums.developer.nvidia.com/t/custom-fp4-cuda-kernel...

    Repo: https://github.com/VincentKaufmann/fp4-cuda-kernel