2 comments

  • DARSHANFOFADIYA 7 hours ago

    I've been working on optimizing training for long-context models (70B+) and found that while Tensor Parallelism is well-documented, the newer "Unified" Sequence Parallelism techniques (like DeepSpeed Ulysses) are often treated as black boxes.

    I wrote this deep dive to visualize exactly how we shard the Q, K, V projections and how the All-to-All communication primitives work during the attention step to handle 1M+ tokens.
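    To give a rough idea of the core trick before you click through: after the QKV projection each rank holds every head for its slice of the sequence, and a single all-to-all turns that into every token for a slice of the heads, so attention runs locally over the full sequence; a second all-to-all undoes it afterwards. A minimal PyTorch-style sketch, not the post's code (it assumes torch.distributed is initialized with NCCL, num_heads divides evenly across the sequence-parallel group, and the helper names are just for illustration):

      import torch
      import torch.distributed as dist

      def seq_to_head_shard(x: torch.Tensor, group=None) -> torch.Tensor:
          """All-to-all resharding: [N/P, H, d] per rank -> [N, H/P, d] per rank."""
          P = dist.get_world_size(group)
          n_local, H, d = x.shape
          # Group heads into P chunks so rank i can collect head-group i from every rank.
          x = x.reshape(n_local, P, H // P, d).permute(1, 0, 2, 3).contiguous()  # [P, N/P, H/P, d]
          out = torch.empty_like(x)
          dist.all_to_all_single(out, x, group=group)  # exchange chunks along dim 0
          # Received chunks are this rank's head-group for every sequence slice, in rank order.
          return out.reshape(P * n_local, H // P, d)  # full sequence, local heads

      def head_to_seq_shard(x: torch.Tensor, group=None) -> torch.Tensor:
          """Inverse all-to-all after attention: [N, H/P, d] -> [N/P, H, d]."""
          P = dist.get_world_size(group)
          N, h_local, d = x.shape
          x = x.reshape(P, N // P, h_local, d).contiguous()  # split sequence back into P slices
          out = torch.empty_like(x)
          dist.all_to_all_single(out, x, group=group)
          return out.permute(1, 0, 2, 3).reshape(N // P, P * h_local, d)  # local tokens, all heads

    Q, K and V each go through the first call and the attention output goes through the second, which is where the four all-to-alls per layer in the communication cost analysis come from.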

    The post covers:

    The architectural difference between Ring Attention and Ulysses (and why Ulysses often wins on H100 clusters; see the rough cost sketch after this list).

    Diagrams of the specific "All-to-All" communication steps.

    How to handle the KV-cache bottleneck without exploding memory.
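    On the Ring vs Ulysses point, the back-of-envelope per-GPU volume comparison looks roughly like this (illustrative numbers and assumptions, not measurements from the post):

      # Rough per-GPU communication volume per transformer layer, assuming bf16 activations.
      # N = sequence length, d = hidden size, P = sequence-parallel group size.

      def ulysses_bytes(N, d, P, bytes_per_el=2):
          # Four all-to-alls (Q, K, V, attention output); each moves ~(N/P)*d elements
          # per rank, of which a (P-1)/P fraction actually crosses the interconnect.
          return 4 * (N // P) * d * (P - 1) // P * bytes_per_el

      def ring_bytes(N, d, P, bytes_per_el=2):
          # Each rank forwards its K and V blocks (2*(N/P)*d elements) around the ring
          # P-1 times, so the per-GPU volume stays ~2*N*d no matter how many GPUs you add.
          return 2 * (N // P) * d * (P - 1) * bytes_per_el

      N, d = 1_000_000, 8192
      for P in (8, 16, 32, 64):
          print(f"P={P:3d}  ulysses={ulysses_bytes(N, d, P)/1e9:5.1f} GB  ring={ring_bytes(N, d, P)/1e9:5.1f} GB")

    Ring can overlap those sends with the attention compute, so raw volume isn't the whole story, but on an NVSwitch-connected H100 node where all-to-all bandwidth is cheap, the volume shrinking with P is the main reason Ulysses tends to come out ahead.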

    Happy to answer questions about the implementation or the communication cost analysis!

  • ClaireGz 7 hours ago

    This is super helpful — most writeups skip over the actual communication steps, so seeing the All-to-All flow laid out makes it much clearer.

    Curious from your experiments: at 1M+ context, does communication start to dominate compute time?

    I keep seeing cases where bigger context windows are technically possible but don’t translate into better results unless the context is very structured, so I wonder where the real scaling limit ends up being in practice.