In the previous analysis of sequence parallelism, I covered two papers [1][2]. Those are early works on sequence parallelism, and they received little attention at the time because there was low demand for context parallelism. Once LLMs were required to support longer contexts, new papers emerged that tackle the problems of those early works.

## What are the Problems of Early Sequence Parallelism Works?

Both works follow the traditional attention computation: materialize the full $QK^T$ score matrix, apply the mask, then take the softmax and multiply by $V$.
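To make this concrete, here is a minimal PyTorch sketch of that traditional attention (function and variable names are mine for illustration, not from either paper). The key point is that the full $n \times n$ score matrix is materialized in memory:

```python
import torch

def naive_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.size(-1)
    # Materialize the full (seq_len x seq_len) score matrix: O(n^2) memory.
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    if mask is not None:
        # e.g. a causal mask; masked positions are set to -inf before softmax
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

Because the score matrix grows quadratically with sequence length, any scheme built directly on this formulation inherits that memory cost.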