This post explains how tensor model parallelism and sequence parallelism work, especially in the attention layer, and how they differ.

## Backgrounds

### Attention Layer

*Figure: attention calculation for a single sequence of T tokens. `d_attn` is `config.embed_dim // config.num_attention_heads`. Bold boxes are model parameters, while the others are temporarily created tensors. All other terms are borrowed from the HuggingFace Transformers OPT config.*

Implementation of computing attention:
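Below is a minimal sketch of this attention computation for a single sequence, written with assumed names (`T`, `num_heads`, `d_attn`, `embed_dim`, and the projection weights `w_q`/`w_k`/`w_v`/`w_o`); it mirrors the figure rather than the exact HuggingFace OPT code, and omits biases and causal masking for brevity.

```python
import torch

# Assumed dimensions (hypothetical values for illustration):
#   embed_dim = num_heads * d_attn, hidden_states: (T, embed_dim)
T, num_heads, d_attn = 8, 4, 16
embed_dim = num_heads * d_attn

hidden_states = torch.randn(T, embed_dim)

# The "bold boxes": projection weights for Q, K, V, and the output projection.
w_q = torch.randn(embed_dim, embed_dim)
w_k = torch.randn(embed_dim, embed_dim)
w_v = torch.randn(embed_dim, embed_dim)
w_o = torch.randn(embed_dim, embed_dim)

# Project to Q/K/V and split embed_dim into (num_heads, d_attn) per head.
q = (hidden_states @ w_q).view(T, num_heads, d_attn).transpose(0, 1)  # (num_heads, T, d_attn)
k = (hidden_states @ w_k).view(T, num_heads, d_attn).transpose(0, 1)
v = (hidden_states @ w_v).view(T, num_heads, d_attn).transpose(0, 1)

# Scaled dot-product attention per head: (num_heads, T, T) score matrix.
scores = (q @ k.transpose(-2, -1)) / d_attn**0.5
probs = torch.softmax(scores, dim=-1)
context = probs @ v  # (num_heads, T, d_attn)

# Merge heads back into embed_dim and apply the output projection.
attn_output = context.transpose(0, 1).reshape(T, embed_dim) @ w_o  # (T, embed_dim)
```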