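Multi-head attention runs several attention functions (heads) in parallel, each with its own learned projections of the queries, keys, and values. Because each head works in a different representation subspace, the heads can capture different kinds of relationships between positions at the same time. The head outputs are concatenated and passed through a final output projection W^O: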
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
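The formula can be illustrated with a minimal NumPy sketch. It assumes that each head uses scaled dot-product attention (the text above does not say which attention function is used) and that every head has the same width d_k = d_model / h. The names `softmax`, `attention`, `multi_head`, and the random weights are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Assumed attention function: scaled dot-product attention.
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (h, d_model, d_k), one projection per head.
    # W_o: (h * d_k, d_model), the output projection W^O.
    h = W_q.shape[0]
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

# Illustrative sizes and random weights (not taken from the text).
rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = X
print(out.shape)  # (4, 8)
```

The loop over heads keeps the correspondence with the formula explicit; practical implementations usually batch all heads into a single tensor operation instead.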