Multi-Head Attention
An attention mechanism that runs multiple scaled dot-product attention operations in parallel, each with its own learned query, key, and value projections, then concatenates their outputs and applies a final output projection. Multi-head attention lets the model attend to different types of information (e.g., position, color, shape) simultaneously. It is the core computational primitive of the transformer architectures used in VLAs and policy transformers.
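A minimal sketch of the mechanism in NumPy, under the standard formulation (per-head scaled dot-product attention over split projections, followed by concatenation and an output projection); the function and variable names here are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then split into heads: (num_heads, seq_len, d_head)
    def split_heads(z):
        return z.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Scaled dot-product attention, independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, S, S)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    out = attn @ v                                       # (H, S, d_head)

    # Concatenate heads back to (seq_len, d_model), then project
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1
                  for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(y.shape)  # (4, 8): one d_model-sized output vector per input token
```

Because each head operates on its own `d_head`-dimensional slice of the projections, the total cost matches single-head attention at the same `d_model`, while the heads are free to specialize.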