Definition
Transformer-based policies apply attention mechanisms to robot control. They can process heterogeneous inputs — images, proprioception, language instructions, and action history — through a unified sequence model. Key architectures include RT-1 (tokenized actions with EfficientNet vision), RT-2 (VLM backbone), ACT (action chunking transformer for bimanual control), and Octo (scalable cross-embodiment transformer). Transformers naturally handle variable-length contexts and multi-task conditioning. Their main challenge in robotics is inference latency — real-time control at 10–50 Hz requires efficient model designs or action chunking to amortize compute.
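The amortization idea behind action chunking can be sketched in a few lines: the policy predicts a chunk of H future actions per forward pass, and the control loop drains that buffer at the control rate, so the (expensive) model runs only once every H steps. This is a minimal illustration with a stand-in for the transformer; the chunk size, 7-DoF action dimension, and `policy_infer` function are assumptions for the example, not any specific system's API.

```python
import numpy as np

CHUNK = 10  # H: actions predicted per inference call (hypothetical value)


def policy_infer(obs):
    """Stand-in for a transformer forward pass.

    A chunking policy (in the spirit of ACT) decodes H action tokens
    per call instead of one; here we just return random 7-DoF actions.
    """
    rng = np.random.default_rng(int(obs.sum() * 1000) % (2**32))
    return rng.normal(size=(CHUNK, 7))


def control_loop(n_steps, obs):
    """Run n_steps of control, calling the model only when the buffer empties."""
    executed = []
    inference_calls = 0
    buffer = []
    for _ in range(n_steps):
        if not buffer:
            buffer = list(policy_infer(obs))  # one forward pass yields H actions
            inference_calls += 1
        executed.append(buffer.pop(0))  # consume one action per control tick
    return np.array(executed), inference_calls
```

For example, 100 control steps with a chunk size of 10 require only 10 forward passes, so a model too slow for 50 Hz single-step inference can still meet the control-rate budget.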
Why It Matters for Robot Teams
Understanding transformer policies is essential for teams building real-world robot systems. The choice of architecture shapes how demonstration data is collected and tokenized, how policies are conditioned on language or multi-task goals, and whether deployment hardware can meet the 10–50 Hz control-rate budget, often via action chunking. Whether you are collecting demonstration data, training policies in simulation, or deploying in production, these trade-offs directly affect your workflow and system design.