VLA
Vision-Language-Action model — a foundation model that takes visual observations and a language instruction as input and directly outputs robot actions, unifying perception, language understanding, and control in a single neural network. Examples include RT-2, OpenVLA, Octo, and π0. VLAs are the current frontier of generalist robot learning, often framed as an attempt at a 'GPT moment' for robotics.
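The interface a VLA exposes can be sketched minimally: an observation (image) plus a natural-language instruction in, a continuous action vector out. The class and method names below (`VLAPolicy`, `predict_action`) and the 7-dimensional action space are illustrative assumptions, not a real library's API; a real VLA would run a vision-language backbone where the stub returns zeros.

```python
import numpy as np

class VLAPolicy:
    """Hypothetical sketch: maps (image, instruction) -> a robot action."""

    def __init__(self, action_dim: int = 7):
        # Assumed action space: 6-DoF end-effector delta + 1 gripper command.
        self.action_dim = action_dim

    def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA encodes the image and instruction with a large
        # vision-language model and decodes an action; this stub only
        # checks the observation shape and returns a zero action.
        assert image.ndim == 3, "expected an H x W x C image observation"
        return np.zeros(self.action_dim, dtype=np.float32)

policy = VLAPolicy()
obs = np.zeros((224, 224, 3), dtype=np.uint8)
action = policy.predict_action(obs, "pick up the red block")
print(action.shape)  # (7,)
```

In a control loop this call would run once per timestep, feeding each predicted action to the robot and the next camera frame back into the policy.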