VLA (Vision-Language-Action Model)

A Vision-Language-Action (VLA) model is a neural network that jointly processes visual observations (RGB images), natural language instructions, and robot proprioception to produce action outputs. VLAs extend large vision-language models (VLMs such as PaLM-E, LLaVA, or Gemini) by adding an action head: the model is trained to output robot joint positions or end-effector deltas alongside its language predictions. Notable VLAs include RT-2 (which tokenizes actions as text tokens and fine-tunes a VLM), OpenVLA (an open-source, 7B-parameter model trained on Open X-Embodiment), and pi0 (Physical Intelligence's flow-matching VLA). See the VLA and VLM article and the SVRC model catalog.
Foundation Model · Language · Core Concept
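
To make the fusion pattern concrete, here is a minimal sketch in PyTorch of how a VLA-style model might combine the three input modalities and decode a continuous action. Every module, dimension, and the mean-pooling readout is an illustrative assumption, not the architecture of RT-2, OpenVLA, or pi0; real systems instead decode actions as discrete text tokens (RT-2, OpenVLA) or via flow matching (pi0).

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA sketch: vision + language + proprioception in, action out.
    All module choices and sizes are hypothetical, for explanation only."""

    def __init__(self, vocab_size=32000, d_model=512, proprio_dim=14, action_dim=7):
        super().__init__()
        # Vision encoder: turn an RGB image into a sequence of patch embeddings.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # 16x16 patches
            nn.Flatten(2),                                      # (B, d_model, num_patches)
        )
        # Embedding for the tokenized language instruction.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Proprioception (e.g., joint positions) projected into the same space.
        self.proprio_proj = nn.Linear(proprio_dim, d_model)
        # Shared transformer backbone over the fused token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Action head: regresses an end-effector delta
        # (dx, dy, dz, droll, dpitch, dyaw, gripper).
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, instruction_ids, proprio):
        vis = self.vision_encoder(image).transpose(1, 2)   # (B, P, d)
        txt = self.text_embed(instruction_ids)              # (B, T, d)
        prop = self.proprio_proj(proprio).unsqueeze(1)      # (B, 1, d)
        tokens = torch.cat([vis, txt, prop], dim=1)          # fused multimodal sequence
        hidden = self.backbone(tokens)
        # Pool the fused representation and decode one action.
        return self.action_head(hidden.mean(dim=1))          # (B, action_dim)


model = ToyVLA()
image = torch.randn(1, 3, 224, 224)                   # RGB observation
instruction_ids = torch.randint(0, 32000, (1, 12))    # tokenized instruction
proprio = torch.randn(1, 14)                          # robot state
action = model(image, instruction_ids, proprio)
print(action.shape)  # torch.Size([1, 7])
```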
