What Is a VLA Model?
Vision-Language-Action (VLA) models take visual observations and language instructions as input and directly output robot actions. They combine the visual understanding of vision-language models (VLMs) with motor control capabilities trained on robot demonstration data. Think of them as foundation models for robot control.
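As a minimal sketch of that interface, assuming a common 7-DoF single-arm action space (actual action dimensionality and normalization vary by model and robot; the function here is a stand-in, not any specific library's API):

```python
import numpy as np

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for a VLA forward pass: pixels + text in, action out.

    A real model (OpenVLA, Octo, ...) replaces this body with a
    transformer forward pass. The 7-D output assumes a typical
    single-arm setup: (dx, dy, dz, droll, dpitch, dyaw, gripper).
    """
    assert image.ndim == 3  # H x W x 3 RGB observation
    # Placeholder: a real model conditions on both image and text.
    return np.zeros(7)

action = vla_policy(np.zeros((224, 224, 3), dtype=np.uint8),
                    "pick up the red block")
```

The point is the signature: one camera frame plus one instruction string in, one low-level action vector out, called in a loop at control time.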
Key VLA Models Compared
- RT-2 (Google DeepMind): 55B parameters, strong generalization, not publicly available.
- OpenVLA (Stanford/Berkeley): 7B parameters, open-source, fine-tunable on custom data.
- Octo (Berkeley): 93M parameters, fast inference, supports multiple robot embodiments.
- π₀ (Physical Intelligence): flow-matching-based VLA, strong dexterous manipulation.
- For research with limited compute: Octo
- For fine-tuning on custom tasks: OpenVLA
- For highest capability: π₀ (if available)
Deployment Considerations
VLA models require GPU inference (typically an RTX 3090 or better). Inference latency ranges from roughly 50ms (Octo) to 500ms+ (OpenVLA 7B), while robot control loops often run at 10–50Hz. Action chunking, in which the model predicts a short sequence of future actions per inference call, bridges that gap: the controller executes buffered actions at the fast control rate while the next model call runs. Fine-tuning on 50–200 task-specific demonstrations typically yields strong results. SVRC provides pre-configured workstations for VLA development.
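The chunked control pattern can be sketched as follows, assuming an 8-action chunk and a placeholder model (all names here are illustrative, not drawn from any particular VLA codebase):

```python
from collections import deque

CHUNK = 8  # actions predicted per (slow) model call

def predict_chunk(obs):
    """Stand-in for one slow VLA inference call returning CHUNK actions."""
    return [f"action_{obs}_{i}" for i in range(CHUNK)]

def control_loop(num_steps):
    """Fast loop: execute buffered actions; refill only when empty."""
    queue = deque()
    executed = []
    for t in range(num_steps):
        if not queue:
            # Model is invoked once per CHUNK control steps, so a
            # 400ms inference amortizes to ~50ms per executed action.
            queue.extend(predict_chunk(t))
        executed.append(queue.popleft())
    return executed

executed = control_loop(16)  # triggers two model calls, at t=0 and t=8
```

In practice implementations often overlap the next model call with execution of the current chunk, or blend overlapping chunks (as in temporal ensembling), rather than waiting for the buffer to empty.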