VLA Models Compared: OpenVLA vs π0 vs SmolVLA vs RT-2 (2026 Guide)
Vision-Language-Action models are the foundation models of robotics — neural networks that take camera images and language instructions as input and output robot motor commands. This guide compares every major VLA model available in 2026, covering architecture, performance, data requirements, and practical fine-tuning guidance.
What Is a VLA Model?
A Vision-Language-Action (VLA) model is a neural network that maps visual observations and language instructions to robot actions. The core idea: take a pre-trained vision-language model (which already understands images and text) and add an action output head that produces motor commands for a robot. The vision-language backbone provides semantic understanding ("pick up the red cup"), while the action head translates that understanding into physical movements (a sequence of joint angles or end-effector positions).
Architecture Overview
All VLA models share three components, though their implementations differ significantly:
- Vision encoder: Converts camera images into feature vectors. Common choices are Vision Transformers (ViT) or SigLIP. The vision encoder is typically pre-trained on internet-scale image datasets (ImageNet, LAION) and frozen or lightly fine-tuned during VLA training.
- Language backbone: Processes text instructions and integrates them with visual features. Most VLAs use a pre-trained large language model (LLaMA, Gemma, PaLI) as the backbone, leveraging its reasoning and instruction-following capabilities.
- Action head: Generates robot actions from the fused vision-language representation. This is where VLA architectures diverge most significantly.
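At the shape level, the three components compose as in the sketch below. The random linear maps are stand-ins for the pretrained networks, and every name and dimension here is illustrative, not taken from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image):
    # Stand-in for a ViT/SigLIP encoder: (H, W, 3) image -> (256,) features.
    # A real encoder would depend on the image; this returns random features.
    return rng.standard_normal(256)

def language_backbone(instruction, visual_features):
    # Stand-in for the LLM backbone: fuse text features with visual features.
    text_features = rng.standard_normal(256)
    return np.concatenate([visual_features, text_features])  # (512,)

def action_head(fused_features, dof=7):
    # Stand-in action head: fused representation -> one joint-space command.
    W = 0.01 * rng.standard_normal((dof, fused_features.shape[0]))
    return W @ fused_features  # (dof,) continuous action

image = np.zeros((480, 640, 3))
fused = language_backbone("pick up the red cup", vision_encoder(image))
action = action_head(fused)
print(action.shape)  # (7,)
```

The divergence between architectures lives almost entirely in the last function: what `action_head` computes and how often it is called.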
Action Representation: The Key Differentiator
How a VLA represents and generates actions determines its performance characteristics:
- Discrete tokens (RT-2, OpenVLA): Actions are quantized into discrete tokens and generated autoregressively, like text tokens. Simple to implement but produces jerky single-step actions.
- Chunked prediction (SmolVLA): The model predicts a sequence of future actions at once (typically 10–50 steps), inspired by ACT. Produces smoother trajectories and is more compute-efficient at inference time.
- Flow matching (π0): Actions are generated by iteratively denoising a random sample into a valid action trajectory. Produces the smoothest actions and handles multi-modal distributions (tasks with multiple valid solutions) best, but requires multiple denoising steps at inference time.
- Diffusion (Octo): Similar to flow matching but uses a DDPM-style diffusion process. Good for multi-modal actions but slower at inference than chunked prediction.
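The discrete-token scheme is easy to make concrete. The 256-bin count matches the OpenVLA description later in this guide; the normalized action range is an assumption for illustration:

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def discretize(action):
    """Map continuous actions in [LOW, HIGH] to integer tokens 0..255."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def undiscretize(tokens):
    """Map tokens back to continuous values (inverse of discretize)."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

a = np.array([0.0, 0.5, -1.0, 1.0])
tokens = discretize(a)          # e.g. array([128, 191, 0, 255])
recovered = undiscretize(tokens)
print(np.abs(recovered - a).max())  # quantization error, at most ~0.004 here
```

The round-trip error bound, (HIGH − LOW) / (2 × (N_BINS − 1)), is the resolution floor that token-based action heads cannot go below.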
VLA Model Comparison Table
| Model | Creator | Open Source | Parameters | Action Type | Hardware Compatibility |
|---|---|---|---|---|---|
| RT-2 | Google DeepMind | No | 55B | Discrete tokens | Google RT-X robots |
| OpenVLA | Stanford / Berkeley | Yes (Apache 2.0) | 7.5B | Discrete tokens | Any |
| π0 | Physical Intelligence | No | ~3B (est.) | Flow matching | Any |
| SmolVLA | HuggingFace | Yes (Apache 2.0) | 450M | Chunked (ACT-style) | Any |
| Octo | UC Berkeley | Yes (MIT) | ~93M | Diffusion | Any |
| RoboFlamingo | Shanghai AI Lab | Yes | ~9B | Autoregressive | Any |
| Helix | Figure AI | No | Undisclosed | Undisclosed | Figure 02 |
Deep Dive: Each VLA Model
RT-2 (Google DeepMind)
Architecture: RT-2 is built on PaLI-X (55B parameters) or PaLM-E (12B parameters). The vision encoder is a ViT-e (4B parameters), and the language model processes both visual tokens and text tokens. Actions are represented as text strings of discretized joint positions (e.g., "1 128 91 241 5 101 127") and generated autoregressively by the language model.
Training data: Trained on Google's RT-X dataset containing approximately 130,000 demonstrations from 13 robot types. Additionally leverages internet-scale vision-language pre-training from the PaLI-X backbone.
Performance: RT-2 demonstrated emergent capabilities — following instructions involving concepts never seen during robot training (e.g., "move the object to the Taylor Swift album"). On standard benchmarks, it achieved 62% success on novel semantic concepts, compared to 32% for RT-1.
Fine-tuning requirements: Not publicly available. Google has not released weights or training code.
Pros: Strongest emergent semantic generalization; first proof that VLA concept works at scale.
Cons: Closed source; requires massive compute; 55B parameters make real-time inference challenging; single-step discrete actions produce jerky motion.
OpenVLA (Stanford / Berkeley)
Architecture: Built on the Prismatic VLM backbone (Llama-2 7B + SigLIP + DINOv2 vision encoders). Actions are discretized into 256 bins per dimension and predicted autoregressively as tokens. The dual vision encoder (SigLIP for semantic features + DINOv2 for spatial features) gives OpenVLA stronger visual grounding than single-encoder designs.
Training data: Trained on the Open X-Embodiment dataset (~970K trajectories from 22 robot embodiments). Fine-tuning requires as few as 100–200 demonstrations for a new task on a specific robot.
Performance: On the LIBERO benchmark, fine-tuned OpenVLA achieves 85–95% success depending on task difficulty. On real-robot evaluations with WidowX and Franka arms, it matches or exceeds task-specific ACT policies on seen tasks and significantly outperforms them on unseen task variations.
Fine-tuning requirements: 2x A100 (80 GB) for full fine-tuning; 1x A100 for LoRA fine-tuning. Training time: 8–24 hours for 200 demonstrations. LoRA reduces memory to ~40 GB at a small performance cost.
Pros: Fully open source with excellent documentation; strongest open-source baseline; large community and active development; works on any robot with simple action space mapping.
Cons: 7.5B parameters means ~300ms per action step on consumer GPUs (too slow for reactive tasks); discrete tokens produce less smooth actions than flow matching or chunked prediction; requires significant VRAM for fine-tuning.
π0 (Physical Intelligence)
Architecture: π0 uses a vision-language backbone (likely PaLI-based, details not fully disclosed) with a flow matching action head. Flow matching generates actions by iteratively transforming a noise sample into an action trajectory through a learned velocity field. This produces smooth, continuous trajectories without the discretization artifacts of token-based approaches.
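A toy version of this inference loop, with a closed-form velocity field standing in for the learned network (the target action and step count are illustrative), looks like:

```python
import numpy as np

TARGET = np.array([0.2, -0.1, 0.4])  # hypothetical action the field encodes

def velocity(x, t):
    # For a linear path x_t = (1-t)*noise + t*target, the transport
    # velocity at x_t is (target - x_t) / (1 - t).
    return (TARGET - x) / max(1.0 - t, 1e-3)

def sample_action(n_steps=10, seed=0):
    x = np.random.default_rng(seed).standard_normal(3)  # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)  # Euler integration step
    return x

print(sample_action())  # arrives at TARGET regardless of the noise seed
```

Each denoising step is one network evaluation in the real model, which is where the 10–20-step inference latency mentioned below comes from.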
Training data: Trained on a proprietary dataset of thousands of hours of teleoperation data across multiple robot platforms (Franka, UR5, mobile manipulators, dexterous hands). The dataset is significantly larger and more diverse than Open X-Embodiment.
Performance: π0 demonstrated the broadest task generalization of any robot policy, performing laundry folding, table bussing, box packing, and assembly tasks across different robot embodiments from a single model. Quantitative benchmarks show 80–95% success on trained tasks and 40–60% zero-shot on related but unseen tasks.
Fine-tuning requirements: Not publicly available. Physical Intelligence offers API access for select partners.
Pros: Smoothest action generation; broadest task generalization demonstrated; handles contact-rich manipulation better than token-based VLAs; multi-embodiment support.
Cons: Closed source; no public weights; inference requires multiple denoising steps (10–20), adding latency; commercial access only through Physical Intelligence partnership.
SmolVLA (HuggingFace)
Architecture: SmolVLA uses SmolVLM (a compact vision-language model based on Idefics-3) as its backbone with a chunked action prediction head. The action head predicts a sequence of future actions (an action chunk) in a single forward pass, inspired by ACT's chunked prediction. The backbone pairs a SigLIP vision encoder with a compact transformer language model, for roughly 450M parameters in total.
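The execute-a-chunk-then-repredict control loop behind this design can be sketched as follows (the stub policy, chunk size, and step counts are illustrative):

```python
import numpy as np

CHUNK = 50  # actions emitted per forward pass (illustrative)
DOF = 7

def predict_chunk(observation):
    """Stand-in for the policy: one forward pass returns a whole
    (CHUNK, DOF) block of future actions."""
    return np.zeros((CHUNK, DOF))

def run_episode(n_control_steps=120):
    executed = 0
    forward_passes = 0
    while executed < n_control_steps:
        chunk = predict_chunk(observation=None)
        forward_passes += 1
        executed += len(chunk)  # execute the whole chunk open-loop
    return forward_passes

print(run_episode())  # 3 -- three forward passes cover 120 control steps
```

Amortizing one forward pass over 50 control steps is what lets a 450M-parameter model keep up with a 15–30 Hz control loop.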
Training data: Trained on a curated subset of Open X-Embodiment plus LeRobot community datasets. Fine-tuning requires 50–200 demonstrations.
Performance: On the LIBERO benchmark, SmolVLA achieves 82–90% success — within 5% of OpenVLA despite having 16x fewer parameters. On real-robot evaluations, fine-tuned SmolVLA matches OpenVLA on simple tasks and slightly underperforms on complex multi-step tasks. The key advantage is inference speed: SmolVLA runs at 15–30 Hz on a single RTX 4090, fast enough for reactive manipulation.
Fine-tuning requirements: 1x RTX 4090 (24 GB) for full fine-tuning; training time 4–12 hours for 200 demonstrations. The low resource requirement makes SmolVLA accessible to any lab with a modern consumer GPU.
Pros: Smallest and fastest open-source VLA; runs on consumer hardware; chunked prediction produces smooth actions; full training code available on HuggingFace; active community.
Cons: Smaller backbone limits language understanding compared to 7B+ models; less robust to novel instructions; chunked prediction can struggle with very long-horizon tasks.
Octo (UC Berkeley)
Architecture: Octo uses a transformer backbone with a diffusion action head. Unlike the VLA models above, Octo does not use a pre-trained language model backbone. Instead, it uses a smaller transformer trained from scratch on robot data, with optional language conditioning. The diffusion head generates action trajectories through iterative denoising.
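A toy reverse-diffusion loop, with a closed-form denoiser standing in for the trained model (the target action, noise scale, and step count are all illustrative), shows the iterative structure:

```python
import numpy as np

TARGET = np.array([0.1, 0.3])  # hypothetical mode of the action distribution

def denoise_step(x, step, n_steps, rng):
    x = x + 0.5 * (TARGET - x)  # denoiser pulls the sample toward the mode
    if step < n_steps - 1:
        # DDPM-style stochasticity: re-inject noise on all but the last step
        x = x + 0.05 * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(2)  # start from pure noise
N = 20
for step in range(N):
    x = denoise_step(x, step, N, rng)
print(x)  # ends close to TARGET
```

As with flow matching, each step costs one network evaluation, which is why diffusion inference is slower than single-pass chunked prediction.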
Training data: Trained on the Open X-Embodiment dataset (~800K trajectories). Designed specifically for cross-embodiment transfer and rapid fine-tuning.
Performance: On the SIMPLER benchmark, Octo-Base achieves 50–75% success across multiple simulated environments. Its strength is adaptation speed: fine-tuning on 20–50 demonstrations of a new task takes 30 minutes to 2 hours on a single GPU.
Fine-tuning requirements: 1x RTX 3090 or better; training time 30 minutes to 2 hours. The lowest compute requirement of any model in this comparison.
Pros: Fastest fine-tuning; smallest compute requirement; diffusion head handles multi-modal actions well; designed for rapid prototyping.
Cons: Weakest language understanding (no LLM backbone); lower absolute performance than OpenVLA or SmolVLA on complex tasks; diffusion inference is slower than chunked prediction.
RoboFlamingo (Shanghai AI Lab)
Architecture: Built on OpenFlamingo (a multi-modal model based on MPT-7B), RoboFlamingo adds a robot action prediction head to a pre-trained vision-language model. It uses perceiver resampler modules to efficiently process multi-frame visual histories, enabling the model to reason about temporal dynamics.
Training data: Trained on a combination of CALVIN benchmark data and custom real-robot demonstrations. Can be fine-tuned on task-specific datasets of 100–500 demonstrations.
Performance: On the CALVIN benchmark, RoboFlamingo achieves state-of-the-art results on multi-step long-horizon tasks. Its temporal visual reasoning (processing sequences of frames rather than single frames) gives it an advantage on tasks requiring memory of previous steps.
Fine-tuning requirements: 2x A100 (80 GB) for full fine-tuning due to the 9B parameter backbone.
Pros: Strong long-horizon task performance; temporal visual reasoning; open source.
Cons: Large model size (9B); limited real-robot evaluations published; smaller community than OpenVLA or SmolVLA; autoregressive action generation produces single-step actions.
How to Choose: VLA Decision Framework
Selecting the right VLA model depends on three factors: your compute budget, your data availability, and your task requirements.
By Compute Budget
| Budget | Fine-Tuning Hardware | Recommended Model | Inference Speed |
|---|---|---|---|
| Consumer ($0–$2K GPU) | 1x RTX 4090 | SmolVLA or Octo | 15–30 Hz |
| Research lab ($5K–$20K) | 1–2x A100 | OpenVLA (LoRA) or SmolVLA | 3–15 Hz |
| Well-funded lab ($20K+) | 4+ A100 or H100 | OpenVLA (full) or custom VLA | 3–10 Hz |
| Enterprise | Cloud cluster | π0 (API) or custom training | Varies |
By Task Type
| Task | Key Requirement | Best Model | Why |
|---|---|---|---|
| Simple pick-and-place | Speed, low compute | SmolVLA or Octo | Fast inference, minimal data needed |
| Language-conditioned manipulation | Semantic understanding | OpenVLA | Strongest language backbone among open models |
| Contact-rich manipulation | Smooth, precise actions | π0 or SmolVLA | Flow matching / chunked actions handle contact better |
| Multi-step long-horizon | Temporal reasoning | RoboFlamingo or OpenVLA | Multi-frame processing, strong LLM backbone |
| Rapid prototyping | Fast iteration | Octo | 30-min fine-tuning on 20 demos |
| Bimanual coordination | Multi-arm action space | SmolVLA or Octo | Flexible action spaces, chunked prediction |
By Data Availability
- 20–50 demonstrations: Octo (designed for rapid fine-tuning with minimal data)
- 50–200 demonstrations: SmolVLA (good performance with moderate data) or ACT (non-VLA baseline, very effective with limited data)
- 200–500 demonstrations: OpenVLA (full fine-tuning achieves best results at this scale)
- 500+ demonstrations: Any model benefits; consider OpenVLA full fine-tuning or training a custom model
Fine-Tuning a VLA on Your Robot Data
Data Format Requirements
All major VLA models expect training data in one of two formats:
- RLDS (Reinforcement Learning Datasets): The standard for Open X-Embodiment models. Used by Octo, OpenVLA, and RT-2. Each episode is stored as a sequence of (observation, action, reward) tuples in TFRecord format.
- LeRobot format: HuggingFace's format for SmolVLA and the LeRobot training framework. Each episode is stored as a Parquet file with synchronized image and action columns, plus a metadata JSON file.
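The RLDS step structure can be illustrated with plain Python dicts (field names here are representative, not the exact RLDS/TFRecord schema):

```python
import numpy as np

def make_step(image, state, action, reward=0.0):
    """One timestep in an RLDS-like episode: a (observation, action,
    reward) tuple, with the observation holding camera and state data."""
    return {
        "observation": {"image": image, "state": state},
        "action": action,
        "reward": reward,
    }

# A 3-step episode with one 480x640 RGB camera and a 7-DoF arm
episode = [
    make_step(np.zeros((480, 640, 3), dtype=np.uint8),
              np.zeros(7, dtype=np.float32),
              np.zeros(7, dtype=np.float32))
    for _ in range(3)
]
print(len(episode), sorted(episode[0].keys()))
```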
SVRC's data collection pipeline outputs both formats. See our data format guide for conversion details.
GPU Requirements and Training Time
| Model | Method | Min GPU | VRAM | Time (200 demos) |
|---|---|---|---|---|
| SmolVLA | Full fine-tune | 1x RTX 4090 | 24 GB | 4–12 hours |
| OpenVLA | LoRA | 1x A100 | 40 GB | 8–16 hours |
| OpenVLA | Full fine-tune | 2x A100 | 80 GB each | 12–24 hours |
| Octo | Full fine-tune | 1x RTX 3090 | 24 GB | 0.5–2 hours |
| RoboFlamingo | Full fine-tune | 2x A100 | 80 GB each | 16–32 hours |
LoRA vs Full Fine-Tuning
Low-Rank Adaptation (LoRA) freezes the pre-trained weights and trains small adapter matrices, reducing memory requirements by 50–70%. For OpenVLA:
- Full fine-tuning: Updates all 7.5B parameters. Requires 2x A100 (80 GB). Achieves best performance, especially when your task distribution differs significantly from the pre-training data.
- LoRA fine-tuning: Trains ~50M adapter parameters (rank 32). Requires 1x A100 (40 GB). Performance within 2–5% of full fine-tuning for most tasks. Recommended as the default approach unless you have excess compute.
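The memory saving follows directly from the adapter shapes: for a weight matrix W of shape (d_out, d_in), LoRA trains only B (d_out × r) and A (r × d_in). A back-of-envelope check, using illustrative layer dimensions rather than OpenVLA's actual ones:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for one LoRA adapter pair:
    rank * (d_in + d_out) instead of d_in * d_out."""
    return rank * (d_in + d_out)

full = 4096 * 4096                  # one full 4096x4096 projection: 16.8M params
lora = lora_params(4096, 4096, 32)  # rank-32 adapter for the same layer: 262K
print(full // lora)                 # 64 -- the adapter is 64x smaller per layer
```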
Example: Fine-Tuning SmolVLA with LeRobot
```python
# Install LeRobot first: pip install lerobot
# (Module paths and method names below follow LeRobot's layout and may
# shift between library versions; check the LeRobot docs for your release.)
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.smolvla.configuration_smolvla import SmolVLAConfig
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load your dataset (collected at SVRC or locally)
dataset = LeRobotDataset("your_org/your_dataset")

# Configure the policy: two RGB cameras plus proprioceptive state in,
# 7-DoF actions out, predicted in chunks of 50 steps
config = SmolVLAConfig(
    input_shapes={
        "observation.images.top": [3, 480, 640],
        "observation.images.wrist": [3, 480, 640],
        "observation.state": [7],  # 6 joints + gripper
    },
    output_shapes={
        "action": [7],
    },
    action_chunk_size=50,
)

# Initialize from pre-trained weights
policy = SmolVLAPolicy(config, dataset_stats=dataset.stats)
policy.load_pretrained("HuggingFaceTB/SmolVLA-base")

# Fine-tune (see the LeRobot docs for the full training loop), e.g.:
# python lerobot/scripts/train.py \
#   --policy.type=smolvla \
#   --dataset.repo_id=your_org/your_dataset \
#   --training.num_epochs=100
```
SVRC Data Collection to Fine-Tuning Pipeline
The typical workflow for teams using SVRC's data collection services:
- Scope the task: Define manipulation tasks, object variations, success criteria. SVRC's solutions team helps determine the minimum dataset size (typically 100–200 demos for VLA fine-tuning).
- Collect demonstrations: SVRC operators collect expert demonstrations on OpenArm 101 or DK1 with synchronized multi-camera recording. Pilot campaign: 100 demos ($2,500). Full campaign: 500 demos ($8,000).
- Receive formatted data: Data is delivered in both HDF5 (ACT/Diffusion Policy compatible) and LeRobot format (SmolVLA/OpenVLA compatible).
- Fine-tune your VLA: Use the delivered dataset to fine-tune SmolVLA (consumer GPU) or OpenVLA (A100). SVRC can provide compute access or guidance on cloud GPU setup.
- Evaluate and iterate: Deploy the fine-tuned policy on your robot. If success rates are below target, collect additional targeted demonstrations for failure cases.
Real-World Benchmarks: Accuracy vs Inference Speed
| Model | LIBERO-Long (5 tasks) | SIMPLER (avg) | Inference (RTX 4090) | Inference (A100) |
|---|---|---|---|---|
| RT-2 (55B) | N/A (closed) | N/A (closed) | N/A | ~1 Hz |
| OpenVLA (7.5B) | 90% | 72% | ~3 Hz | ~8 Hz |
| π0 | N/A (closed) | N/A (closed) | N/A | ~5 Hz (est.) |
| SmolVLA (450M) | 86% | 68% | ~20 Hz | ~30 Hz |
| Octo (93M) | 72% | 58% | ~10 Hz | ~25 Hz |
| RoboFlamingo (9B) | 82% (CALVIN) | N/A | ~2 Hz | ~6 Hz |
Important caveat: Benchmark numbers are measured on specific datasets and evaluation protocols. Real-world performance depends on your specific task, environment, and robot. A model that achieves 90% on LIBERO may achieve 60% on your task if the visual environment or object set differs significantly. Always evaluate on your target setup.