VLA Models Compared: RT-2, OpenVLA, Pi0, SmolVLA, RoboFlamingo (2025)
A practical comparison of the most important Vision-Language-Action models — what they are, how they differ, and how to choose between them.
VLA models ground language instructions in visual perception to produce robot actions
What Is a VLA Model?
A Vision-Language-Action (VLA) model is a neural network that takes visual observations (camera images) and a natural language instruction as input, and produces robot action predictions as output. The key distinction from earlier robot learning models is the language conditioning: the same model can follow different instructions without retraining, enabling task generalization that pure imitation learning policies cannot achieve.
The typical VLA architecture chains three components:
- Vision encoder: Processes raw pixel observations into a feature representation (commonly a CLIP ViT or SigLIP)
- Language model backbone: A pre-trained LLM (Llama, Gemma, PaLM, Flamingo-style) that receives visual tokens alongside language tokens and reasons jointly over both
- Action head: A lightweight decoder that maps LLM output tokens to robot action predictions — either discretized action tokens (RT-2, OpenVLA) or continuous action vectors (Pi0, SmolVLA)
The premise is that internet-scale pre-training on vision-language data gives the LLM backbone grounded world knowledge that transfers to robot manipulation with far less robot-specific data than training from scratch.
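The three-stage dataflow can be made concrete with a toy sketch. Everything below is a stand-in (random linear maps instead of a real ViT, LLM, and trained action head; the 16x16 patch size, 64-dim embedding, and 7-DoF action space are illustrative assumptions), but the shape of the computation matches the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three VLA components. A real system uses a ViT,
# a pretrained LLM, and a learned action head; here each is a single
# random linear map so the end-to-end dataflow is visible.
D = 64          # shared token embedding width (assumed)
ACTION_DIM = 7  # e.g., 6-DoF end-effector delta + gripper

W_vision = rng.normal(size=(3 * 16 * 16, D)) * 0.01   # patch -> visual token
W_backbone = rng.normal(size=(D, D)) * 0.01           # "LLM" token mixing
W_action = rng.normal(size=(D, ACTION_DIM)) * 0.01    # action head

def vla_forward(image, lang_tokens):
    """image: (H, W, 3) array; lang_tokens: (T, D) instruction embeddings."""
    # 1. Vision encoder: split into 16x16 patches, project each to a token.
    H, W, _ = image.shape
    patches = image.reshape(H // 16, 16, W // 16, 16, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 3 * 16 * 16)
    vis_tokens = patches @ W_vision                     # (N_patches, D)
    # 2. Backbone: reason jointly over visual + language tokens.
    tokens = np.concatenate([vis_tokens, lang_tokens], axis=0)
    hidden = np.tanh(tokens @ W_backbone)
    # 3. Action head: pool and decode one continuous action vector.
    return hidden.mean(axis=0) @ W_action               # (ACTION_DIM,)

action = vla_forward(rng.normal(size=(64, 64, 3)),
                     rng.normal(size=(8, D)))
print(action.shape)  # (7,)
```

Real models differ in step 3 (token decoding vs. continuous heads, covered below), but steps 1 and 2 are broadly shared across all five models in this comparison.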
The Five Models: What You Need to Know
RT-2 (Google DeepMind, 2023)
RT-2 (Robotic Transformer 2) is a VLA built by Google DeepMind on top of PaLI-X (5B and 55B variants) and PaLM-E (12B variant). It fine-tunes a massive vision-language model end-to-end on robot demonstrations, representing actions as text tokens in the same vocabulary as the language model. The result is a model that can reason about novel instructions at inference time — asking it "move the rice chips to the dinosaur" with an unseen pairing works because the LLM understands the semantics.
RT-2 demonstrated emergent reasoning capabilities: chain-of-thought prompting improved task performance, and the model could answer factual questions mid-execution. However, it is closed-source — Google has not released weights or training code. You cannot fine-tune RT-2 on your own robot. It is a benchmark reference and a proof of concept for the VLA paradigm, not a deployable tool for most labs.
OpenVLA (Stanford + UC Berkeley, 2024)
OpenVLA is the first fully open-source, large-scale VLA model. It is built on the Prismatic VLM (7B parameters: a Llama 2 backbone with a fused SigLIP + DINOv2 vision encoder) and fine-tuned on 970k robot demonstrations from Open X-Embodiment. OpenVLA discretizes actions into 256 bins per dimension and predicts them as language tokens, following RT-2's formulation.
The full training recipe, model weights, and fine-tuning code are publicly available on HuggingFace. Fine-tuning OpenVLA on a new task typically requires 50–200 demonstrations and takes ~6–12 hours on a single A100. The model runs inference at ~6 Hz on an A100 (it is not fast — the LLM forward pass is the bottleneck). OpenVLA-OFT (2025) adds an optimized fine-tuning recipe (parallel decoding and a continuous action head) that substantially improves both inference throughput and task success rates.
Pi0 (Physical Intelligence, 2024)
Pi0 is a VLA from Physical Intelligence, the startup founded by Karol Hausman, Sergey Levine, Chelsea Finn, and others from the Berkeley and Google robotics communities. Pi0 uses a flow matching action head rather than discrete action tokens — this enables smoother, more dexterous trajectories and higher-frequency control than token-based VLAs. The base model is built on PaliGemma (3B parameters) with a diffusion-inspired action generation head.
Pi0 is notable for demonstrating real-world dexterous manipulation at a level not previously shown by open-weight VLAs: folding laundry, cleaning up cluttered tables, assembling objects. The weights are partially open — the base Pi0 weights are available for research use, but the fine-tuning pipeline and production-scale training data remain proprietary. Physical Intelligence offers commercial fine-tuning services.
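At inference time, a flow matching head generates an action by integrating a learned velocity field from Gaussian noise (t=0) to a sample from the action distribution (t=1). The sketch below shows only the sampling loop; the "learned" field is a hand-written stand-in (the exact velocity for a straight-line probability path toward a fixed dummy action), not Pi0's actual network.

```python
import numpy as np

def sample_action_flow(velocity_field, action_dim=7, n_steps=10, seed=0):
    """Euler-integrate a velocity field from Gaussian noise (t=0)
    to an action sample (t=1), as in flow matching inference."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=action_dim)        # start from pure noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_field(x, t)  # one Euler step
    return x

# Toy "learned" field: for a straight-line path, the target velocity is
# (x1 - x_t) / (1 - t); here x1 is a fixed dummy 7-DoF action (assumed).
target = np.array([0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0])
v = lambda x, t: (target - x) / (1.0 - t)

action = sample_action_flow(v)
print(np.allclose(action, target))  # True: the flow lands on the target
```

Because the output is a continuous vector produced by a few cheap integration steps rather than an autoregressive token rollout, this style of head is what lets Pi0 reach higher control frequencies than token-based VLAs of similar size.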
SmolVLA (HuggingFace, 2025)
SmolVLA is HuggingFace's entry into the VLA space — a deliberately small, efficient VLA designed to run on consumer hardware. SmolVLA uses SmolLM2 (135M–1.7B parameters) as the language backbone, paired with SigLIP-400M for vision. The total model is 450M–2B parameters depending on configuration — roughly 3–15x smaller than OpenVLA.
SmolVLA runs inference at ~30 Hz on an RTX 3090, making it the first VLA genuinely practical for real-time robot control without enterprise GPUs. It is trained on LeRobot datasets and integrates natively with the LeRobot training stack. Performance lags OpenVLA on complex open-vocabulary tasks but matches or exceeds it on narrow task domains after fine-tuning. For labs with limited GPU budgets, SmolVLA is the most accessible starting point.
RoboFlamingo (Beijing Institute of Technology + Microsoft Research, 2023)
RoboFlamingo fine-tunes the OpenFlamingo vision-language model (9B parameters, based on MPT-7B + CLIP ViT-L) for robot manipulation. Unlike RT-2 and OpenVLA, RoboFlamingo keeps most of the pretrained VLM frozen and attaches a lightweight recurrent policy head that decodes the VLM's visual-language features into continuous actions. This makes adaptation cheap and flexible, but the large Flamingo backbone still makes inference slow (~3–5 Hz on an A100).
RoboFlamingo is primarily a research contribution rather than a production-ready model. Its key insight is that OpenFlamingo's in-context learning capability (feeding a few demonstration examples directly in the context window) transfers to robot manipulation: you can provide 1–4 demonstration trajectories as context and the model adapts without gradient updates. This "in-context robot learning" is genuinely novel and still underexplored in follow-up work.
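The in-context setup amounts to interleaving a few demonstration (observation, action) pairs ahead of the live observation in the model's context. The sketch below assembles such a prompt as a plain string; the `<image:...>` placeholder format and the `build_incontext_episode` helper are illustrative assumptions, not RoboFlamingo's actual tokenization (a real pipeline interleaves pixel tensors with text tokens).

```python
def build_incontext_episode(demos, query_image, instruction):
    """Assemble a Flamingo-style interleaved context: a few demonstration
    (observation, action) pairs followed by the live query observation,
    which the model completes with an action — no gradient updates."""
    context = [f"Instruction: {instruction}"]
    for image, action in demos:
        context.append(f"<image:{image}> Action: {action}")
    context.append(f"<image:{query_image}> Action:")  # model fills this in
    return "\n".join(context)

prompt = build_incontext_episode(
    demos=[("obs_0", "[0.1, 0.0, -0.2]"), ("obs_1", "[0.0, 0.1, 0.0]")],
    query_image="obs_live",
    instruction="pick up the red block",
)
print(prompt)
```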
At-a-Glance Comparison Table
| Model | Params | Base LLM | Training Data | Inference Hz | Open Source? | Fine-tune? |
|---|---|---|---|---|---|---|
| RT-2 | 5B–55B / 12B | PaLI-X / PaLM-E | RT-1 dataset + web | 1–3 Hz | No (Google internal) | No |
| OpenVLA | 7B | Llama 2 + SigLIP | Open X-Embodiment (970k) | 6 Hz (A100) | Yes (weights + code) | Yes (LoRA / full) |
| Pi0 | ~3B | PaliGemma | Physical Intelligence proprietary | 10–25 Hz | Weights (research) | Via Physical Intelligence (commercial) |
| SmolVLA | 450M–2B | SmolLM2 + SigLIP | LeRobot datasets | 30 Hz (RTX 3090) | Yes (fully open) | Yes (LeRobot native) |
| RoboFlamingo | 9B | OpenFlamingo (MPT-7B) | Open X-Embodiment | 3–5 Hz (A100) | Yes (weights + code) | Yes (+ in-context) |
Hardware Requirements for Each Model
| Model | Min GPU (inference) | Recommended (inference) | Fine-tuning GPU | VRAM (inference) |
|---|---|---|---|---|
| RT-2 | N/A (not available) | TPU pod (Google) | N/A | N/A |
| OpenVLA | RTX 3090 (slow) | A100 40GB | A100 80GB (full) / RTX 4090 (LoRA) | ~16GB (BF16) |
| Pi0 | RTX 3090 | RTX 4090 / A100 | A100 40GB | ~8GB (BF16) |
| SmolVLA | RTX 3080 10GB | RTX 3090 | RTX 3090 / RTX 4090 | 2–6GB |
| RoboFlamingo | A100 40GB | A100 80GB | 2× A100 80GB | ~22GB (BF16) |
Action Representations: A Critical Difference
How each model outputs actions determines what tasks it can do well and how fast it can run:
- Discrete action tokens (RT-2, OpenVLA, RoboFlamingo): Actions are quantized into bins (typically 256 per dimension) and predicted as text tokens by the LLM. Enables leveraging the full LLM autoregressive machinery and in-context learning. Drawback: quantization error limits precision on fine manipulation; control frequency capped by LLM inference speed.
- Continuous action head with flow matching (Pi0): Actions are predicted as continuous vectors via a diffusion-inspired flow matching process. Enables smooth, precise trajectories and dexterous control. Higher frequency than token-based models at equivalent parameter count.
- Continuous action expert (SmolVLA): the compact VLM handles language and perception in tokens, while a separate flow-matching action expert emits continuous action chunks. Balances language generalization with control precision at a small parameter budget.
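The discrete-token formulation is simple to implement and its quantization error is easy to bound. A minimal sketch of the RT-2/OpenVLA-style binning (256 bins per dimension; the [-1, 1] action range is an assumed normalization):

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as in RT-2 / OpenVLA

def discretize(action, low, high):
    """Map continuous actions to integer bin indices (action 'tokens')."""
    clipped = np.clip(action, low, high)
    ids = np.floor((clipped - low) / (high - low) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)  # top edge falls into the last bin

def undiscretize(ids, low, high):
    """Map bin indices back to bin-center continuous values."""
    return low + (ids + 0.5) / N_BINS * (high - low)

low, high = -1.0, 1.0  # assumed normalized action range
action = np.array([0.37, -0.82, 0.05, 0.0, 0.51, -0.11, 1.0])
ids = discretize(action, low, high)
recovered = undiscretize(ids, low, high)

# Round-trip error is bounded by half a bin width: (high - low) / (2 * 256)
print(np.max(np.abs(recovered - action)) <= (high - low) / (2 * N_BINS))  # True
```

That worst-case error (~0.004 on a normalized range) is harmless for coarse pick-and-place but is exactly the precision ceiling that motivates continuous heads for fine manipulation.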
When to Fine-Tune vs Train From Scratch
For virtually all teams, fine-tuning a pre-trained VLA is the right starting point. Training a VLA from scratch requires millions of robot demonstrations and massive compute — resources that only a handful of industrial labs such as Google DeepMind and Physical Intelligence can marshal. The practical question is which pre-trained model to fine-tune and how.
Fine-tune when:
- Your target tasks are similar in style to the pre-training data (tabletop manipulation, pick-and-place, kitchen tasks)
- You have 50–500 demonstrations of the target task
- Your robot morphology is represented in Open X-Embodiment or LeRobot datasets
- You need reasonable zero-shot generalization within a task category
Train from scratch (or use non-VLA policy) when:
- Your task domain is radically different from pre-training (underwater manipulation, surgical robotics, industrial assembly)
- Your action space is non-standard (e.g., hydraulics, cable-driven, >7 DoF hands)
- You need maximum inference speed and can afford to limit task generality — in this case, ACT or Diffusion Policy trained from scratch will outperform a fine-tuned VLA on a specific narrow task
- Interpretability or safety constraints make large LLM backbones problematic
Fine-Tuning OpenVLA: Practical Steps
OpenVLA is the most accessible starting point for labs that want to run their own fine-tuning. The following is a minimal workflow:
```shell
# Install OpenVLA from source (it is not distributed on PyPI)
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .

# Fine-tune with LoRA (fits on a single RTX 4090)
torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir /path/to/your/rlds/dataset \
  --dataset_name your_task_name \
  --run_root_dir ./runs \
  --use_lora True \
  --lora_rank 32 \
  --batch_size 16 \
  --max_steps 10000 \
  --learning_rate 2e-4
```
Expected fine-tuning time with LoRA on RTX 4090: ~6 hours for 10k steps with 100 demonstrations. Full fine-tuning requires an A100 80GB and takes ~18–24 hours.
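The reason LoRA fits on a consumer GPU is that only a tiny fraction of the 7B weights is trainable. A back-of-the-envelope sketch (assumed figures: 32 layers, hidden size 4096, adapters on the 4 attention projections, all treated as square matrices for simplicity):

```python
def lora_params(d_model, n_layers, rank, n_matrices_per_layer=4):
    """Trainable parameters when rank-r LoRA adapters (two d_model x r
    factors each) are attached to `n_matrices_per_layer` square projection
    matrices per layer. Rough sketch — real layers mix matrix shapes."""
    return n_layers * n_matrices_per_layer * 2 * d_model * rank

# Ballpark for a Llama-2-7B-class backbone (assumed: 32 layers, d_model 4096)
trainable = lora_params(d_model=4096, n_layers=32, rank=32)
total = 7_000_000_000
print(f"{trainable / 1e6:.0f}M trainable ({100 * trainable / total:.2f}% of 7B)")
# e.g. 34M trainable (0.48% of 7B)
```

Under half a percent of the weights receive gradients, so optimizer state and gradient memory shrink accordingly — which is what brings full fine-tuning's A100 80GB requirement down to a single RTX 4090.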
Fine-Tuning SmolVLA: Practical Steps
SmolVLA integrates with LeRobot directly, making it the fastest path to a running VLA on your robot:
```shell
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=your-username/your-dataset \
  --output_dir=outputs/train/smolvla_task \
  --batch_size=64 \
  --steps=20000
```
Which Model Should You Use?
Start with SmolVLA if:
- Your lab has consumer GPUs (RTX 3090 or similar)
- You are already using the LeRobot ecosystem
- You want the fastest iteration cycle from demo collection to robot inference
- Your task is relatively narrow (one table, consistent lighting, known objects)
Use OpenVLA if:
- You need the best available open-source generalization performance
- You have access to an A100 or H100
- Your task involves novel object categories or instruction phrasings not seen during training
- You want published benchmark results to compare against
Consider Pi0 if:
- You need dexterous manipulation quality (multi-finger, contact-rich)
- You are open to working with Physical Intelligence for fine-tuning
- Your task requires higher control frequency than OpenVLA can provide
Use RoboFlamingo if:
- You want in-context robot learning (adapting to new tasks with demonstration examples in the context window, no gradient updates)
- You are doing research on few-shot generalization rather than optimizing task performance
Reference RT-2 when:
- Establishing theoretical upper bounds for VLA performance on a benchmark
- Writing related work — it is the canonical VLA reference
- But do not plan your system around it — it is not deployable
Data Requirements for Successful Fine-Tuning
Regardless of which model you choose, demonstration quality is the most important variable. VLAs fine-tuned on noisy, inconsistent, or poorly-formatted data will underperform smaller non-VLA policies trained on clean data. Key requirements:
- Camera placement consistency: The VLA was pre-trained on demonstrations with specific camera configurations. Match as closely as possible, or include camera-pose augmentation in fine-tuning.
- Episode format: RLDS (TFDS/TFRecord-based) or LeRobot's native format. OpenVLA requires RLDS; SmolVLA prefers LeRobot. Convert before training, not during.
- Language annotations: Each episode needs a natural language instruction string. For narrow tasks, a single fixed instruction per task works. For generalization, vary phrasing during collection.
- Minimum demo count: SmolVLA: 50+; OpenVLA: 100+; Pi0: 200+ for new task types.
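The checks above are cheap to automate before committing GPU hours. A minimal validator sketch — the episode layout (an `instruction` string, a `cameras` dict, an `actions` list) is a hypothetical schema for illustration, not the exact LeRobot or RLDS structure:

```python
def validate_episodes(episodes, required_cameras=("top", "wrist")):
    """Sanity-check demonstrations before fine-tuning: every episode needs
    a non-empty language instruction, a consistent set of camera streams,
    and at least one action. Returns a list of human-readable problems."""
    problems = []
    for i, ep in enumerate(episodes):
        instr = ep.get("instruction", "")
        if not isinstance(instr, str) or not instr.strip():
            problems.append(f"episode {i}: missing language instruction")
        missing = [c for c in required_cameras if c not in ep.get("cameras", {})]
        if missing:
            problems.append(f"episode {i}: missing cameras {missing}")
        if len(ep.get("actions", [])) == 0:
            problems.append(f"episode {i}: empty action sequence")
    return problems

eps = [
    {"instruction": "pick up the cup", "cameras": {"top": ..., "wrist": ...},
     "actions": [[0.1] * 7]},
    {"instruction": "", "cameras": {"top": ...}, "actions": []},
]
print(validate_episodes(eps))
# episode 0 passes; episode 1 fails all three checks
```

Running a pass like this over every collected episode catches the silent failure mode described above: a fine-tune that converges but underperforms because a fraction of episodes were malformed.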
We offer data collection services — teleoperation on ALOHA, OpenArm, and DK1 hardware, with output in LeRobot and RLDS formats — for teams that want to fine-tune a VLA without building the full collection pipeline themselves.