VLA Models Compared: RT-2, OpenVLA, Pi0, SmolVLA, RoboFlamingo (2025)
A practical comparison of the most important Vision-Language-Action models — what they are, how they differ, and how to choose between them.
VLA models ground language instructions in visual perception to produce robot actions
What Is a VLA Model?
A Vision-Language-Action (VLA) model is a neural network that takes visual observations (camera images) and a natural language instruction as input, and produces robot action predictions as output. The key distinction from earlier robot learning models is the language conditioning: the same model can follow different instructions without retraining, enabling task generalization that pure imitation learning policies cannot achieve.
The typical VLA architecture chains three components:
- Vision encoder: Processes raw pixel observations into a feature representation (commonly a CLIP ViT or SigLIP)
- Language model backbone: A pre-trained LLM (Llama, Gemma, PaLM, Flamingo-style) that receives visual tokens alongside language tokens and reasons jointly over both
- Action head: A lightweight decoder that maps LLM output tokens to robot action predictions — either discretized action tokens (RT-2, OpenVLA) or continuous action vectors (Pi0, SmolVLA)
The premise is that internet-scale pre-training on vision-language data gives the LLM backbone grounded world knowledge that transfers to robot manipulation with far less robot-specific data than training from scratch.
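The three-stage dataflow can be made concrete with a toy sketch. Everything below is a stand-in (random linear maps instead of a real ViT, LLM, and trained action head; the 16x16 patch size, 64-dim embedding, and 7-DoF action space are illustrative assumptions), but the shape of the computation matches the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three VLA components. A real system uses a ViT,
# a pretrained LLM, and a learned action head; here each is a single
# random linear map so the end-to-end dataflow is visible.
D = 64          # shared token embedding width (assumed)
ACTION_DIM = 7  # e.g., 6-DoF end-effector delta + gripper

W_vision = rng.normal(size=(3 * 16 * 16, D)) * 0.01   # patch -> visual token
W_backbone = rng.normal(size=(D, D)) * 0.01           # "LLM" token mixing
W_action = rng.normal(size=(D, ACTION_DIM)) * 0.01    # action head

def vla_forward(image, lang_tokens):
    """image: (H, W, 3) array; lang_tokens: (T, D) instruction embeddings."""
    # 1. Vision encoder: split into 16x16 patches, project each to a token.
    H, W, _ = image.shape
    patches = image.reshape(H // 16, 16, W // 16, 16, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 3 * 16 * 16)
    vis_tokens = patches @ W_vision                     # (N_patches, D)
    # 2. Backbone: reason jointly over visual + language tokens.
    tokens = np.concatenate([vis_tokens, lang_tokens], axis=0)
    hidden = np.tanh(tokens @ W_backbone)
    # 3. Action head: pool and decode one continuous action vector.
    return hidden.mean(axis=0) @ W_action               # (ACTION_DIM,)

action = vla_forward(rng.normal(size=(64, 64, 3)),
                     rng.normal(size=(8, D)))
print(action.shape)  # (7,)
```

Real models differ in step 3 (token decoding vs. continuous heads, covered below), but steps 1 and 2 are broadly shared across all five models in this comparison.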
The Five Models: What You Need to Know
RT-2 (Google DeepMind, 2023)
RT-2 (Robotic Transformer 2) is a VLA built by Google DeepMind on top of PaLI-X (5B and 55B variants) and PaLM-E (12B variant). It fine-tunes a massive vision-language model end-to-end on robot demonstrations, representing actions as text tokens in the same vocabulary as the language model. The result is a model that can reason about novel instructions at inference time — asking it "move the rice chips to the dinosaur" with an unseen pairing works because the LLM understands the semantics.
RT-2 demonstrated emergent reasoning capabilities: chain-of-thought prompting improved task performance, and the model could answer factual questions mid-execution. However, it is closed-source — Google has not released weights or training code. You cannot fine-tune RT-2 on your own robot. It is a benchmark reference and a proof of concept for the VLA paradigm, not a deployable tool for most labs.
OpenVLA (Stanford + UC Berkeley, 2024)
OpenVLA is the first fully open-source, large-scale VLA model. It is built on the Prismatic VLM (7B parameters: a Llama 2 backbone with a fused SigLIP + DINOv2 vision encoder) and fine-tuned on 970k robot demonstrations from Open X-Embodiment. OpenVLA discretizes actions into 256 bins per dimension and predicts them as language tokens, following RT-2's formulation.
The full training recipe, model weights, and fine-tuning code are publicly available on HuggingFace. Fine-tuning OpenVLA on a new task typically requires 50–200 demonstrations and takes ~6–12 hours on a single A100. The model runs inference at ~6 Hz on an A100 (it is not fast — the LLM forward pass is the bottleneck). OpenVLA-OFT (2025) adds an optimized fine-tuning recipe (parallel decoding and a continuous action head) that substantially improves both inference throughput and task success rates.
Pi0 (Physical Intelligence, 2024)
Pi0 is a VLA from Physical Intelligence, the startup founded by Karol Hausman, Sergey Levine, Chelsea Finn, and others from the Berkeley and Google robotics communities. Pi0 uses a flow matching action head rather than discrete action tokens — this enables smoother, more dexterous trajectories and higher-frequency control than token-based VLAs. The base model is built on PaliGemma (3B parameters) with a diffusion-inspired action generation head.
Pi0 is notable for demonstrating real-world dexterous manipulation at a level not previously shown by open-weight VLAs: folding laundry, cleaning up cluttered tables, assembling objects. The weights are partially open — the base Pi0 weights are available for research use, but the fine-tuning pipeline and production-scale training data remain proprietary. Physical Intelligence offers commercial fine-tuning services.
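At inference time, a flow matching head generates an action by integrating a learned velocity field from Gaussian noise (t=0) to a sample from the action distribution (t=1). The sketch below shows only the sampling loop; the "learned" field is a hand-written stand-in (the exact velocity for a straight-line probability path toward a fixed dummy action), not Pi0's actual network.

```python
import numpy as np

def sample_action_flow(velocity_field, action_dim=7, n_steps=10, seed=0):
    """Euler-integrate a velocity field from Gaussian noise (t=0)
    to an action sample (t=1), as in flow matching inference."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=action_dim)        # start from pure noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_field(x, t)  # one Euler step
    return x

# Toy "learned" field: for a straight-line path, the target velocity is
# (x1 - x_t) / (1 - t); here x1 is a fixed dummy 7-DoF action (assumed).
target = np.array([0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0])
v = lambda x, t: (target - x) / (1.0 - t)

action = sample_action_flow(v)
print(np.allclose(action, target))  # True: the flow lands on the target
```

Because the output is a continuous vector produced by a few cheap integration steps rather than an autoregressive token rollout, this style of head is what lets Pi0 reach higher control frequencies than token-based VLAs of similar size.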
SmolVLA (HuggingFace, 2025)
SmolVLA is HuggingFace's entry into the VLA space — a deliberately small, efficient VLA designed to run on consumer hardware. SmolVLA uses SmolLM2 (135M–1.7B parameters) as the language backbone, paired with SigLIP-400M for vision. The total model is 450M–2B parameters depending on configuration — roughly 3–15x smaller than OpenVLA.
SmolVLA runs inference at ~30 Hz on an RTX 3090, making it the first VLA genuinely practical for real-time robot control without enterprise GPUs. It is trained on LeRobot datasets and integrates natively with the LeRobot training stack. Performance lags OpenVLA on complex open-vocabulary tasks but matches or exceeds it on narrow task domains after fine-tuning. For labs with limited GPU budgets, SmolVLA is the most accessible starting point.
RoboFlamingo (Beijing Institute of Technology + Microsoft Research, 2023)
RoboFlamingo fine-tunes the OpenFlamingo vision-language model (9B parameters, based on MPT-7B + CLIP ViT-L) for robot manipulation. Unlike RT-2 and OpenVLA, RoboFlamingo keeps most of the pretrained VLM frozen and attaches a lightweight recurrent policy head that decodes the VLM's visual-language features into continuous actions. This makes adaptation cheap and flexible, but the large Flamingo backbone still makes inference slow (~3–5 Hz on an A100).
RoboFlamingo is primarily a research contribution rather than a production-ready model. Its key insight is that OpenFlamingo's in-context learning capability (feeding a few demonstration examples directly in the context window) transfers to robot manipulation: you can provide 1–4 demonstration trajectories as context and the model adapts without gradient updates. This "in-context robot learning" is genuinely novel and still underexplored in follow-up work.
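The in-context setup amounts to interleaving a few demonstration (observation, action) pairs ahead of the live observation in the model's context. The sketch below assembles such a prompt as a plain string; the `<image:...>` placeholder format and the `build_incontext_episode` helper are illustrative assumptions, not RoboFlamingo's actual tokenization (a real pipeline interleaves pixel tensors with text tokens).

```python
def build_incontext_episode(demos, query_image, instruction):
    """Assemble a Flamingo-style interleaved context: a few demonstration
    (observation, action) pairs followed by the live query observation,
    which the model completes with an action — no gradient updates."""
    context = [f"Instruction: {instruction}"]
    for image, action in demos:
        context.append(f"<image:{image}> Action: {action}")
    context.append(f"<image:{query_image}> Action:")  # model fills this in
    return "\n".join(context)

prompt = build_incontext_episode(
    demos=[("obs_0", "[0.1, 0.0, -0.2]"), ("obs_1", "[0.0, 0.1, 0.0]")],
    query_image="obs_live",
    instruction="pick up the red block",
)
print(prompt)
```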
At-a-Glance Comparison Table
| Model | Params | Base LLM | Training Data | Inference Hz | Open Source? | Fine-tune? |
|---|---|---|---|---|---|---|
| RT-2 | 5B–55B / 12B | PaLI-X / PaLM-E | RT-1 dataset + web | 1–3 Hz | No (Google internal) | No |
| OpenVLA | 7B | Llama 2 + SigLIP | Open X-Embodiment (970k) | 6 Hz (A100) | Yes (weights + code) | Yes (LoRA / full) |
| Pi0 | ~3B | PaliGemma | Physical Intelligence proprietary | 10–25 Hz | Weights (research) | Via Physical Intelligence (commercial) |
| SmolVLA | 450M–2B | SmolLM2 + SigLIP | LeRobot datasets | 30 Hz (RTX 3090) | Yes (fully open) | Yes (LeRobot native) |
| RoboFlamingo | 9B | OpenFlamingo (MPT-7B) | Open X-Embodiment | 3–5 Hz (A100) | Yes (weights + code) | Yes (+ in-context) |
Hardware Requirements for Each Model
| Model | Min GPU (inference) | Recommended (inference) | Fine-tuning GPU | VRAM (inference) |
|---|---|---|---|---|
| RT-2 | N/A (not available) | TPU pod (Google) | N/A | N/A |
| OpenVLA | RTX 3090 (slow) | A100 40GB | A100 80GB (full) / RTX 4090 (LoRA) | ~16GB (BF16) |
| Pi0 | RTX 3090 | RTX 4090 / A100 | A100 40GB | ~8GB (BF16) |
| SmolVLA | RTX 3080 10GB | RTX 3090 | RTX 3090 / RTX 4090 | 2–6GB |
| RoboFlamingo | A100 40GB | A100 80GB | 2× A100 80GB | ~22GB (BF16) |
Action Representations: A Critical Difference
How each model outputs actions determines what tasks it can do well and how fast it can run:
- Discrete action tokens (RT-2, OpenVLA, RoboFlamingo): Actions are quantized into bins (typically 256 per dimension) and predicted as text tokens by the LLM. Enables leveraging the full LLM autoregressive machinery and in-context learning. Drawback: quantization error limits precision on fine manipulation; control frequency capped by LLM inference speed.
- Continuous action head with flow matching (Pi0): Actions are predicted as continuous vectors via a diffusion-inspired flow matching process. Enables smooth, precise trajectories and dexterous control. Higher frequency than token-based models at equivalent parameter count.
- Continuous action expert (SmolVLA): the compact VLM handles language and perception in tokens, while a separate flow-matching action expert emits continuous action chunks. Balances language generalization with control precision at a small parameter budget.
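The discrete-token formulation is simple to implement and its quantization error is easy to bound. A minimal sketch of the RT-2/OpenVLA-style binning (256 bins per dimension; the [-1, 1] action range is an assumed normalization):

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as in RT-2 / OpenVLA

def discretize(action, low, high):
    """Map continuous actions to integer bin indices (action 'tokens')."""
    clipped = np.clip(action, low, high)
    ids = np.floor((clipped - low) / (high - low) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)  # top edge falls into the last bin

def undiscretize(ids, low, high):
    """Map bin indices back to bin-center continuous values."""
    return low + (ids + 0.5) / N_BINS * (high - low)

low, high = -1.0, 1.0  # assumed normalized action range
action = np.array([0.37, -0.82, 0.05, 0.0, 0.51, -0.11, 1.0])
ids = discretize(action, low, high)
recovered = undiscretize(ids, low, high)

# Round-trip error is bounded by half a bin width: (high - low) / (2 * 256)
print(np.max(np.abs(recovered - action)) <= (high - low) / (2 * N_BINS))  # True
```

That worst-case error (~0.004 on a normalized range) is harmless for coarse pick-and-place but is exactly the precision ceiling that motivates continuous heads for fine manipulation.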
When to Fine-Tune vs Train From Scratch
For virtually all teams, fine-tuning a pre-trained VLA is the right starting point. Training a VLA from scratch requires millions of robot demonstrations and massive compute — resources that only a handful of industrial labs such as Google DeepMind and Physical Intelligence can marshal. The practical question is which pre-trained model to fine-tune and how.
Fine-tune when:
- Your target tasks are similar in style to the pre-training data (tabletop manipulation, pick-and-place, kitchen tasks)
- You have 50–500 demonstrations of the target task
- Your robot morphology is represented in Open X-Embodiment or LeRobot datasets
- You need reasonable zero-shot generalization within a task category
Train from scratch (or use non-VLA policy) when:
- Your task domain is radically different from pre-training (underwater manipulation, surgical robotics, industrial assembly)
- Your action space is non-standard (e.g., hydraulics, cable-driven, >7 DoF hands)
- You need maximum inference speed and can afford to limit task generality — in this case, ACT or Diffusion Policy trained from scratch will outperform a fine-tuned VLA on a specific narrow task
- Interpretability or safety constraints make large LLM backbones problematic
Fine-Tuning OpenVLA: Practical Steps
OpenVLA is the most accessible starting point for labs that want to run their own fine-tuning. The following is a minimal workflow:
```shell
# Install OpenVLA from source (it is not distributed on PyPI)
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .

# Fine-tune with LoRA (fits on a single RTX 4090)
torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir /path/to/your/rlds/dataset \
  --dataset_name your_task_name \
  --run_root_dir ./runs \
  --use_lora True \
  --lora_rank 32 \
  --batch_size 16 \
  --max_steps 10000 \
  --learning_rate 2e-4
```
Expected fine-tuning time with LoRA on RTX 4090: ~6 hours for 10k steps with 100 demonstrations. Full fine-tuning requires an A100 80GB and takes ~18–24 hours.
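The reason LoRA fits on a consumer GPU is that only a tiny fraction of the 7B weights is trainable. A back-of-the-envelope sketch (assumed figures: 32 layers, hidden size 4096, adapters on the 4 attention projections, all treated as square matrices for simplicity):

```python
def lora_params(d_model, n_layers, rank, n_matrices_per_layer=4):
    """Trainable parameters when rank-r LoRA adapters (two d_model x r
    factors each) are attached to `n_matrices_per_layer` square projection
    matrices per layer. Rough sketch — real layers mix matrix shapes."""
    return n_layers * n_matrices_per_layer * 2 * d_model * rank

# Ballpark for a Llama-2-7B-class backbone (assumed: 32 layers, d_model 4096)
trainable = lora_params(d_model=4096, n_layers=32, rank=32)
total = 7_000_000_000
print(f"{trainable / 1e6:.0f}M trainable ({100 * trainable / total:.2f}% of 7B)")
# e.g. 34M trainable (0.48% of 7B)
```

Under half a percent of the weights receive gradients, so optimizer state and gradient memory shrink accordingly — which is what brings full fine-tuning's A100 80GB requirement down to a single RTX 4090.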
Fine-Tuning SmolVLA: Practical Steps
SmolVLA integrates with LeRobot directly, making it the fastest path to a running VLA on your robot:
```shell
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=your-username/your-dataset \
  --output_dir=outputs/train/smolvla_task \
  --batch_size=64 \
  --steps=20000
```
Which Model Should You Use?
Start with SmolVLA if:
- Your lab has consumer GPUs (RTX 3090 or similar)
- You are already using the LeRobot ecosystem
- You want the fastest iteration cycle from demo collection to robot inference
- Your task is relatively narrow (one table, consistent lighting, known objects)
Use OpenVLA if:
- You need the best available open-source generalization performance
- You have access to an A100 or H100
- Your task involves novel object categories or instruction phrasings not seen during training
- You want published benchmark results to compare against
Consider Pi0 if:
- You need dexterous manipulation quality (multi-finger, contact-rich)
- You are open to working with Physical Intelligence for fine-tuning
- Your task requires higher control frequency than OpenVLA can provide
Use RoboFlamingo if:
- You want in-context robot learning (adapting to new tasks with demonstration examples in the context window, no gradient updates)
- You are doing research on few-shot generalization rather than optimizing task performance
Reference RT-2 when:
- Establishing theoretical upper bounds for VLA performance on a benchmark
- Writing related work — it is the canonical VLA reference
- But do not plan your system around it — it is not deployable
Data Requirements for Successful Fine-Tuning
Regardless of which model you choose, demonstration quality is the most important variable. VLAs fine-tuned on noisy, inconsistent, or poorly-formatted data will underperform smaller non-VLA policies trained on clean data. Key requirements:
- Camera placement consistency: The VLA was pre-trained on demonstrations with specific camera configurations. Match as closely as possible, or include camera-pose augmentation in fine-tuning.
- Episode format: RLDS (TFDS/TFRecord-based) or LeRobot's native format. OpenVLA requires RLDS; SmolVLA prefers LeRobot. Convert before training, not during.
- Language annotations: Each episode needs a natural language instruction string. For narrow tasks, a single fixed instruction per task works. For generalization, vary phrasing during collection.
- Minimum demo count: SmolVLA: 50+; OpenVLA: 100+; Pi0: 200+ for new task types.
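The checks above are cheap to automate before committing GPU hours. A minimal validator sketch — the episode layout (an `instruction` string, a `cameras` dict, an `actions` list) is a hypothetical schema for illustration, not the exact LeRobot or RLDS structure:

```python
def validate_episodes(episodes, required_cameras=("top", "wrist")):
    """Sanity-check demonstrations before fine-tuning: every episode needs
    a non-empty language instruction, a consistent set of camera streams,
    and at least one action. Returns a list of human-readable problems."""
    problems = []
    for i, ep in enumerate(episodes):
        instr = ep.get("instruction", "")
        if not isinstance(instr, str) or not instr.strip():
            problems.append(f"episode {i}: missing language instruction")
        missing = [c for c in required_cameras if c not in ep.get("cameras", {})]
        if missing:
            problems.append(f"episode {i}: missing cameras {missing}")
        if len(ep.get("actions", [])) == 0:
            problems.append(f"episode {i}: empty action sequence")
    return problems

eps = [
    {"instruction": "pick up the cup", "cameras": {"top": ..., "wrist": ...},
     "actions": [[0.1] * 7]},
    {"instruction": "", "cameras": {"top": ...}, "actions": []},
]
print(validate_episodes(eps))
# episode 0 passes; episode 1 fails all three checks
```

Running a pass like this over every collected episode catches the silent failure mode described above: a fine-tune that converges but underperforms because a fraction of episodes were malformed.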
We offer data collection services — teleoperation on ALOHA, OpenArm, and DK1 hardware, with output in LeRobot and RLDS formats — for teams that want to fine-tune a VLA without building the full collection pipeline themselves.