
VLA Models Compared: RT-2, OpenVLA, Pi0, SmolVLA, RoboFlamingo (2025)

A practical comparison of the most important Vision-Language-Action models — what they are, how they differ, and how to choose between them.

VLA models ground language instructions in visual perception to produce robot actions


What Is a VLA Model?

A Vision-Language-Action (VLA) model is a neural network that takes visual observations (camera images) and a natural language instruction as input, and produces robot action predictions as output. The key distinction from earlier robot learning models is the language conditioning: the same model can follow different instructions without retraining, enabling task generalization that pure imitation learning policies cannot achieve.

The typical VLA architecture chains three components:

  1. Vision encoder: Processes raw pixel observations into a feature representation (commonly a CLIP ViT or SigLIP)
  2. Language model backbone: A pre-trained LLM (Llama, Gemma, PaLM, Flamingo-style) that receives visual tokens alongside language tokens and reasons jointly over both
  3. Action head: A lightweight decoder that maps LLM output tokens to robot action predictions — either discretized action tokens (like RT-2) or continuous action vectors (like OpenVLA, Pi0)

The premise is that internet-scale pre-training on vision-language data gives the LLM backbone grounded world knowledge that transfers to robot manipulation with far less robot-specific data than training from scratch.
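
To make the data flow concrete, here is a toy sketch of this three-stage pipeline in Python. Every component is a random-projection stand-in, not a real model — it only illustrates the tensor shapes and the hand-offs between stages:

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image):
    """Stand-in for a ViT/SigLIP: map an HxWx3 image to visual tokens."""
    patches = image.reshape(-1, 16 * 16 * 3)   # flatten 16x16 RGB patches
    W = rng.standard_normal((16 * 16 * 3, 64)) * 0.02
    return patches @ W                          # (num_patches, 64)

def llm_backbone(visual_tokens, text_tokens):
    """Stand-in for the LLM: jointly process visual and language tokens
    (real models attend over the sequence; here we concat and pool)."""
    seq = np.concatenate([visual_tokens, text_tokens], axis=0)
    return seq.mean(axis=0)                     # (64,) pooled hidden state

def action_head(hidden):
    """Stand-in for the action decoder: map hidden state to a 7-DoF
    action (xyz translation, rpy rotation, gripper)."""
    W = rng.standard_normal((64, 7)) * 0.02
    return W.T @ hidden                         # (7,)

image = rng.random((64, 64, 3))                 # fake camera frame
instruction = rng.standard_normal((8, 64))      # fake embedded instruction
action = action_head(llm_backbone(vision_encoder(image), instruction))
print(action.shape)                             # (7,)
```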

The Five Models: What You Need to Know

RT-2 (Google DeepMind, 2023)

RT-2 (Robotic Transformer 2) is a VLA built by Google DeepMind on top of PaLI-X (trained at 5B and 55B parameters) and PaLM-E (12B). It fine-tunes a massive vision-language model end-to-end on robot demonstrations, representing actions as text tokens in the same vocabulary as the language model. The result is a model that can reason about novel instructions at inference time — asking it to "move the rice chips to the dinosaur," an unseen pairing, works because the LLM understands the semantics.

RT-2 demonstrated emergent reasoning capabilities: chain-of-thought prompting improved task performance, and the model could answer factual questions mid-execution. However, it is closed-source — Google has not released weights or training code. You cannot fine-tune RT-2 on your own robot. It is a benchmark reference and a proof of concept for the VLA paradigm, not a deployable tool for most labs.

OpenVLA (Stanford + UC Berkeley, 2024)

OpenVLA is the first fully open-source, large-scale VLA model. It is built on the Prismatic VLM architecture (7B parameters: a Llama 2 backbone with fused SigLIP and DINOv2 vision encoders) and fine-tuned on 970k robot demonstration trajectories from Open X-Embodiment. OpenVLA discretizes actions into 256 bins per dimension and predicts them as language tokens, following RT-2's formulation.
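
The 256-bin discretization is simple to sketch (the exact per-dataset normalization bounds OpenVLA uses may differ; the [-1, 1] range here is an assumption):

```python
import numpy as np

N_BINS = 256  # RT-2/OpenVLA-style quantization: 256 bins per action dimension

def discretize(action, low=-1.0, high=1.0):
    """Map continuous actions in [low, high] to integer bin indices 0..255."""
    normalized = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((normalized * N_BINS).astype(int), N_BINS - 1)

def undiscretize(bins, low=-1.0, high=1.0):
    """Recover the bin-center value; error is at most half a bin width."""
    return low + (bins + 0.5) / N_BINS * (high - low)

action = np.array([0.1234, -0.9876, 0.5, 0.0, 1.0, -1.0, 0.33])
bins = discretize(action)
recovered = undiscretize(bins)
print(np.abs(recovered - action).max())  # bounded by half a bin width (1/256)
```

This round-trip error is exactly the quantization penalty discussed later: for a gripper spanning 1 m, half a bin is ~2 mm, which matters for fine manipulation.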

The full training recipe, model weights, and fine-tuning code are publicly available on HuggingFace. Fine-tuning OpenVLA on a new task typically requires 50–200 demonstrations and takes ~6–12 hours on a single A100. The model runs inference at ~6 Hz on an A100 (it is not fast — the LLM forward pass is the bottleneck). OpenVLA-OFT (2025) revises the fine-tuning recipe with parallel decoding, action chunking, and a continuous action head, substantially improving both inference throughput and task success rates.

Pi0 (Physical Intelligence, 2024)

Pi0 (π0) is a VLA from Physical Intelligence, the startup founded by Sergey Levine, Chelsea Finn, and others from the Berkeley robotics community. Pi0 uses a flow matching action head rather than discrete action tokens — this enables smoother, more dexterous trajectories and higher-frequency control than token-based VLAs. The base model is built on PaliGemma (3B parameters) with a diffusion-inspired flow matching head for action generation.

Pi0 is notable for demonstrating real-world dexterous manipulation at a level not previously shown by open-weight VLAs: folding laundry, cleaning up cluttered tables, assembling objects. The base Pi0 weights and fine-tuning code are available for research use through the openpi repository, but the production-scale training data remains proprietary. Physical Intelligence also offers commercial fine-tuning partnerships.

SmolVLA (HuggingFace, 2025)

SmolVLA is HuggingFace's entry into the VLA space — a deliberately small, efficient VLA designed to run on consumer hardware. SmolVLA uses SmolLM2 (135M–1.7B parameters) as the language backbone, paired with SigLIP-400M for vision. The total model is 450M–2B parameters depending on configuration — roughly 3–15x smaller than OpenVLA.

SmolVLA runs inference at ~30 Hz on an RTX 3090, making it the first VLA genuinely practical for real-time robot control without enterprise GPUs. It is trained on LeRobot datasets and integrates natively with the LeRobot training stack. Performance lags OpenVLA on complex open-vocabulary tasks but matches or exceeds it on narrow task domains after fine-tuning. For labs with limited GPU budgets, SmolVLA is the most accessible starting point.

RoboFlamingo (Beijing Institute of Technology + Microsoft Research, 2023)

RoboFlamingo fine-tunes the OpenFlamingo vision-language model (9B parameters, based on MPT-7B + CLIP ViT-L) for robot manipulation. Unlike RT-2 and OpenVLA, it keeps most of the VLM frozen and attaches a lightweight recurrent policy head that decodes continuous actions from the VLM's fused vision-language features, training mainly the cross-attention layers and the head. This makes adaptation cheap, but the large backbone still limits inference speed (~3–5 Hz on an A100).

RoboFlamingo is primarily a research contribution rather than a production-ready model. Its key insight is that OpenFlamingo's in-context learning capability (feeding a few demonstration examples directly in the context window) transfers to robot manipulation: you can provide 1–4 demonstration trajectories as context and the model adapts without gradient updates. This "in-context robot learning" is genuinely novel and still underexplored in follow-up work.
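
A hypothetical sketch of how such an in-context prompt could be assembled — the `<image>` placeholder and the textual action format are illustrative inventions here, since the real model interleaves image features rather than strings:

```python
# Assemble k demonstration (observation, instruction, action) triples into
# a few-shot context before the query, Flamingo-style. The <image> marker
# and "instruction -> action" layout are assumed formats for illustration.

def build_incontext_prompt(demos, query_instruction):
    """demos: list of (instruction, action_text) pairs from teleop data."""
    parts = []
    for instruction, action_text in demos:
        parts.append(f"<image> {instruction} -> {action_text}")
    parts.append(f"<image> {query_instruction} ->")  # model completes this
    return "\n".join(parts)

demos = [
    ("pick up the red block", "move(0.12, -0.03, 0.05); grip(close)"),
    ("pick up the blue block", "move(0.02, 0.08, 0.05); grip(close)"),
]
print(build_incontext_prompt(demos, "pick up the green block"))
```

The appeal is that adaptation happens entirely in the context window — no gradient updates, no fine-tuning run.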

At-a-Glance Comparison Table

| Model | Params | Base LLM | Training Data | Inference Hz | Open Source? | Fine-tune? |
|---|---|---|---|---|---|---|
| RT-2 | 5B–55B / 12B | PaLI-X / PaLM-E | RT-1 dataset + web | 1–3 Hz | No (Google internal) | No |
| OpenVLA | 7B | Llama 2 + SigLIP | Open X-Embodiment (970k) | 6 Hz (A100) | Yes (weights + code) | Yes (LoRA / full) |
| Pi0 | ~3B | PaliGemma | Physical Intelligence proprietary | 10–25 Hz | Weights (research) | Via Physical Intelligence (commercial) |
| SmolVLA | 450M–2B | SmolLM2 + SigLIP | LeRobot datasets | 30 Hz (RTX 3090) | Yes (fully open) | Yes (LeRobot native) |
| RoboFlamingo | 9B | OpenFlamingo (MPT-7B) | CALVIN | 3–5 Hz (A100) | Yes (weights + code) | Yes (+ in-context) |

Hardware Requirements for Each Model

| Model | Min GPU (inference) | Recommended (inference) | Fine-tuning GPU | VRAM (inference) |
|---|---|---|---|---|
| RT-2 | N/A (not available) | TPU pod (Google) | N/A | N/A |
| OpenVLA | RTX 3090 (slow) | A100 40GB | A100 80GB (full) / RTX 4090 (LoRA) | ~16GB (BF16) |
| Pi0 | RTX 3090 | RTX 4090 / A100 | A100 40GB | ~8GB (BF16) |
| SmolVLA | RTX 3080 10GB | RTX 3090 | RTX 3090 / RTX 4090 | 2–6GB |
| RoboFlamingo | A100 40GB | A100 80GB | 2× A100 80GB | ~22GB (BF16) |
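
The VRAM column follows from simple arithmetic: BF16 weights cost 2 bytes per parameter, plus overhead for activations and the KV cache. The 1.15 overhead factor below is a rough assumption, not a measurement:

```python
# Back-of-envelope inference VRAM: params x 2 bytes (BF16) x overhead.
# The 1.15 overhead factor is an assumed fudge for activations/KV cache.

def inference_vram_gb(params_billion, bytes_per_param=2, overhead=1.15):
    return params_billion * 1e9 * bytes_per_param * overhead / 1e9

for name, params_b in [("OpenVLA", 7), ("Pi0", 3.3), ("RoboFlamingo", 9)]:
    print(f"{name}: ~{inference_vram_gb(params_b):.0f} GB")
```

Running this reproduces the ~16GB and ~8GB figures in the table to within rounding.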

Action Representations: A Critical Difference

How each model outputs actions determines what tasks it can do well and how fast it can run:

  • Discrete action tokens (RT-2, OpenVLA): Actions are quantized into bins (typically 256 per dimension) and predicted as text tokens by the LLM. This leverages the full LLM autoregressive machinery and in-context learning. Drawback: quantization error limits precision on fine manipulation, and control frequency is capped by LLM inference speed.
  • Continuous action head with flow matching (Pi0): Actions are predicted as continuous vectors via a diffusion-inspired flow matching process. Enables smooth, precise trajectories and dexterous control at higher frequency than token-based models of equivalent parameter count.
  • Continuous policy head on a frozen VLM (RoboFlamingo): A lightweight recurrent head decodes continuous actions from the VLM's fused vision-language features. Avoids quantization error, but inference is still gated by the large backbone's forward pass.
  • Hybrid token + continuous (SmolVLA): Language tokens for reasoning; a continuous output head for action. Balances language generalization with control precision.
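
To illustrate the flow matching idea, here is a toy version of Pi0-style action generation: Euler-integrate a velocity field from Gaussian noise toward the action. The "learned" field is a hand-written stand-in that flows along a straight-line path, not a trained network:

```python
import numpy as np

target = np.array([0.2, -0.1, 0.4])     # pretend this is the expert action

def velocity(x, t):
    """Stand-in for the learned velocity field v_theta(x, t).

    For the linear path x_t = (1-t)*noise + t*target, the conditional
    velocity (target - noise) equals (target - x_t) / (1 - t)."""
    return (target - x) / max(1.0 - t, 1e-3)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)              # start from Gaussian noise at t=0
steps = 10
for i in range(steps):                  # Euler integration of dx/dt = v(x, t)
    t = i / steps
    x = x + velocity(x, t) / steps
print(np.abs(x - target).max())         # x has flowed onto the target
```

A real model replaces `velocity` with a network conditioned on the VLM's features, and generates a chunk of future actions rather than a single vector.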

When to Fine-Tune vs Train From Scratch

For virtually all teams, fine-tuning a pre-trained VLA is the right starting point. Training a VLA from scratch requires millions of robot demonstrations and massive compute — resources that only a handful of organizations such as Google DeepMind and Physical Intelligence can marshal. The practical question is which pre-trained model to fine-tune and how.

Fine-tune when:

  • Your target tasks are similar in style to the pre-training data (tabletop manipulation, pick-and-place, kitchen tasks)
  • You have 50–500 demonstrations of the target task
  • Your robot morphology is represented in Open X-Embodiment or LeRobot datasets
  • You need reasonable zero-shot generalization within a task category

Train from scratch (or use non-VLA policy) when:

  • Your task domain is radically different from pre-training (underwater manipulation, surgical robotics, industrial assembly)
  • Your action space is non-standard (e.g., hydraulics, cable-driven, >7 DoF hands)
  • You need maximum inference speed and can afford to limit task generality — in this case, ACT or Diffusion Policy trained from scratch will outperform a fine-tuned VLA on a specific narrow task
  • Interpretability or safety constraints make large LLM backbones problematic

Fine-Tuning OpenVLA: Practical Steps

OpenVLA is the most accessible starting point for labs that want to run their own fine-tuning. The following is a minimal workflow:

# Clone and install OpenVLA (it is not distributed on PyPI)
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .

# Fine-tune with LoRA (fits on a single RTX 4090)
# Script path and flag names follow the repo's README and may shift
# between versions — check vla-scripts/finetune.py --help
torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir /path/to/your/rlds/dataset \
  --dataset_name your_task_name \
  --run_root_dir ./runs \
  --use_lora True \
  --lora_rank 32 \
  --batch_size 16 \
  --max_steps 10000 \
  --learning_rate 2e-4

Expected fine-tuning time with LoRA on RTX 4090: ~6 hours for 10k steps with 100 demonstrations. Full fine-tuning requires an A100 80GB and takes ~18–24 hours.
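
A quick sanity check on those figures (treating the 6-hour estimate as given, not measured):

```python
# Implied per-step cost of the LoRA run above: 10k steps in ~6 hours.
steps = 10_000
hours = 6
sec_per_step = hours * 3600 / steps
print(f"~{sec_per_step:.1f} s per optimizer step at batch size 16")
```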

Fine-Tuning SmolVLA: Practical Steps

SmolVLA integrates with LeRobot directly, making it the fastest path to a running VLA on your robot:

# LeRobot replaced its Hydra configs with dataclass-style CLI flags;
# the arguments below follow the current LeRobot README and may change
# between releases — check `python lerobot/scripts/train.py --help`
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=your-username/your-dataset \
  --output_dir=outputs/train/smolvla_task \
  --batch_size=64 \
  --steps=20000

Which Model Should You Use?

Start with SmolVLA if:

  • Your lab has consumer GPUs (RTX 3090 or similar)
  • You are already using the LeRobot ecosystem
  • You want the fastest iteration cycle from demo collection to robot inference
  • Your task is relatively narrow (one table, consistent lighting, known objects)

Use OpenVLA if:

  • You need the best available open-source generalization performance
  • You have access to an A100 or H100
  • Your task involves novel object categories or instruction phrasings not seen during training
  • You want published benchmark results to compare against

Consider Pi0 if:

  • You need dexterous manipulation quality (multi-finger, contact-rich)
  • You are open to working with Physical Intelligence for fine-tuning
  • Your task requires higher control frequency than OpenVLA can provide

Use RoboFlamingo if:

  • You want in-context robot learning (adapting to new tasks with demonstration examples in the context window, no gradient updates)
  • You are doing research on few-shot generalization rather than optimizing task performance

Reference RT-2 when:

  • Establishing theoretical upper bounds for VLA performance on a benchmark
  • Writing related work — it is the canonical VLA reference
  • But do not plan your system around it — it is not deployable

Data Requirements for Successful Fine-Tuning

Regardless of which model you choose, demonstration quality is the most important variable. VLAs fine-tuned on noisy, inconsistent, or poorly-formatted data will underperform smaller non-VLA policies trained on clean data. Key requirements:

  • Camera placement consistency: The VLA was pre-trained on demonstrations with specific camera configurations. Match as closely as possible, or include camera-pose augmentation in fine-tuning.
  • Episode format: RLDS (TFDS-based) or LeRobot native format. OpenVLA requires RLDS; SmolVLA expects LeRobot format. Convert before training, not during.
  • Language annotations: Each episode needs a natural language instruction string. For narrow tasks, a single fixed instruction per task works. For generalization, vary phrasing during collection.
  • Minimum demo count: SmolVLA: 50+; OpenVLA: 100+; Pi0: 200+ for new task types.
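
These checks are easy to automate before launching a training run. The sketch below is hypothetical — the episode-dict layout and helper name are invented for illustration, not a LeRobot or RLDS API:

```python
# Pre-training sanity check mirroring the requirements above: every episode
# needs a language instruction and consistent camera keys, and the task
# needs a minimum demo count (thresholds from the guidance above).

MIN_DEMOS = {"smolvla": 50, "openvla": 100, "pi0": 200}

def validate_dataset(episodes, model="openvla"):
    errors = []
    if len(episodes) < MIN_DEMOS[model]:
        errors.append(f"need >= {MIN_DEMOS[model]} demos, got {len(episodes)}")
    camera_keys = None
    for i, ep in enumerate(episodes):
        if not ep.get("instruction"):
            errors.append(f"episode {i}: missing language instruction")
        keys = tuple(sorted(ep["cameras"]))
        if camera_keys is None:
            camera_keys = keys                 # first episode sets the reference
        elif keys != camera_keys:
            errors.append(f"episode {i}: camera keys {keys} != {camera_keys}")
    return errors

eps = [{"instruction": "pick up the cube", "cameras": ["wrist", "top"]}] * 100
print(validate_dataset(eps, "openvla"))  # [] -> ready to fine-tune
```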

We offer data collection services — teleoperation on ALOHA, OpenArm, and DK1 hardware, with output in LeRobot and RLDS formats — for teams that want to fine-tune a VLA without building the full collection pipeline themselves.
