VLA Models Compared: OpenVLA vs π0 vs SmolVLA vs RT-2 (2026 Guide)

Vision-Language-Action models are the foundation models of robotics — neural networks that take camera images and language instructions as input and output robot motor commands. This guide compares every major VLA model available in 2026, covering architecture, performance, data requirements, and practical fine-tuning guidance.

What Is a VLA Model?

A Vision-Language-Action (VLA) model is a neural network that maps visual observations and language instructions to robot actions. The core idea: take a pre-trained vision-language model (which already understands images and text) and add an action output head that produces motor commands for a robot. The vision-language backbone provides semantic understanding ("pick up the red cup"), while the action head translates that understanding into physical movements (a sequence of joint angles or end-effector positions).
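
Stripped of the internals, the contract is a single mapping. A minimal sketch in Python (all class and method names here are illustrative, not from any real library):

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    image: List[List[List[float]]]  # H x W x C camera frame
    instruction: str                # e.g. "pick up the red cup"


class VLAPolicy(Protocol):
    def predict(self, obs: Observation) -> List[float]:
        """Return one action: joint angles or an end-effector pose."""
        ...


class DummyPolicy:
    # Stand-in policy: ignores the observation and returns a 7-DoF
    # zero action (6 joints + gripper), just to show the contract.
    def predict(self, obs: Observation) -> List[float]:
        return [0.0] * 7


action = DummyPolicy().predict(Observation(image=[], instruction="pick up the red cup"))
```

Every model in this guide implements some version of this interface; they differ in how `predict` is computed, not in what it consumes and produces.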

Architecture Overview

All VLA models share three components, though their implementations differ significantly:

  • Vision encoder: Converts camera images into feature vectors. Common choices are Vision Transformers (ViT) or SigLIP. The vision encoder is typically pre-trained on internet-scale image datasets (ImageNet, LAION) and frozen or lightly fine-tuned during VLA training.
  • Language backbone: Processes text instructions and integrates them with visual features. Most VLAs use a pre-trained large language model (LLaMA, Gemma, PaLI) as the backbone, leveraging its reasoning and instruction-following capabilities.
  • Action head: Generates robot actions from the fused vision-language representation. This is where VLA architectures diverge most significantly.
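
The data flow through these three components can be sketched with toy dimensions (all sizes and weights below are random and illustrative; real VLAs are orders of magnitude larger):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64            # shared embedding width (toy size)
n_joints = 7      # 6 joints + gripper

# 1. Vision encoder: image patches -> visual tokens
patches = rng.standard_normal((16, 192))        # 16 flattened patches
W_vision = rng.standard_normal((192, d))
visual_tokens = patches @ W_vision              # (16, d)

# 2. Language backbone: instruction tokens join the visual tokens,
#    and the fused sequence is processed together
text_tokens = rng.standard_normal((5, d))       # 5 instruction tokens
sequence = np.concatenate([visual_tokens, text_tokens])  # (21, d)
W_backbone = rng.standard_normal((d, d))
fused = np.tanh(sequence @ W_backbone)          # (21, d) fused representation

# 3. Action head: pool the sequence and map to motor commands
W_action = rng.standard_normal((d, n_joints))
action = fused.mean(axis=0) @ W_action          # (7,) one action vector
```

The interesting design decisions live in step 3, which is why the next section focuses on action representation.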

Action Representation: The Key Differentiator

How a VLA represents and generates actions determines its performance characteristics:

  • Discrete tokens (RT-2, OpenVLA): Actions are quantized into discrete tokens and generated autoregressively, like text tokens. Simple to implement but produces jerky single-step actions.
  • Chunked prediction (SmolVLA): The model predicts a sequence of future actions at once (typically 10–50 steps), inspired by ACT. Produces smoother trajectories and is more compute-efficient at inference time.
  • Flow matching (π0): Actions are generated by iteratively denoising a random sample into a valid action trajectory. Produces the smoothest actions and handles multi-modal distributions (tasks with multiple valid solutions) best, but requires multiple denoising steps at inference time.
  • Diffusion (Octo): Similar to flow matching but uses a DDPM-style diffusion process. Good for multi-modal actions but slower at inference than chunked prediction.
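
The discrete-token scheme used by RT-2 and OpenVLA is easy to sketch: each continuous action dimension in [-1, 1] is quantized into one of 256 bins, and the bin index is what the model actually emits. A round trip shows the resolution cost (the 256-bin convention is common to both models, but the exact binning code here is illustrative):

```python
N_BINS = 256

def discretize(x: float) -> int:
    """Map a continuous value in [-1, 1] to a bin index in [0, 255]."""
    x = max(-1.0, min(1.0, x))
    return min(int((x + 1.0) / 2.0 * N_BINS), N_BINS - 1)

def undiscretize(token: int) -> float:
    """Map a bin index back to the bin's center value."""
    return (token + 0.5) / N_BINS * 2.0 - 1.0

action = [0.31, -0.82, 0.05, 0.0, 0.99, -1.0, 1.0]   # 7-DoF action
tokens = [discretize(a) for a in action]
recovered = [undiscretize(t) for t in tokens]

# Worst-case quantization error is half a bin width: 1/256 ≈ 0.004
max_err = max(abs(a - r) for a, r in zip(action, recovered))
```

A ~0.4% quantization error per step is tolerable for coarse pick-and-place but compounds over trajectories, which is part of why continuous schemes (chunking, flow matching, diffusion) produce smoother motion.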

VLA Model Comparison Table

| Model | Creator | Open Source | Parameters | Action Type | Hardware Compatibility |
|---|---|---|---|---|---|
| RT-2 | Google DeepMind | No | 55B | Discrete tokens | Google RT-X robots |
| OpenVLA | Stanford / Berkeley | Yes (Apache 2.0) | 7.5B | Discrete tokens | Any |
| π0 | Physical Intelligence | No | ~3B (est.) | Flow matching | Any |
| SmolVLA | HuggingFace | Yes (Apache 2.0) | 450M | Chunked (ACT-style) | Any |
| Octo | UC Berkeley | Yes (MIT) | ~93M | Diffusion | Any |
| RoboFlamingo | Shanghai AI Lab | Yes | ~9B | Autoregressive | Any |
| Helix | Figure AI | No | Undisclosed | Undisclosed | Figure 02 |

Deep Dive: Each VLA Model

RT-2 (Google DeepMind)

Architecture: RT-2 is built on PaLI-X (55B parameters) or PaLM-E (12B parameters). The vision encoder is a ViT-e (4B parameters), and the language model processes both visual tokens and text tokens. Actions are represented as text strings of discretized joint positions (e.g., "1 128 91 241 5 101 127") and generated autoregressively by the language model.
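
Decoding such a string back into a command can be sketched as follows, assuming the first integer is a termination flag and the remaining integers are 256-bin discretized action dimensions (the dimension ordering and de-binning here are illustrative, not RT-2's exact code):

```python
def decode_rt2_action(text: str, n_bins: int = 256):
    """Parse an RT-2-style action string into (terminate, continuous dims)."""
    ints = [int(tok) for tok in text.split()]
    terminate = ints[0] == 1
    # Map each bin index back to a normalized value in [-1, 1]
    dims = [(b + 0.5) / n_bins * 2.0 - 1.0 for b in ints[1:]]
    return terminate, dims

terminate, dims = decode_rt2_action("1 128 91 241 5 101 127")
```

The elegance of this scheme is that the language model needs no new output machinery: actions are just another vocabulary it learns to emit.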

Training data: Trained on the RT-1 robot dataset: approximately 130,000 demonstrations collected with 13 robots over 17 months. Additionally leverages internet-scale vision-language pre-training from the PaLI-X backbone.

Performance: RT-2 demonstrated emergent capabilities — following instructions involving concepts never seen during robot training (e.g., "move the object to the Taylor Swift album"). On standard benchmarks, it achieved 62% success on novel semantic concepts, compared to 32% for RT-1.

Fine-tuning requirements: Not publicly available. Google has not released weights or training code.

Pros: Strongest emergent semantic generalization; first proof that VLA concept works at scale.

Cons: Closed source; requires massive compute; 55B parameters make real-time inference challenging; single-step discrete actions produce jerky motion.

OpenVLA (Stanford / Berkeley)

Architecture: Built on the Prismatic VLM backbone (a Llama 2 7B language model with fused SigLIP and DINOv2 vision encoders). Actions are discretized into 256 bins per dimension and predicted autoregressively as tokens. The dual vision encoder (SigLIP for semantic features + DINOv2 for spatial features) gives OpenVLA stronger visual grounding than single-encoder designs.

Training data: Trained on the Open X-Embodiment dataset (~970K trajectories from 22 robot embodiments). Fine-tuning requires as few as 100–200 demonstrations for a new task on a specific robot.

Performance: On the LIBERO benchmark, fine-tuned OpenVLA achieves 85–95% success depending on task difficulty. On real-robot evaluations with WidowX and Franka arms, it matches or exceeds task-specific ACT policies on seen tasks and significantly outperforms them on unseen task variations.

Fine-tuning requirements: 2x A100 (80 GB) for full fine-tuning; 1x A100 for LoRA fine-tuning. Training time: 8–24 hours for 200 demonstrations. LoRA reduces memory to ~40 GB at a small performance cost.

Pros: Fully open source with excellent documentation; strongest open-source baseline; large community and active development; works on any robot with simple action space mapping.

Cons: 7.5B parameters means ~300ms per action step on consumer GPUs (too slow for reactive tasks); discrete tokens produce less smooth actions than flow matching or chunked prediction; requires significant VRAM for fine-tuning.

π0 (Physical Intelligence)

Architecture: π0 uses a vision-language backbone (likely PaLI-based, details not fully disclosed) with a flow matching action head. Flow matching generates actions by iteratively transforming a noise sample into an action trajectory through a learned velocity field. This produces smooth, continuous trajectories without the discretization artifacts of token-based approaches.
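
The sampler behind flow matching is plain Euler integration: start from Gaussian noise and step along a velocity field until it becomes an action trajectory. In the sketch below the learned network is replaced by the analytic field that transports any point onto a fixed target, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.2, -0.5, 0.8, 0.0, 0.1, -0.3, 1.0])  # desired 7-DoF action

def velocity(x, t):
    # In a real model this is a neural net v_theta(x, t, observation);
    # this analytic field moves x linearly onto the target by t = 1.
    return (target - x) / (1.0 - t)

n_steps = 10
dt = 1.0 / n_steps
x = rng.standard_normal(7)          # start from pure noise
for k in range(n_steps):
    t = k * dt
    x = x + dt * velocity(x, t)     # one Euler denoising step

# After n_steps Euler steps, x has been transported onto the target.
```

The 10 steps in this loop are the inference-latency cost mentioned in the cons below: each one is a full forward pass of the network in a real model.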

Training data: Trained on a proprietary dataset of thousands of hours of teleoperation data across multiple robot platforms (Franka, UR5, mobile manipulators, dexterous hands). The dataset is significantly larger and more diverse than Open X-Embodiment.

Performance: π0 demonstrated the broadest task generalization of any robot policy, performing laundry folding, table bussing, box packing, and assembly tasks across different robot embodiments from a single model. Quantitative benchmarks show 80–95% success on trained tasks and 40–60% zero-shot on related but unseen tasks.

Fine-tuning requirements: Not publicly available. Physical Intelligence offers API access for select partners.

Pros: Smoothest action generation; broadest task generalization demonstrated; handles contact-rich manipulation better than token-based VLAs; multi-embodiment support.

Cons: Closed source; no public weights; inference requires multiple denoising steps (10–20), adding latency; commercial access only through Physical Intelligence partnership.

SmolVLA (HuggingFace)

Architecture: SmolVLA uses SmolVLM (a compact vision-language model based on Idefics-3) as its backbone with a chunked action prediction head. The action head predicts a sequence of future actions (action chunk) in a single forward pass, inspired by ACT's chunked prediction. The vision encoder is a SigLIP-400M model, and the language backbone is a 500M parameter transformer.
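
On the execution side, chunked prediction becomes a receding-horizon loop: query the policy once, execute part of the returned chunk, then re-query with fresh observations. A sketch of that bookkeeping (chunk and horizon sizes are illustrative):

```python
CHUNK_SIZE = 50        # actions predicted per forward pass
EXECUTE_STEPS = 25     # steps executed before re-querying (receding horizon)

policy_calls = 0

def predict_chunk():
    """Stand-in for the policy: returns CHUNK_SIZE future actions."""
    global policy_calls
    policy_calls += 1
    return [[0.0] * 7 for _ in range(CHUNK_SIZE)]

executed = []
step = 0
horizon = 500          # total control steps in the episode
while step < horizon:
    chunk = predict_chunk()
    # Execute only the first EXECUTE_STEPS actions, then re-plan so the
    # policy can react to new observations mid-chunk.
    for action in chunk[:EXECUTE_STEPS]:
        executed.append(action)
        step += 1
        if step >= horizon:
            break
```

One forward pass covers many control steps, which is why a 450M-parameter chunked model can sustain higher control rates than a larger single-step model.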

Training data: Trained on a curated subset of Open X-Embodiment plus LeRobot community datasets. Fine-tuning requires 50–200 demonstrations.

Performance: On the LIBERO benchmark, SmolVLA achieves 82–90% success — within 5% of OpenVLA despite having 16x fewer parameters. On real-robot evaluations, fine-tuned SmolVLA matches OpenVLA on simple tasks and slightly underperforms on complex multi-step tasks. The key advantage is inference speed: SmolVLA runs at 15–30 Hz on a single RTX 4090, fast enough for reactive manipulation.

Fine-tuning requirements: 1x RTX 4090 (24 GB) for full fine-tuning; training time 4–12 hours for 200 demonstrations. The low resource requirement makes SmolVLA accessible to any lab with a modern consumer GPU.

Pros: Smallest and fastest open-source VLA; runs on consumer hardware; chunked prediction produces smooth actions; full training code available on HuggingFace; active community.

Cons: Smaller backbone limits language understanding compared to 7B+ models; less robust to novel instructions; chunked prediction can struggle with very long-horizon tasks.

Octo (UC Berkeley)

Architecture: Octo uses a transformer backbone with a diffusion action head. Unlike the VLA models above, Octo does not use a pre-trained language model backbone. Instead, it uses a smaller transformer trained from scratch on robot data, with optional language conditioning. The diffusion head generates action trajectories through iterative denoising.

Training data: Trained on the Open X-Embodiment dataset (~800K trajectories). Designed specifically for cross-embodiment transfer and rapid fine-tuning.

Performance: On the SIMPLER benchmark, Octo-Base achieves 50–75% success across multiple simulated environments. Its strength is adaptation speed: fine-tuning on 20–50 demonstrations of a new task takes 30 minutes to 2 hours on a single GPU.

Fine-tuning requirements: 1x RTX 3090 or better; training time 30 minutes to 2 hours. The lowest compute requirement of any model in this comparison.

Pros: Fastest fine-tuning; smallest compute requirement; diffusion head handles multi-modal actions well; designed for rapid prototyping.

Cons: Weakest language understanding (no LLM backbone); lower absolute performance than OpenVLA or SmolVLA on complex tasks; diffusion inference is slower than chunked prediction.

RoboFlamingo (Shanghai AI Lab)

Architecture: Built on OpenFlamingo (a multi-modal model based on MPT-7B), RoboFlamingo adds a robot action prediction head to a pre-trained vision-language model. It uses perceiver resampler modules to efficiently process multi-frame visual histories, enabling the model to reason about temporal dynamics.

Training data: Trained on a combination of CALVIN benchmark data and custom real-robot demonstrations. Can be fine-tuned on task-specific datasets of 100–500 demonstrations.

Performance: On the CALVIN benchmark, RoboFlamingo achieved state-of-the-art results at release on multi-step long-horizon tasks. Its temporal visual reasoning (processing sequences of frames rather than single frames) gives it an advantage on tasks requiring memory of previous steps.

Fine-tuning requirements: 2x A100 (80 GB) for full fine-tuning due to the 9B parameter backbone.

Pros: Strong long-horizon task performance; temporal visual reasoning; open source.

Cons: Large model size (9B); limited real-robot evaluations published; smaller community than OpenVLA or SmolVLA; autoregressive action generation produces single-step actions.

How to Choose: VLA Decision Framework

Selecting the right VLA model depends on three factors: your compute budget, your data availability, and your task requirements.

By Compute Budget

| Budget | Fine-Tuning Hardware | Recommended Model | Inference Speed |
|---|---|---|---|
| Consumer ($0–$2K GPU) | 1x RTX 4090 | SmolVLA or Octo | 15–30 Hz |
| Research lab ($5K–$20K) | 1–2x A100 | OpenVLA (LoRA) or SmolVLA | 3–15 Hz |
| Well-funded lab ($20K+) | 4+ A100 or H100 | OpenVLA (full) or custom VLA | 3–10 Hz |
| Enterprise | Cloud cluster | π0 (API) or custom training | Varies |

By Task Type

| Task | Key Requirement | Best Model | Why |
|---|---|---|---|
| Simple pick-and-place | Speed, low compute | SmolVLA or Octo | Fast inference, minimal data needed |
| Language-conditioned manipulation | Semantic understanding | OpenVLA | Strongest language backbone among open models |
| Contact-rich manipulation | Smooth, precise actions | π0 or SmolVLA | Flow matching / chunked actions handle contact better |
| Multi-step long-horizon | Temporal reasoning | RoboFlamingo or OpenVLA | Multi-frame processing, strong LLM backbone |
| Rapid prototyping | Fast iteration | Octo | 30-min fine-tuning on 20 demos |
| Bimanual coordination | Multi-arm action space | SmolVLA or Octo | Flexible action spaces, chunked prediction |

By Data Availability

  • 20–50 demonstrations: Octo (designed for rapid fine-tuning with minimal data)
  • 50–200 demonstrations: SmolVLA (good performance with moderate data) or ACT (non-VLA baseline, very effective with limited data)
  • 200–500 demonstrations: OpenVLA (full fine-tuning achieves best results at this scale)
  • 500+ demonstrations: Any model benefits; consider OpenVLA full fine-tuning or training a custom model

Fine-Tuning a VLA on Your Robot Data

Data Format Requirements

All major VLA models expect training data in one of two formats:

  • RLDS (Reinforcement Learning Datasets): The standard for Open X-Embodiment models. Used by Octo, OpenVLA, and RT-2. Each episode is stored as a sequence of (observation, action, reward) tuples in TFRecord format.
  • LeRobot format: HuggingFace's format for SmolVLA and the LeRobot training framework. Each episode is stored as a Parquet file with synchronized image and action columns, plus a metadata JSON file.
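
As a rough illustration of the second format, here is one episode as frame-level records plus a sidecar metadata file (schematic only; the real LeRobot schema has more fields and stores camera frames as video or image blobs rather than inline):

```python
import json

fps = 30
n_frames = 3  # tiny episode for illustration

# One row per control step; columns are synchronized by timestamp.
frames = [
    {
        "timestamp": i / fps,
        "episode_index": 0,
        "frame_index": i,
        "observation.state": [0.0] * 7,   # joint positions + gripper
        "action": [0.0] * 7,              # commanded action
    }
    for i in range(n_frames)
]

# Sidecar metadata describing the dataset (keys are illustrative).
meta = json.dumps({"fps": fps, "robot_type": "openarm_101", "total_episodes": 1})
```

The key property of both formats is the same: observations and actions are aligned per timestep, so any policy trainer can slice out (observation, action) pairs or action chunks.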

SVRC's data collection pipeline outputs both formats. See our data format guide for conversion details.

GPU Requirements and Training Time

| Model | Method | Min GPU | VRAM | Time (200 demos) |
|---|---|---|---|---|
| SmolVLA | Full fine-tune | 1x RTX 4090 | 24 GB | 4–12 hours |
| OpenVLA | LoRA | 1x A100 | 40 GB | 8–16 hours |
| OpenVLA | Full fine-tune | 2x A100 | 80 GB each | 12–24 hours |
| Octo | Full fine-tune | 1x RTX 3090 | 24 GB | 0.5–2 hours |
| RoboFlamingo | Full fine-tune | 2x A100 | 80 GB each | 16–32 hours |

LoRA vs Full Fine-Tuning

Low-Rank Adaptation (LoRA) freezes the pre-trained weights and trains small adapter matrices, reducing memory requirements by 50–70%. For OpenVLA:

  • Full fine-tuning: Updates all 7.5B parameters. Requires 2x A100 (80 GB). Achieves best performance, especially when your task distribution differs significantly from the pre-training data.
  • LoRA fine-tuning: Trains ~50M adapter parameters (rank 32). Requires 1x A100 (40 GB). Performance within 2–5% of full fine-tuning for most tasks. Recommended as the default approach unless you have excess compute.
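
The memory savings follow directly from the adapter shapes. A sketch of a single LoRA-adapted linear layer and its trainable-parameter count (toy layer size; rank 32 as in the recipe above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 4096, 4096, 32, 32

W = rng.standard_normal((d_out, d_in))          # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection (init 0)

def lora_forward(x):
    # Frozen path plus low-rank update: W x + (alpha / rank) * B (A x)
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)

full_params = W.size                 # 16.8M for this one layer
lora_params = A.size + B.size        # 262K: about 1.6% of the full layer
```

Initializing B to zero means the adapted layer starts out identical to the frozen one, so fine-tuning begins exactly from the pre-trained behavior.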

Example: Fine-Tuning SmolVLA with LeRobot

```python
# Install LeRobot first:
#   pip install lerobot

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.smolvla.configuration_smolvla import SmolVLAConfig
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load your dataset (collected at SVRC or locally)
dataset = LeRobotDataset("your_org/your_dataset")

# Configure the policy for a 7-DoF arm with two cameras
# (field names may differ slightly across LeRobot releases)
config = SmolVLAConfig(
    input_shapes={
        "observation.images.top": [3, 480, 640],
        "observation.images.wrist": [3, 480, 640],
        "observation.state": [7],  # 6 joints + gripper
    },
    output_shapes={
        "action": [7],
    },
    chunk_size=50,  # length of the predicted action chunk
)

# Initialize from pre-trained weights
# (exact signature and repo id may vary across LeRobot versions)
policy = SmolVLAPolicy.from_pretrained(
    "lerobot/smolvla_base", config=config, dataset_stats=dataset.stats
)

# Fine-tune (see the LeRobot docs for the full training loop):
#   python lerobot/scripts/train.py \
#       --policy.type=smolvla \
#       --dataset.repo_id=your_org/your_dataset \
#       --training.num_epochs=100
```

SVRC Data Collection to Fine-Tuning Pipeline

The typical workflow for teams using SVRC's data collection services:

  1. Scope the task: Define manipulation tasks, object variations, success criteria. SVRC's solutions team helps determine the minimum dataset size (typically 100–200 demos for VLA fine-tuning).
  2. Collect demonstrations: SVRC operators collect expert demonstrations on OpenArm 101 or DK1 with synchronized multi-camera recording. Pilot campaign: 100 demos ($2,500). Full campaign: 500 demos ($8,000).
  3. Receive formatted data: Data is delivered in both HDF5 (ACT/Diffusion Policy compatible) and LeRobot format (SmolVLA/OpenVLA compatible).
  4. Fine-tune your VLA: Use the delivered dataset to fine-tune SmolVLA (consumer GPU) or OpenVLA (A100). SVRC can provide compute access or guidance on cloud GPU setup.
  5. Evaluate and iterate: Deploy the fine-tuned policy on your robot. If success rates are below target, collect additional targeted demonstrations for failure cases.

Real-World Benchmarks: Accuracy vs Inference Speed

| Model | LIBERO-Long (5 tasks) | SIMPLER (avg) | Inference (RTX 4090) | Inference (A100) |
|---|---|---|---|---|
| RT-2 (55B) | N/A (closed) | N/A (closed) | N/A | ~1 Hz |
| OpenVLA (7.5B) | 90% | 72% | ~3 Hz | ~8 Hz |
| π0 | N/A (closed) | N/A (closed) | N/A | ~5 Hz (est.) |
| SmolVLA (450M) | 86% | 68% | ~20 Hz | ~30 Hz |
| Octo (93M) | 72% | 58% | ~10 Hz | ~25 Hz |
| RoboFlamingo (9B) | 82% (CALVIN) | N/A | ~2 Hz | ~6 Hz |

The inference speed tradeoff: For reactive tasks (catching objects, dynamic assembly), you need ≥10 Hz control. This limits practical choices to SmolVLA and Octo on consumer hardware. For slower manipulation tasks (pick-and-place, packaging), 3–5 Hz is sufficient, making OpenVLA viable. For contact-rich tasks requiring precision over speed (insertion, assembly), slower inference is acceptable if action quality is high.
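
That rule of thumb is simple to encode when screening candidates. A small helper using the approximate throughput numbers from the table above (the chunk_steps argument reflects that one forward pass of a chunked model yields many actions, though reactivity still depends on how often you re-query):

```python
# Approximate single-GPU (RTX 4090) inference rates from the table above, in Hz.
inference_hz = {"OpenVLA": 3, "SmolVLA": 20, "Octo": 10}

def meets_control_rate(model: str, required_hz: float, chunk_steps: int = 1) -> bool:
    """Effective action throughput is inference_hz * chunk_steps; a chunked
    model can emit many actions per forward pass."""
    return inference_hz[model] * chunk_steps >= required_hz

# Which models clear the >=10 Hz bar for reactive tasks, single-step?
reactive = {m for m in inference_hz if meets_control_rate(m, 10)}
```

With single-step queries only SmolVLA and Octo clear the bar on consumer hardware, matching the recommendation above; chunking can stretch a slower model's throughput at the cost of slower reaction to new observations.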

Important caveat: Benchmark numbers are measured on specific datasets and evaluation protocols. Real-world performance depends on your specific task, environment, and robot. A model that achieves 90% on LIBERO may achieve 60% on your task if the visual environment or object set differs significantly. Always evaluate on your target setup.

Related Reading

Related: Physical AI in 2026 · Action Chunking Transformers · Data Services · LeRobot Framework Guide · SpaceMouse Teleoperation · Hardware Catalog · Robotics Glossary