VLA Models Comparison 2026: Octo, OpenVLA, Pi0, GR00T, SmolVLA, and OpenPI
Vision-Language-Action (VLA) models are the dominant architecture for generalist robot learning in 2026. This page compares every major VLA model — architecture details, training data, inference speed, hardware compatibility, and real-robot performance — so you can choose the right model for your project.
Updated April 16, 2026 · SVRC Research Team · 12 min read
What Are VLA Models?
A Vision-Language-Action (VLA) model is an end-to-end neural network that takes camera images (vision) and a natural language task instruction (language) as input and outputs robot motor commands (action) — joint positions, end-effector velocities, or torques that a robot can directly execute.
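In code terms, a VLA is a function from (image, instruction) to an action vector. A minimal sketch of that contract, with a hypothetical `vla_policy_step` stub standing in for a real model (the 7-dimensional action space is an illustrative assumption, not any specific model's output):

```python
import numpy as np

def vla_policy_step(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stub illustrating the VLA contract: one RGB frame
    plus a language instruction in, one motor command vector out.
    A real model replaces this body with encoder + backbone + action head."""
    assert image.ndim == 3 and image.shape[2] == 3   # (H, W, 3) RGB frame
    assert isinstance(instruction, str)
    # Example action space: 6 end-effector deltas (xyz + rpy) + 1 gripper command
    return np.zeros(7, dtype=np.float32)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
action = vla_policy_step(frame, "pick up the red cup")
```

Everything downstream in this comparison is about what goes inside that function body and how fast it runs.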
The key insight behind VLAs is architectural unification. Before VLAs, robot learning systems were modular pipelines: a separate vision model detected objects, a separate planner generated a strategy, and a separate controller produced motor commands. Each module was trained independently, and errors cascaded between modules. VLAs replace this pipeline with a single model trained end-to-end on demonstrations, eliminating the interface engineering and error propagation.
The "language" component is what makes VLAs qualitatively different from earlier robot learning approaches like ACT or Diffusion Policy. By conditioning on natural language instructions ("pick up the red cup," "fold the towel in half"), a single VLA can handle an open-ended set of tasks rather than being limited to the specific task it was trained on. The language grounding comes from the pre-trained LLM backbone, which transfers its world knowledge to the physical domain.
Architecture Comparison
| Model | Vision Encoder | LLM Backbone | Action Head | Action Space | Context Length |
|---|---|---|---|---|---|
| Octo | Custom ViT | Transformer (93M total) | Diffusion | Continuous (EEF delta) | 2 frames |
| OpenVLA | SigLIP (400M) | LLaMA 2 (7B) | Discrete tokenization | 256 bins per dim | 2048 tokens |
| Pi0 | PaliGemma-based (est.) | Custom (~3B total) | Flow Matching | Continuous (joint + EEF) | Not disclosed |

| GR00T N1 | Eagle (NVIDIA) | Custom (~2B total) | Dual (diffusion + MLP) | Continuous (whole-body) | Not disclosed |
| SmolVLA | SigLIP (400M) | SmolLM2 (135M) | Diffusion | Continuous (joint pos) | 1024 tokens |
| OpenPI | SigLIP (400M) | LLaMA-based (~3B) | Flow Matching | Continuous (joint + EEF) | 2048 tokens |
Model Deep Dives
Octo (UC Berkeley)
Octo is the workhorse of the open-source VLA community. Built at the Berkeley Robot Learning Lab (Ghosh et al., 2024), Octo was designed with a clear engineering philosophy: be small enough to fine-tune on a single GPU, fast enough for real-time control, and general enough to work across embodiments.
Architecture: A compact transformer that processes tokenized image observations and language instructions through shared attention layers, then generates actions through a diffusion head. The diffusion head predicts a 16-step action chunk (0.32 seconds at 50 Hz), providing temporal smoothness without requiring a separate action chunking mechanism.
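The chunking described above amounts to a receding-horizon loop: run one forward pass, execute 16 actions, repeat. A sketch of that bookkeeping, where `predict_chunk` is a hypothetical stand-in for Octo's diffusion head:

```python
import numpy as np

CONTROL_HZ = 50
CHUNK_LEN = 16                # 16 steps * 0.02 s = 0.32 s per chunk

def predict_chunk(obs: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the diffusion head: one forward pass
    yields a whole (CHUNK_LEN, action_dim) action chunk."""
    return np.zeros((CHUNK_LEN, 7), dtype=np.float32)

def run(n_control_steps: int) -> int:
    """Drive the robot at 50 Hz, re-running the policy only when the
    current chunk is used up. Returns the number of policy calls."""
    policy_calls, idx, chunk = 0, CHUNK_LEN, None
    for _ in range(n_control_steps):
        if idx >= CHUNK_LEN:                       # chunk exhausted
            chunk = predict_chunk(np.zeros((256, 256, 3)))
            idx, policy_calls = 0, policy_calls + 1
        _action = chunk[idx]                       # sent to the robot every 0.02 s
        idx += 1
    return policy_calls

calls_per_second = run(CONTROL_HZ)   # 50 control steps need ceil(50 / 16) = 4 chunks
```

This is why chunked models tolerate slower inference: the policy only needs to keep up with the chunk rate (~3 Hz here), not the 50 Hz control rate.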
Training data: 800K episodes from the Open X-Embodiment dataset, heavily weighted toward the Bridge V2 dataset (WidowX robot) and RT-1 data (Google's Everyday Robots). This means Octo works best out-of-the-box on tabletop manipulation with WidowX-class arms.
Fine-tuning: Octo fine-tunes in 20-30 minutes on a single A100 with 50-100 demonstrations. This fast turnaround makes it ideal for rapid iteration: collect 50 demos in the morning, fine-tune over lunch, evaluate in the afternoon. At SVRC, we use Octo as the default baseline for pilot data collection projects because it provides the fastest feedback on data quality.
Limitations: The 93M parameter budget limits language understanding and visual reasoning. Octo struggles with tasks that require spatial reasoning ("put the cup to the left of the plate") or novel object categories not well-represented in Bridge V2. For these tasks, OpenVLA's larger LLM backbone performs significantly better.
Best for: Teams that need a working robot policy fast, have limited compute, or are iterating on data collection quality before committing to a larger model.
OpenVLA (Stanford / Berkeley)
OpenVLA (Kim et al., 2024) is the most capable open-source VLA, built on the insight that a 7B-parameter LLM backbone provides enough capacity for genuine language-conditioned task generalization. It was the first open model to demonstrate zero-shot transfer to novel task descriptions on unseen objects.
Architecture: SigLIP vision encoder (400M params) feeds visual tokens into a fine-tuned LLaMA 2 (7B params). Actions are tokenized into 256 discrete bins per dimension and predicted as next tokens — the same mechanism LLMs use for text generation. This discrete tokenization is less expressive than continuous diffusion heads but enables the model to leverage the full autoregressive generation machinery of the LLM.
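The 256-bin scheme is easy to sketch. Assuming actions are already normalized to [-1, 1] (real pipelines normalize using per-dataset action statistics first), the round-trip below shows the quantization error the discrete head accepts:

```python
import numpy as np

N_BINS = 256

def tokenize(action, low=-1.0, high=1.0):
    """Map each action dimension (assumed pre-normalized to [low, high])
    to one of 256 discrete bins, so it can be predicted as a next token."""
    norm = (np.clip(action, low, high) - low) / (high - low)        # -> [0, 1]
    return np.minimum((norm * N_BINS).astype(np.int64), N_BINS - 1)

def detokenize(tokens, low=-1.0, high=1.0):
    """Recover a continuous action from bin centers."""
    return low + (tokens + 0.5) / N_BINS * (high - low)

a = np.array([-0.73, 0.0, 0.42])
round_trip = detokenize(tokenize(a))
# worst-case round-trip error is half a bin width: (high - low) / 256 / 2 ~ 0.004
```

For most manipulation tasks that ~0.4% resolution is below the robot's own repeatability, which is why discrete tokenization works as well as it does.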
Training data: 970K episodes from the Open X-Embodiment dataset (22 robot embodiments). The broader embodiment coverage compared to Octo gives OpenVLA better cross-robot generalization.
Fine-tuning: Requires 8xA100 (or equivalent) for efficient fine-tuning due to the 7B parameter size. LoRA fine-tuning reduces this to 1-2 A100s at the cost of slightly lower performance. 100-200 demonstrations recommended for a new task. Fine-tuning takes 4-8 hours.
Inference: ~5 Hz on A100, ~2 Hz on Jetson AGX Orin. Requires action chunking (predict 10-20 steps, execute at 50 Hz) for real-time deployment. The 200-500ms per-inference latency means the robot operates on 0.2-0.4 second plans — acceptable for most manipulation tasks but problematic for reactive tasks like catching or insertion.
Best for: Language-conditioned multi-task systems where you want a single model to handle diverse task instructions. Research teams with adequate GPU resources.
Pi0 (Physical Intelligence)
Pi0 is the proprietary state-of-the-art from Physical Intelligence, the company founded by Sergey Levine, Chelsea Finn, and other Berkeley/Stanford alumni. Pi0's distinguishing technical contribution is the flow matching action head.
Flow matching vs. diffusion: Both are generative models that produce action trajectories, but flow matching uses straight-line interpolation paths (optimal transport) rather than the curved noise-to-signal paths of diffusion. In practice, flow matching converges in fewer denoising steps (4-8 vs. 16-32 for diffusion), enabling faster inference without quality loss.
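The few-step advantage is easiest to see with the ideal velocity field for a single target: along straight paths, even coarse Euler integration lands exactly on the target. The toy below integrates that exact field (which a trained model only approximates); it is an illustration of the sampling mechanism, not Pi0 code:

```python
import numpy as np

def flow_match_sample(x0, target, n_steps=8):
    """Euler-integrate the straight-line flow-matching velocity field
    v(x, t) = (target - x) / (1 - t), which carries any start point to
    `target` by t = 1. With straight paths, 4-8 steps suffice."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = (target - x) / (1.0 - t)    # straight-line velocity toward target
        x = x + dt * v                  # Euler step
    return x

noise = np.array([2.0, -1.0, 0.5])      # initial sample ("noise")
action = np.array([0.1, 0.3, -0.2])     # target action
sampled = flow_match_sample(noise, action, n_steps=8)
```

Diffusion's curved noise-to-signal paths accumulate integration error at coarse step counts, which is why it typically needs 16-32 steps for comparable quality.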
Performance: Pi0 demonstrations show cross-task generalization that no open-source model has matched: a single checkpoint folding laundry, clearing tables, packing boxes, and operating kitchen appliances. The gap between Pi0 and open models is estimated at 15-25 percentage points of success rate on multi-task benchmarks, attributable to both the larger proprietary dataset and the flow matching architecture.
Access: Pi0 is not publicly available. Physical Intelligence offers API access to select partners. The community reimplementation OpenPI (see below) provides an approximation using the published architecture details.
Best for: Teams with PI partnership agreements that need state-of-the-art performance and are willing to accept vendor lock-in.
GR00T N1 (NVIDIA)
GR00T N1 is NVIDIA's humanoid-focused foundation model, designed for the Isaac Lab ecosystem and optimized for Jetson Thor edge deployment.
Dual-system architecture: GR00T N1 uses two interconnected models: a "slow" VLA backbone (2-5 Hz) that interprets visual scenes and language instructions to produce high-level action plans, and a "fast" policy (200+ Hz) that converts plans to motor commands with reactive feedback. This mirrors the biological distinction between deliberate planning and reflexive motor control.
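The two-rate scheduling can be sketched as a counting loop. The 5 Hz / 200 Hz rates below are assumed values picked from the ranges above, and both "models" are placeholders:

```python
SLOW_HZ, FAST_HZ = 5, 200       # assumed rates, taken from the ranges above

def run_dual_system(duration_s: float) -> tuple[int, int]:
    """Count invocations in a two-rate loop: the fast policy runs every
    tick, while the slow VLA backbone is re-queried every
    FAST_HZ // SLOW_HZ ticks and its latest plan is reused in between."""
    slow_calls = fast_calls = 0
    ticks_per_plan = FAST_HZ // SLOW_HZ          # 40 fast ticks per plan
    plan = None
    for tick in range(round(duration_s * FAST_HZ)):
        if tick % ticks_per_plan == 0:
            plan = ("plan", tick)                # slow: scene + language -> plan
            slow_calls += 1
        _motor_cmd = (plan, tick)                # fast: latest plan -> motor command
        fast_calls += 1
    return slow_calls, fast_calls
```

The point of the design is visible in the counts: the expensive VLA runs 40x less often than the motor loop, so reactive feedback never waits on a slow forward pass.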
Sim-first training: Pre-trained on 1M+ episodes in Isaac Lab (physics simulation with domain randomization), then fine-tuned with 50K-100K real humanoid episodes. This sim-first approach works well for locomotion and whole-body balance but still requires significant real data for manipulation tasks.
Hardware ecosystem: Optimized for NVIDIA Jetson Thor (the humanoid-specific compute platform). Also runs on Jetson AGX Orin with reduced performance. GR00T SDK provides standardized interfaces for Figure, Agility Robotics, Apptronik, and 1X humanoids.
Best for: Humanoid robot projects, especially those already in the NVIDIA ecosystem. Not well-suited for tabletop manipulation arms.
SmolVLA (Hugging Face)
SmolVLA is Hugging Face's bet that VLA models can be made small enough for consumer hardware without sacrificing useful generalization. At roughly half a billion parameters, it is the most accessible VLA for researchers and hobbyists.
Architecture: SigLIP vision encoder (400M) paired with SmolLM2 (135M parameter language model) and a diffusion action head. The small language model limits complex reasoning but still provides meaningful language conditioning for common task descriptions.
LeRobot integration: SmolVLA is a first-class citizen in the LeRobot framework. Training, evaluation, and deployment scripts are maintained as part of the LeRobot codebase. Dataset loading from Hugging Face Hub is native. This tight integration means you can go from data collection to deployed policy with a single toolchain.
Performance: On standard benchmarks (SIMPLER, Bridge V2 eval), SmolVLA achieves 60-75% of OpenVLA's success rate at 1/14th the parameter count. For single-task fine-tuning, the gap narrows to 5-10% with sufficient demonstrations (200+).
Best for: LeRobot users, hobbyists with consumer GPUs, education, and teams that need to run inference on Jetson Orin NX ($499) rather than Jetson AGX Orin ($1,999).
OpenPI (Community)
OpenPI is a community reimplementation of Physical Intelligence's Pi0 architecture, based on the published paper and reverse-engineered design choices. It brings the flow matching action head that distinguishes Pi0 from other VLAs into an open-source package.
Architecture: SigLIP vision encoder with a LLaMA-based backbone (~3B parameters total) and flow matching action head. The action head uses 8-step flow matching for inference, achieving smooth continuous action predictions.
Training: Community-trained on a combination of Open X-Embodiment, DROID, and community-contributed datasets. Total training data is smaller than the original Pi0 dataset, which limits generalization.
Performance: Achieves approximately 70-80% of reported Pi0 performance on comparable benchmarks. The flow matching head provides notably smoother action trajectories than diffusion-based alternatives (Octo, SmolVLA), which some users prefer for contact-rich tasks.
Best for: Researchers interested in flow matching architectures who want open-source access and the ability to inspect and modify the full model.
Hardware Compatibility
| Model | WidowX / ALOHA | OpenArm 101 | Franka | Unitree G1 | Koch / SO-100 |
|---|---|---|---|---|---|
| Octo | Native | Fine-tune | Native | Fine-tune | Fine-tune |
| OpenVLA | Native | Fine-tune | Native | Fine-tune | Fine-tune |
| Pi0 | Supported | Via PI API | Supported | Via PI API | Not supported |
| GR00T N1 | Not targeted | Not targeted | Not targeted | Native | Not targeted |
| SmolVLA | Native | Native (LeRobot) | Fine-tune | Fine-tune | Native (LeRobot) |
| OpenPI | Community | Fine-tune | Community | Fine-tune | Community |
"Native" = pre-trained on data from this embodiment; works zero-shot or with minimal fine-tuning. "Fine-tune" = requires 50-200 demonstrations on this specific robot. "Community" = community-maintained integration; may require manual setup.
Training Your Own VLA
Most teams should fine-tune an existing VLA rather than train from scratch. Here is the practical decision tree:
Fine-tuning (recommended for most teams)
- Choose your base model: Octo for speed, OpenVLA for capability, SmolVLA for LeRobot integration
- Collect 50-200 demonstrations of your target task on your specific robot platform (see data collection guide)
- Format data: Convert to the model's expected format (LeRobot dataset format for SmolVLA/Octo, RLDS for OpenVLA)
- Fine-tune: Octo: 30 min on 1 GPU. SmolVLA: 1-2 hours on 1 GPU. OpenVLA: 4-8 hours on 8 GPUs (or 1-2 GPUs with LoRA)
- Evaluate: Run 50 evaluation episodes, measure success rate, analyze failure modes
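On the evaluation step: 50 episodes is enough to rank policies, but it leaves wide error bars on the measured success rate. A Wilson score interval (a standard statistics tool, not part of any VLA toolchain) makes that uncertainty concrete:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(35, 50)   # measured 70% success over 50 episodes
# the true rate plausibly lies anywhere in roughly the 56-81% range
```

In practice this means a 5-10 point difference between two policies over 50 episodes is noise; treat it as a tie and look at failure modes instead.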
Training from scratch (for specialized architectures)
Training from scratch requires 100K+ demonstrations and significant compute (64-256 GPU-hours for Octo-scale, 1,000+ GPU-hours for OpenVLA-scale). This is appropriate only if you have a novel architecture, proprietary large-scale data, or requirements that existing models cannot meet.
Inference Speed Benchmarks
| Model | A100 (80GB) | Jetson AGX Orin | Jetson Orin NX | RTX 4090 | Action Chunking Needed? |
|---|---|---|---|---|---|
| Octo | 40+ Hz | 15-20 Hz | 8-12 Hz | 30+ Hz | No (built-in) |
| OpenVLA | 5-8 Hz | 1.5-3 Hz | OOM | 4-6 Hz | Yes (10-20 steps) |
| Pi0 | 10-15 Hz | 5-8 Hz | Not tested | 8-12 Hz | Optional |
| SmolVLA | 25+ Hz | 10-15 Hz | 5-8 Hz | 20+ Hz | No (built-in) |
| OpenPI | 8-12 Hz | 3-5 Hz | OOM | 6-10 Hz | Optional |
Benchmarks measured with FP16 precision, batch size 1, single camera input. Real-world performance varies with camera resolution and number of views.
When to Use Which VLA
Use this decision tree to choose your model:
- Need results this week with minimal compute? → Octo. Fine-tune in 30 minutes, deploy same day.
- Need language-conditioned multi-task with strong generalization? → OpenVLA. Best open-source language grounding.
- Working with LeRobot and want native ecosystem support? → SmolVLA. Tightest LeRobot integration.
- Building a humanoid robot application? → GR00T N1. Purpose-built for whole-body humanoid control.
- Want the smoothest action trajectories (contact-rich tasks)? → OpenPI. Flow matching produces smoother motions than diffusion.
- Need maximum performance and have a PI partnership? → Pi0. State-of-the-art, but proprietary.
- Have a very specific single task and don't need language conditioning? → Consider ACT or Diffusion Policy instead of a VLA. They are simpler, faster to train, and often achieve higher single-task success rates.
Collecting VLA-Compatible Training Data
VLA training data has specific requirements that differ from data for simpler policies:
Camera Requirements
- Resolution: Minimum 640x480, recommended 1280x720. VLA vision encoders downsample to 224x224 or 336x336 internally, but higher source resolution preserves detail during downsampling.
- Frame rate: 30 fps for camera frames, synchronized with 50 Hz joint state recording. The mismatch is handled by the data loader (nearest-neighbor timestamp matching).
- Views: Minimum 1 wrist camera + 1 external camera. Most VLAs accept 2-4 views. Additional views improve performance but increase storage and inference cost.
- Calibration: Camera intrinsics and extrinsics should be recorded but are not required by most VLA architectures (they learn visual features directly from pixels).
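The 30 fps / 50 Hz mismatch noted above reduces to nearest-neighbor matching of timestamps in the data loader. A sketch with synthetic timestamps:

```python
import numpy as np

def nearest_frame_indices(frame_ts: np.ndarray, joint_ts: np.ndarray) -> np.ndarray:
    """For each joint-state timestamp, return the index of the closest
    camera frame. frame_ts must be sorted ascending."""
    # searchsorted gives the insertion point; compare the neighbors on each side
    right = np.clip(np.searchsorted(frame_ts, joint_ts), 1, len(frame_ts) - 1)
    left = right - 1
    use_right = (frame_ts[right] - joint_ts) < (joint_ts - frame_ts[left])
    return np.where(use_right, right, left)

frame_ts = np.arange(30) / 30.0    # 30 fps camera timestamps over one second
joint_ts = np.arange(50) / 50.0    # 50 Hz joint-state timestamps
idx = nearest_frame_indices(frame_ts, joint_ts)
# each joint sample pairs with a frame at most half a frame period (1/60 s) away
```

At these rates the worst-case image-to-state offset is about 13 ms, well within what VLA training tolerates.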
Action Format
- Joint positions: Record absolute joint positions at 50 Hz. This is the most universal format — all VLAs can consume it.
- End-effector pose: Also record EEF position + quaternion orientation if available. Some models (Octo, Pi0) can use EEF-space actions directly.
- Gripper state: Binary (open/closed) or continuous gripper position. Record at the same 50 Hz frequency as joint data.
Language Annotations
- Every episode needs a natural language task description: "pick up the red cup and place it on the saucer"
- Use specific, descriptive language. "Do the task" is useless; "pick up the leftmost object" is useful
- Vary language across episodes for the same task to improve language generalization: "grab the mug," "pick up the coffee cup," "take the cup from the table"
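One simple way to enforce that variety at recording time is to assign each episode a phrasing from a per-task list. The variants below are the examples from this section; the round-robin assignment is an illustrative choice (random sampling works equally well):

```python
# Phrasing variants for one task; assigning a different variant per
# episode improves language generalization.
CUP_TASK_VARIANTS = [
    "grab the mug",
    "pick up the coffee cup",
    "take the cup from the table",
]

def instruction_for_episode(variants: list[str], episode_id: int) -> str:
    """Cycle through phrasings so each appears equally often."""
    return variants[episode_id % len(variants)]

labels = [instruction_for_episode(CUP_TASK_VARIANTS, i) for i in range(30)]
```

Storing the assignment function alongside the dataset also keeps annotations reproducible if episodes are re-exported later.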
Output Formats
- HDF5: Standard for ACT, Diffusion Policy, and direct Octo consumption
- RLDS (TensorFlow Datasets): Required for OpenVLA pre-training data format
- LeRobot format: Parquet + video files on Hugging Face Hub. Native for SmolVLA
SVRC's data collection service outputs in all three formats by default. Every dataset delivered includes HDF5, RLDS, and LeRobot-compatible exports, so you can train on any VLA without reformatting.