VLA Models Comparison 2026: Octo, OpenVLA, Pi0, GR00T, SmolVLA, and OpenPI
Vision-Language-Action (VLA) models are the dominant architecture for generalist robot learning in 2026. This page compares every major VLA model — architecture details, training data, inference speed, hardware compatibility, and real-robot performance — so you can choose the right model for your project.
Updated April 16, 2026 · SVRC Research Team · 12 min read
What Are VLA Models?
A Vision-Language-Action (VLA) model is an end-to-end neural network that takes camera images (vision) and a natural language task instruction (language) as input and outputs robot motor commands (action) — joint positions, end-effector velocities, or torques that a robot can directly execute.
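In code terms, a VLA is a function from (image, instruction) to an action vector. A minimal sketch of that contract, with a hypothetical `vla_policy_step` stub standing in for a real model (the 7-dimensional action space is an illustrative assumption, not any specific model's output):

```python
import numpy as np

def vla_policy_step(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stub illustrating the VLA contract: one RGB frame
    plus a language instruction in, one motor command vector out.
    A real model replaces this body with encoder + backbone + action head."""
    assert image.ndim == 3 and image.shape[2] == 3   # (H, W, 3) RGB frame
    assert isinstance(instruction, str)
    # Example action space: 6 end-effector deltas (xyz + rpy) + 1 gripper command
    return np.zeros(7, dtype=np.float32)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
action = vla_policy_step(frame, "pick up the red cup")
```

Everything downstream in this comparison is about what goes inside that function body and how fast it runs.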
The key insight behind VLAs is architectural unification. Before VLAs, robot learning systems were modular pipelines: a separate vision model detected objects, a separate planner generated a strategy, and a separate controller produced motor commands. Each module was trained independently, and errors cascaded between modules. VLAs replace this pipeline with a single model trained end-to-end on demonstrations, eliminating the interface engineering and error propagation.
The "language" component is what makes VLAs qualitatively different from earlier robot learning approaches like ACT or Diffusion Policy. By conditioning on natural language instructions ("pick up the red cup," "fold the towel in half"), a single VLA can handle an open-ended set of tasks rather than being limited to the specific task it was trained on. The language grounding comes from the pre-trained LLM backbone, which transfers its world knowledge to the physical domain.
Architecture Comparison
| Model | Vision Encoder | LLM Backbone | Action Head | Action Space | Context Length |
|---|---|---|---|---|---|
| Octo | Custom ViT | Transformer (93M total) | Diffusion | Continuous (EEF delta) | 2 frames |
| OpenVLA | SigLIP (400M) | LLaMA 2 (7B) | Discrete tokenization | 256 bins per dim | 2048 tokens |
| Pi0 | PaliGemma-based (est.) | Custom (~3B total) | Flow Matching | Continuous (joint + EEF) | Not disclosed |

| GR00T N1 | Eagle (NVIDIA) | Custom (~2B total) | Dual (diffusion + MLP) | Continuous (whole-body) | Not disclosed |
| SmolVLA | SigLIP (400M) | SmolLM2 (135M) | Diffusion | Continuous (joint pos) | 1024 tokens |
| OpenPI | SigLIP (400M) | LLaMA-based (~3B) | Flow Matching | Continuous (joint + EEF) | 2048 tokens |
Model Deep Dives
Octo (UC Berkeley)
Octo is the workhorse of the open-source VLA community. Built at the Berkeley Robot Learning Lab (Ghosh et al., 2024), Octo was designed with a clear engineering philosophy: be small enough to fine-tune on a single GPU, fast enough for real-time control, and general enough to work across embodiments.
Architecture: A compact transformer that processes tokenized image observations and language instructions through shared attention layers, then generates actions through a diffusion head. The diffusion head predicts a 16-step action chunk (0.32 seconds at 50 Hz), providing temporal smoothness without requiring a separate action chunking mechanism.
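The chunking described above amounts to a receding-horizon loop: run one forward pass, execute 16 actions, repeat. A sketch of that bookkeeping, where `predict_chunk` is a hypothetical stand-in for Octo's diffusion head:

```python
import numpy as np

CONTROL_HZ = 50
CHUNK_LEN = 16                # 16 steps * 0.02 s = 0.32 s per chunk

def predict_chunk(obs: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the diffusion head: one forward pass
    yields a whole (CHUNK_LEN, action_dim) action chunk."""
    return np.zeros((CHUNK_LEN, 7), dtype=np.float32)

def run(n_control_steps: int) -> int:
    """Drive the robot at 50 Hz, re-running the policy only when the
    current chunk is used up. Returns the number of policy calls."""
    policy_calls, idx, chunk = 0, CHUNK_LEN, None
    for _ in range(n_control_steps):
        if idx >= CHUNK_LEN:                       # chunk exhausted
            chunk = predict_chunk(np.zeros((256, 256, 3)))
            idx, policy_calls = 0, policy_calls + 1
        _action = chunk[idx]                       # sent to the robot every 0.02 s
        idx += 1
    return policy_calls

calls_per_second = run(CONTROL_HZ)   # 50 control steps need ceil(50 / 16) = 4 chunks
```

This is why chunked models tolerate slower inference: the policy only needs to keep up with the chunk rate (~3 Hz here), not the 50 Hz control rate.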
Training data: 800K episodes from the Open X-Embodiment dataset, heavily weighted toward the Bridge V2 dataset (WidowX robot) and RT-1 data (Google's Everyday Robots). This means Octo works best out-of-the-box on tabletop manipulation with WidowX-class arms.
Fine-tuning: Octo fine-tunes in 20-30 minutes on a single A100 with 50-100 demonstrations. This fast turnaround makes it ideal for rapid iteration: collect 50 demos in the morning, fine-tune over lunch, evaluate in the afternoon. At SVRC, we use Octo as the default baseline for pilot data collection projects because it provides the fastest feedback on data quality.
Limitations: The 93M parameter budget limits language understanding and visual reasoning. Octo struggles with tasks that require spatial reasoning ("put the cup to the left of the plate") or novel object categories not well-represented in Bridge V2. For these tasks, OpenVLA's larger LLM backbone performs significantly better.
Best for: Teams that need a working robot policy fast, have limited compute, or are iterating on data collection quality before committing to a larger model.
OpenVLA (Stanford / Berkeley)
OpenVLA (Kim et al., 2024) is the most capable open-source VLA, built on the insight that a 7B-parameter LLM backbone provides enough capacity for genuine language-conditioned task generalization. It was the first open model to demonstrate zero-shot transfer to novel task descriptions on unseen objects.
Architecture: SigLIP vision encoder (400M params) feeds visual tokens into a fine-tuned LLaMA 2 (7B params). Actions are tokenized into 256 discrete bins per dimension and predicted as next tokens — the same mechanism LLMs use for text generation. This discrete tokenization is less expressive than continuous diffusion heads but enables the model to leverage the full autoregressive generation machinery of the LLM.
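The 256-bin scheme is easy to sketch. Assuming actions are already normalized to [-1, 1] (real pipelines normalize using per-dataset action statistics first), the round-trip below shows the quantization error the discrete head accepts:

```python
import numpy as np

N_BINS = 256

def tokenize(action, low=-1.0, high=1.0):
    """Map each action dimension (assumed pre-normalized to [low, high])
    to one of 256 discrete bins, so it can be predicted as a next token."""
    norm = (np.clip(action, low, high) - low) / (high - low)        # -> [0, 1]
    return np.minimum((norm * N_BINS).astype(np.int64), N_BINS - 1)

def detokenize(tokens, low=-1.0, high=1.0):
    """Recover a continuous action from bin centers."""
    return low + (tokens + 0.5) / N_BINS * (high - low)

a = np.array([-0.73, 0.0, 0.42])
round_trip = detokenize(tokenize(a))
# worst-case round-trip error is half a bin width: (high - low) / 256 / 2 ~ 0.004
```

For most manipulation tasks that ~0.4% resolution is below the robot's own repeatability, which is why discrete tokenization works as well as it does.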
Training data: 970K episodes from the Open X-Embodiment dataset (22 robot embodiments). The broader embodiment coverage compared to Octo gives OpenVLA better cross-robot generalization.
Fine-tuning: Requires 8xA100 (or equivalent) for efficient fine-tuning due to the 7B parameter size. LoRA fine-tuning reduces this to 1-2 A100s at the cost of slightly lower performance. 100-200 demonstrations recommended for a new task. Fine-tuning takes 4-8 hours.
Inference: ~5 Hz on A100, ~2 Hz on Jetson AGX Orin. Requires action chunking (predict 10-20 steps, execute at 50 Hz) for real-time deployment. The 200-500ms per-inference latency means the robot operates on 0.2-0.4 second plans — acceptable for most manipulation tasks but problematic for reactive tasks like catching or insertion.
Best for: Language-conditioned multi-task systems where you want a single model to handle diverse task instructions. Research teams with adequate GPU resources.
Pi0 (Physical Intelligence)
Pi0 is the proprietary state-of-the-art from Physical Intelligence, the company founded by Sergey Levine, Chelsea Finn, and other Berkeley/Stanford alumni. Pi0's distinguishing technical contribution is the flow matching action head.
Flow matching vs. diffusion: Both are generative models that produce action trajectories, but flow matching uses straight-line interpolation paths (optimal transport) rather than the curved noise-to-signal paths of diffusion. In practice, flow matching converges in fewer denoising steps (4-8 vs. 16-32 for diffusion), enabling faster inference without quality loss.
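The few-step advantage is easiest to see with the ideal velocity field for a single target: along straight paths, even coarse Euler integration lands exactly on the target. The toy below integrates that exact field (which a trained model only approximates); it is an illustration of the sampling mechanism, not Pi0 code:

```python
import numpy as np

def flow_match_sample(x0, target, n_steps=8):
    """Euler-integrate the straight-line flow-matching velocity field
    v(x, t) = (target - x) / (1 - t), which carries any start point to
    `target` by t = 1. With straight paths, 4-8 steps suffice."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = (target - x) / (1.0 - t)    # straight-line velocity toward target
        x = x + dt * v                  # Euler step
    return x

noise = np.array([2.0, -1.0, 0.5])      # initial sample ("noise")
action = np.array([0.1, 0.3, -0.2])     # target action
sampled = flow_match_sample(noise, action, n_steps=8)
```

Diffusion's curved noise-to-signal paths accumulate integration error at coarse step counts, which is why it typically needs 16-32 steps for comparable quality.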
Performance: Pi0 demonstrations show cross-task generalization that no open-source model has matched: a single checkpoint folding laundry, clearing tables, packing boxes, and operating kitchen appliances. The gap between Pi0 and open models is estimated at 15-25 percentage points of success rate on multi-task benchmarks, attributable to both the larger proprietary dataset and the flow matching architecture.
Access: Pi0 is not publicly available. Physical Intelligence offers API access to select partners. The community reimplementation OpenPI (see below) provides an approximation using the published architecture details.
Best for: Teams with PI partnership agreements that need state-of-the-art performance and are willing to accept vendor lock-in.
GR00T N1 (NVIDIA)
GR00T N1 is NVIDIA's humanoid-focused foundation model, designed for the Isaac Lab ecosystem and optimized for Jetson Thor edge deployment.
Dual-system architecture: GR00T N1 uses two interconnected models: a "slow" VLA backbone (2-5 Hz) that interprets visual scenes and language instructions to produce high-level action plans, and a "fast" policy (200+ Hz) that converts plans to motor commands with reactive feedback. This mirrors the biological distinction between deliberate planning and reflexive motor control.
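The two-rate scheduling can be sketched as a counting loop. The 5 Hz / 200 Hz rates below are assumed values picked from the ranges above, and both "models" are placeholders:

```python
SLOW_HZ, FAST_HZ = 5, 200       # assumed rates, taken from the ranges above

def run_dual_system(duration_s: float) -> tuple[int, int]:
    """Count invocations in a two-rate loop: the fast policy runs every
    tick, while the slow VLA backbone is re-queried every
    FAST_HZ // SLOW_HZ ticks and its latest plan is reused in between."""
    slow_calls = fast_calls = 0
    ticks_per_plan = FAST_HZ // SLOW_HZ          # 40 fast ticks per plan
    plan = None
    for tick in range(round(duration_s * FAST_HZ)):
        if tick % ticks_per_plan == 0:
            plan = ("plan", tick)                # slow: scene + language -> plan
            slow_calls += 1
        _motor_cmd = (plan, tick)                # fast: latest plan -> motor command
        fast_calls += 1
    return slow_calls, fast_calls
```

The point of the design is visible in the counts: the expensive VLA runs 40x less often than the motor loop, so reactive feedback never waits on a slow forward pass.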
Sim-first training: Pre-trained on 1M+ episodes in Isaac Lab (physics simulation with domain randomization), then fine-tuned with 50K-100K real humanoid episodes. This sim-first approach works well for locomotion and whole-body balance but still requires significant real data for manipulation tasks.
Hardware ecosystem: Optimized for NVIDIA Jetson Thor (the humanoid-specific compute platform). Also runs on Jetson AGX Orin with reduced performance. GR00T SDK provides standardized interfaces for Figure, Agility Robotics, Apptronik, and 1X humanoids.
Best for: Humanoid robot projects, especially those already in the NVIDIA ecosystem. Not well-suited for tabletop manipulation arms.
SmolVLA (Hugging Face)
SmolVLA is Hugging Face's bet that VLA models can be made small enough for consumer hardware without sacrificing useful generalization. At roughly half a billion parameters, it is the most accessible VLA for researchers and hobbyists.
Architecture: SigLIP vision encoder (400M) paired with SmolLM2 (135M parameter language model) and a diffusion action head. The small language model limits complex reasoning but still provides meaningful language conditioning for common task descriptions.
LeRobot integration: SmolVLA is a first-class citizen in the LeRobot framework. Training, evaluation, and deployment scripts are maintained as part of the LeRobot codebase. Dataset loading from Hugging Face Hub is native. This tight integration means you can go from data collection to deployed policy with a single toolchain.
Performance: On standard benchmarks (SIMPLER, Bridge V2 eval), SmolVLA achieves 60-75% of OpenVLA's success rate at 1/14th the parameter count. For single-task fine-tuning, the gap narrows to 5-10% with sufficient demonstrations (200+).
Best for: LeRobot users, hobbyists with consumer GPUs, education, and teams that need to run inference on Jetson Orin NX ($499) rather than Jetson AGX Orin ($1,999).
OpenPI (Community)
OpenPI is a community reimplementation of Physical Intelligence's Pi0 architecture, based on the published paper and reverse-engineered design choices. It brings the flow matching action head that distinguishes Pi0 from other VLAs into an open-source package.
Architecture: SigLIP vision encoder with a LLaMA-based backbone (~3B parameters total) and flow matching action head. The action head uses 8-step flow matching for inference, achieving smooth continuous action predictions.
Training: Community-trained on a combination of Open X-Embodiment, DROID, and community-contributed datasets. Total training data is smaller than the original Pi0 dataset, which limits generalization.
Performance: Achieves approximately 70-80% of reported Pi0 performance on comparable benchmarks. The flow matching head provides notably smoother action trajectories than diffusion-based alternatives (Octo, SmolVLA), which some users prefer for contact-rich tasks.
Best for: Researchers interested in flow matching architectures who want open-source access and the ability to inspect and modify the full model.
Hardware Compatibility
| Model | WidowX / ALOHA | OpenArm 101 | Franka | Unitree G1 | Koch / SO-100 |
|---|---|---|---|---|---|
| Octo | Native | Fine-tune | Native | Fine-tune | Fine-tune |
| OpenVLA | Native | Fine-tune | Native | Fine-tune | Fine-tune |
| Pi0 | Supported | Via PI API | Supported | Via PI API | Not supported |
| GR00T N1 | Not targeted | Not targeted | Not targeted | Native | Not targeted |
| SmolVLA | Native | Native (LeRobot) | Fine-tune | Fine-tune | Native (LeRobot) |
| OpenPI | Community | Fine-tune | Community | Fine-tune | Community |
"Native" = pre-trained on data from this embodiment; works zero-shot or with minimal fine-tuning. "Fine-tune" = requires 50-200 demonstrations on this specific robot. "Community" = community-maintained integration; may require manual setup.
Training Your Own VLA
Most teams should fine-tune an existing VLA rather than train from scratch. Here is the practical decision tree:
Fine-tuning (recommended for most teams)
- Choose your base model: Octo for speed, OpenVLA for capability, SmolVLA for LeRobot integration
- Collect 50-200 demonstrations of your target task on your specific robot platform (see data collection guide)
- Format data: Convert to the model's expected format (LeRobot dataset format for SmolVLA/Octo, RLDS for OpenVLA)
- Fine-tune: Octo: 30 min on 1 GPU. SmolVLA: 1-2 hours on 1 GPU. OpenVLA: 4-8 hours on 8 GPUs (or 1-2 GPUs with LoRA)
- Evaluate: Run 50 evaluation episodes, measure success rate, analyze failure modes
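On the evaluation step: 50 episodes is enough to rank policies, but it leaves wide error bars on the measured success rate. A Wilson score interval (a standard statistics tool, not part of any VLA toolchain) makes that uncertainty concrete:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(35, 50)   # measured 70% success over 50 episodes
# the true rate plausibly lies anywhere in roughly the 56-81% range
```

In practice this means a 5-10 point difference between two policies over 50 episodes is noise; treat it as a tie and look at failure modes instead.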
Training from scratch (for specialized architectures)
Training from scratch requires 100K+ demonstrations and significant compute (64-256 GPU-hours for Octo-scale, 1,000+ GPU-hours for OpenVLA-scale). This is appropriate only if you have a novel architecture, proprietary large-scale data, or requirements that existing models cannot meet.
Inference Speed Benchmarks
| Model | A100 (80GB) | Jetson AGX Orin | Jetson Orin NX | RTX 4090 | Action Chunking Needed? |
|---|---|---|---|---|---|
| Octo | 40+ Hz | 15-20 Hz | 8-12 Hz | 30+ Hz | No (built-in) |
| OpenVLA | 5-8 Hz | 1.5-3 Hz | OOM | 4-6 Hz | Yes (10-20 steps) |
| Pi0 | 10-15 Hz | 5-8 Hz | Not tested | 8-12 Hz | Optional |
| SmolVLA | 25+ Hz | 10-15 Hz | 5-8 Hz | 20+ Hz | No (built-in) |
| OpenPI | 8-12 Hz | 3-5 Hz | OOM | 6-10 Hz | Optional |
Benchmarks measured with FP16 precision, batch size 1, single camera input. Real-world performance varies with camera resolution and number of views.
When to Use Which VLA
Use this decision tree to choose your model:
- Need results this week with minimal compute? → Octo. Fine-tune in 30 minutes, deploy same day.
- Need language-conditioned multi-task with strong generalization? → OpenVLA. Best open-source language grounding.
- Working with LeRobot and want native ecosystem support? → SmolVLA. Tightest LeRobot integration.
- Building a humanoid robot application? → GR00T N1. Purpose-built for whole-body humanoid control.
- Want the smoothest action trajectories (contact-rich tasks)? → OpenPI. Flow matching produces smoother motions than diffusion.
- Need maximum performance and have a PI partnership? → Pi0. State-of-the-art, but proprietary.
- Have a very specific single task and don't need language conditioning? → Consider ACT or Diffusion Policy instead of a VLA. They are simpler, faster to train, and often achieve higher single-task success rates.
Collecting VLA-Compatible Training Data
VLA training data has specific requirements that differ from data for simpler policies:
Camera Requirements
- Resolution: Minimum 640x480, recommended 1280x720. VLA vision encoders downsample to 224x224 or 336x336 internally, but higher source resolution preserves detail during downsampling.
- Frame rate: 30 fps for camera frames, synchronized with 50 Hz joint state recording. The mismatch is handled by the data loader (nearest-neighbor timestamp matching).
- Views: Minimum 1 wrist camera + 1 external camera. Most VLAs accept 2-4 views. Additional views improve performance but increase storage and inference cost.
- Calibration: Camera intrinsics and extrinsics should be recorded but are not required by most VLA architectures (they learn visual features directly from pixels).
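The 30 fps / 50 Hz mismatch noted above reduces to nearest-neighbor matching of timestamps in the data loader. A sketch with synthetic timestamps:

```python
import numpy as np

def nearest_frame_indices(frame_ts: np.ndarray, joint_ts: np.ndarray) -> np.ndarray:
    """For each joint-state timestamp, return the index of the closest
    camera frame. frame_ts must be sorted ascending."""
    # searchsorted gives the insertion point; compare the neighbors on each side
    right = np.clip(np.searchsorted(frame_ts, joint_ts), 1, len(frame_ts) - 1)
    left = right - 1
    use_right = (frame_ts[right] - joint_ts) < (joint_ts - frame_ts[left])
    return np.where(use_right, right, left)

frame_ts = np.arange(30) / 30.0    # 30 fps camera timestamps over one second
joint_ts = np.arange(50) / 50.0    # 50 Hz joint-state timestamps
idx = nearest_frame_indices(frame_ts, joint_ts)
# each joint sample pairs with a frame at most half a frame period (1/60 s) away
```

At these rates the worst-case image-to-state offset is about 13 ms, well within what VLA training tolerates.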
Action Format
- Joint positions: Record absolute joint positions at 50 Hz. This is the most universal format — all VLAs can consume it.
- End-effector pose: Also record EEF position + quaternion orientation if available. Some models (Octo, Pi0) can use EEF-space actions directly.
- Gripper state: Binary (open/closed) or continuous gripper position. Record at the same 50 Hz frequency as joint data.
Language Annotations
- Every episode needs a natural language task description: "pick up the red cup and place it on the saucer"
- Use specific, descriptive language. "Do the task" is useless; "pick up the leftmost object" is useful
- Vary language across episodes for the same task to improve language generalization: "grab the mug," "pick up the coffee cup," "take the cup from the table"
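One simple way to enforce that variety at recording time is to assign each episode a phrasing from a per-task list. The variants below are the examples from this section; the round-robin assignment is an illustrative choice (random sampling works equally well):

```python
# Phrasing variants for one task; assigning a different variant per
# episode improves language generalization.
CUP_TASK_VARIANTS = [
    "grab the mug",
    "pick up the coffee cup",
    "take the cup from the table",
]

def instruction_for_episode(variants: list[str], episode_id: int) -> str:
    """Cycle through phrasings so each appears equally often."""
    return variants[episode_id % len(variants)]

labels = [instruction_for_episode(CUP_TASK_VARIANTS, i) for i in range(30)]
```

Storing the assignment function alongside the dataset also keeps annotations reproducible if episodes are re-exported later.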
Output Formats
- HDF5: Standard for ACT, Diffusion Policy, and direct Octo consumption
- RLDS (TensorFlow Datasets): Required for OpenVLA pre-training data format
- LeRobot format: Parquet + video files on Hugging Face Hub. Native for SmolVLA
SVRC's data collection service outputs in all three formats by default. Every dataset delivered includes HDF5, RLDS, and LeRobot-compatible exports, so you can train on any VLA without reformatting.