What Is Imitation Learning?

Imitation learning trains a robot policy by showing it examples of desired behavior. Instead of manually programming trajectories or designing reward functions for reinforcement learning, you demonstrate the task — and the robot learns to replicate your behavior from the demonstration data.

The concept is straightforward: collect a dataset of (observation, action) pairs from human demonstrations, then train a neural network to predict the action given the observation. The trained network becomes the robot's policy, running in real time to control the robot autonomously.

In practice, imitation learning is the most reliable method for getting a robot to perform a new manipulation task in 2026. Reinforcement learning requires millions of environment interactions and carefully shaped reward functions. Classical motion planning requires explicit geometric models of every object. Imitation learning requires only demonstrations — and with modern architectures, surprisingly few of them.

Brief History: From Behavior Cloning to Modern Approaches

Behavior cloning (BC) is the simplest form of imitation learning: supervised learning that directly maps observations to actions. Train a neural network on (image, joint_position) → (next_action) pairs using standard regression loss. BC works well when the demonstration data is consistent and covers the distribution of states the robot will encounter during deployment. It fails when the robot drifts off the demonstrated trajectory — a problem called compounding error or distribution shift.
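Concretely, BC is ordinary supervised regression. A minimal numpy sketch, where a linear policy and a synthetic dataset stand in for the neural network and real demonstrations (all dimensions are illustrative):

```python
import numpy as np

# Toy demonstration dataset: 200 (observation, action) pairs.  A linear policy
# and synthetic data stand in for the neural network and real demonstrations.
rng = np.random.default_rng(0)
true_W = rng.normal(size=(8, 7))                  # the "expert" mapping
obs = rng.normal(size=(200, 8))                   # 8-D observations
actions = obs @ true_W + 0.01 * rng.normal(size=(200, 7))  # 7-D expert actions

# BC with a linear policy and MSE loss reduces to least-squares regression:
# find W minimizing ||obs @ W - actions||^2.
W, *_ = np.linalg.lstsq(obs, actions, rcond=None)

def policy(o):
    """The trained 'policy': predict an action from an observation."""
    return o @ W

pred = policy(obs[:1])   # predicted 7-D action for a single observation
```

With a deep network the closed-form solve becomes gradient descent on the same MSE objective, but the failure mode described above is identical: the regression is only valid on states resembling the training distribution.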

DAgger (Dataset Aggregation) addresses distribution shift by iteratively collecting new demonstrations at states the robot actually visits (not just the states the human demonstrator visited). After training an initial BC policy, you deploy it, let it drift, then have the human correct the resulting states. These corrections are added to the training set. DAgger is theoretically elegant but operationally expensive — it requires an expert available during every training iteration.

GAIL (Generative Adversarial Imitation Learning) combines imitation learning with adversarial training: a discriminator learns to distinguish robot behavior from human behavior, and the policy is trained to fool the discriminator. GAIL handles multi-modal demonstrations better than BC but requires online environment interaction (typically in simulation) and is notoriously difficult to tune. It sees limited use in real-robot settings.

The three approaches that dominate in 2026 — ACT, Diffusion Policy, and VLA models — all build on the behavior cloning paradigm but add architectural innovations that address its core limitations.

The Three Main Approaches in 2026

Action Chunking Transformer (ACT)

ACT, introduced by Tony Zhao et al. at Stanford in 2023, is the workhorse of imitation learning for tabletop manipulation. It predicts a chunk (sequence) of future actions rather than a single next action, which directly addresses the compounding error problem in behavior cloning.

How It Works

ACT uses a CVAE (Conditional Variational Autoencoder) architecture with a Transformer backbone. During training, the CVAE encoder compresses the demonstrated action sequence (together with the joint positions) into a latent "style" variable; at test time this latent is simply set to the prior mean. The Transformer decoder, conditioned on the current observation (camera images + joint positions) and the latent, generates all k future actions in a single non-autoregressive forward pass (typically k=100, representing 2 seconds at 50 Hz control).

The key insight is temporal smoothing through chunking: by predicting 100 future actions at once, the model must produce a coherent trajectory, not just the immediate next action. This acts as an implicit regularizer that produces smooth, consistent motions even when individual demonstration trajectories vary.

During deployment, ACT uses temporal ensembling: at each timestep, it generates a new 100-step action chunk, but the executed action is a weighted average of the current chunk and the previous chunks' predicted actions for this timestep. This further smooths the trajectory and reduces jitter.
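Temporal ensembling can be sketched in a few lines. The weighting below follows the convention in the public ACT codebase (exponential weights with the oldest contributing chunk weighted highest); the chunk contents, dimensions, and the temperature M are illustrative:

```python
import numpy as np

CHUNK = 100   # actions per predicted chunk (k in the ACT paper)
ACT_DIM = 7
M = 0.05      # ensembling temperature; the ACT codebase default is 0.01

all_preds = {}  # all_preds[s] = chunk predicted at timestep s, shape (CHUNK, ACT_DIM)

def ensembled_action(t):
    """Average every chunk's prediction for timestep t, weighted exp(-M * i)
    with i = 0 for the oldest contributing chunk (the ACT convention)."""
    starts = sorted(s for s in all_preds if 0 <= t - s < CHUNK)
    preds = np.stack([all_preds[s][t - s] for s in starts])
    w = np.exp(-M * np.arange(len(starts)))
    w /= w.sum()
    return (preds * w[:, None]).sum(axis=0)

# Toy rollout: one new chunk per timestep; the chunk predicted at s is all-s.
for s in range(5):
    all_preds[s] = np.full((CHUNK, ACT_DIM), float(s))
action = ensembled_action(4)  # blends the five overlapping chunks
```

Larger M shifts weight toward the most recent prediction, trading smoothness for responsiveness, which is why the ensembling weight is worth tuning per task.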

When to Use ACT

  • Tabletop manipulation with 1-2 robot arms (the original ALOHA use case)
  • Tasks with relatively deterministic strategies (pick-place, insertion, assembly)
  • When you have 50-200 demonstrations and need a fast training pipeline
  • When you need to deploy on modest hardware (runs on a single GPU, ~15 ms inference at 50 Hz)

Data Requirements

ACT is remarkably data-efficient for simple tasks. Published results:

  • Simple pick-and-place: 50 demonstrations → 85%+ success
  • Bimanual coordination (transfer between hands): 100 demonstrations → 80%+ success
  • Precise insertion: 200 demonstrations → 70%+ success
  • Complex multi-step tasks: 400-800 demonstrations needed for reliable performance

Quality matters more than quantity. 50 clean, consistent demonstrations often outperform 500 noisy ones.

Diffusion Policy

Diffusion Policy, introduced by Cheng Chi et al. at Columbia in 2023, applies the denoising diffusion framework (the same approach behind image generation models like Stable Diffusion) to robot action prediction.

How It Works

Instead of directly predicting actions, Diffusion Policy learns to iteratively denoise a random noise vector into a clean action trajectory. During training, it takes a clean action trajectory from a demonstration, adds varying levels of Gaussian noise, and trains a network to predict the noise. During inference, it starts with pure noise and iteratively denoises it over K steps (typically K=10-50) to produce a clean action trajectory.
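The forward (noising) half of this training procedure is simple enough to sketch directly. The schedule values and trajectory shapes below are illustrative, and the denoiser network itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50                               # denoising steps (K in the text)
betas = np.linspace(1e-4, 0.02, K)   # DDPM noise schedule (variance added per step)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal fraction after t steps

def noising_target(clean_traj, t):
    """Forward process for one training example: corrupt a clean action
    trajectory to noise level t and return (x_t, eps).  The denoiser network
    (not shown) is trained with MSE between its prediction and eps."""
    eps = rng.normal(size=clean_traj.shape)
    x_t = np.sqrt(alpha_bar[t]) * clean_traj + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

clean = rng.normal(size=(16, 7))     # a 16-step, 7-DoF action trajectory
x_t, eps = noising_target(clean, t=K - 1)
# At inference the trained denoiser runs this in reverse: start from pure
# Gaussian noise and iteratively remove the predicted noise over K steps.
```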

The critical advantage is multi-modality: diffusion models naturally represent multiple valid solutions. If there are two equally good ways to perform a task (approach from the left vs. the right), diffusion policy can represent both modes without averaging them into a single (incorrect) middle trajectory. Behavior cloning with MSE loss averages modes, which is catastrophic for bimodal tasks.

When to Use Diffusion Policy

  • Tasks with multiple valid strategies (the data contains different approaches to the same goal)
  • Contact-rich tasks where precise force trajectories matter
  • Tasks requiring high trajectory smoothness (pouring, drawing, surface wiping)
  • When you have a GPU with enough compute for iterative denoising (30-80 ms inference)

Data Requirements

Diffusion Policy generally needs slightly more data than ACT but handles diverse data better:

  • Simple tasks: 100-200 demonstrations
  • Multi-modal tasks: 200-500 demonstrations (needs enough examples of each mode)
  • Complex manipulation: 500-1000 demonstrations

Key difference from ACT: Diffusion Policy is less sensitive to demonstration inconsistency because it can represent multimodal distributions. This makes it more forgiving of data collected by multiple operators with slightly different strategies.

Vision-Language-Action (VLA) Models

VLA models are the frontier of robot learning in 2026. They combine a pre-trained vision-language model (like PaLI, SigLIP, or LLaMA-based architectures) with action prediction, enabling robots to follow natural language instructions and generalize across tasks.

Key Models

  • OpenVLA (Stanford/Berkeley, 2024): Open-source 7B parameter VLA built on LLaMA 2 with a fused SigLIP + DINOv2 visual encoder. Trained on Open X-Embodiment data (970K episodes, 22 robot types). Fine-tunes to new robots with 100-500 demonstrations. Runs at ~5 Hz on a single A100 GPU.
  • pi0 (Physical Intelligence, 2024): Proprietary VLA model trained on cross-embodiment data including dexterous hands. Demonstrated zero-shot transfer to tasks not seen during training. Currently available only through Physical Intelligence's cloud API.
  • RT-2 (Google DeepMind, 2023): 55B parameter VLA based on PaLI-X (a smaller PaLM-E variant also exists). Achieved state-of-the-art generalization but requires massive compute (8 TPU v5e chips for inference). Not publicly available.
  • Octo (Berkeley, 2024): Lightweight (93M parameter) generalist policy trained on 800K+ episodes from Open X-Embodiment. Designed for fast fine-tuning on new robots with 50-200 demonstrations. Open-source and runs on a single consumer GPU.

When to Use VLA Models

  • When you need language-conditioned task execution ("pick up the red cup")
  • When you want to leverage pre-trained representations to reduce data requirements
  • When you need multi-task capability from a single model
  • When you have access to sufficient compute for fine-tuning (1-8 GPUs for OpenVLA/Octo) and inference

Data Requirements

VLA models benefit enormously from pre-training on large-scale cross-embodiment data. For fine-tuning to your specific robot and task:

  • OpenVLA fine-tuning: 100-500 demonstrations for a single task on a new robot
  • Octo fine-tuning: 50-200 demonstrations (smaller model adapts faster)
  • From-scratch VLA training: 10,000-100,000+ demonstrations (not practical for individual labs)

Comparison at a Glance

Approach             Inference Speed   Data Needed          Multi-Modal?   Language-Conditioned?   Compute (Inference)   Best For
ACT                  50+ Hz            50-200               Limited        No                      1 consumer GPU        Single-task, fast iteration
Diffusion Policy     10-30 Hz          100-500              Yes            No                      1 consumer GPU        Multi-modal tasks, contact-rich
VLA (OpenVLA/Octo)   3-10 Hz           50-500 (fine-tune)   Yes            Yes                     1 A100 or 4070+       Multi-task, language-conditioned

Training Pipeline Overview

Regardless of which approach you choose, the training pipeline follows the same high-level stages. The details of each stage matter enormously for final policy performance.

1. Data Collection

Collect human demonstrations via teleoperation. Use leader-follower for the highest quality data. Record at 50 Hz for joint data and 30 fps for cameras. Store in HDF5 format. See our data collection guide for the full protocol.
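A minimal sketch of writing and reading one such episode with h5py. The group names (`observations/qpos`, `observations/images/top`, `action`) are hypothetical, chosen to resemble common ALOHA-style layouts rather than any fixed schema; match whatever your data-collection stack actually writes:

```python
import h5py
import numpy as np

# Hypothetical episode layout: one HDF5 file per episode, synchronized
# at recording time.  Group/dataset names here are illustrative.
def write_episode(path, qpos, images):
    with h5py.File(path, "w") as f:
        f.create_dataset("observations/qpos", data=qpos)          # 50 Hz joints
        f.create_dataset("observations/images/top", data=images,
                         compression="gzip")                      # 30 fps camera
        f.create_dataset("action", data=qpos[1:])                 # next-step targets

def read_episode(path):
    with h5py.File(path, "r") as f:
        return f["observations/qpos"][:], f["action"][:]

write_episode("episode_000.hdf5",
              qpos=np.zeros((250, 7), dtype=np.float32),           # 5 s at 50 Hz
              images=np.zeros((150, 64, 64, 3), dtype=np.uint8))   # 5 s at 30 fps
qpos, action = read_episode("episode_000.hdf5")
```

Storing images with chunked gzip compression keeps episode files manageable without blocking random access during training.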

2. Data Preprocessing

Convert raw recordings to training-ready format. Key steps:

  • Image resizing: Resize camera frames to the model's expected input resolution (typically 224x224 or 256x256 for ViT-based encoders). Use anti-aliased resizing to preserve detail.
  • Action normalization: Normalize actions to [-1, 1] or [0, 1] range. Compute statistics (mean, std) from the training set. Apply the same normalization at deployment.
  • Episode filtering: Remove failed episodes, episodes with anomalous lengths (>3 standard deviations from mean), and episodes flagged during quality review.
  • Train/validation split: Hold out 10-15% of episodes for validation. Split by episode, not by frame — never put frames from the same episode in both train and validation sets.
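The last two steps interact: the episode-level split must happen before normalization statistics are computed, or validation data leaks into the statistics. A sketch with synthetic episodes:

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 toy episodes of raw 7-DoF actions with arbitrary scale and offset.
episodes = [rng.normal(loc=3.0, scale=2.0, size=(rng.integers(80, 121), 7))
            for _ in range(50)]

# Split by EPISODE first: frames from one episode must never land in both sets.
order = rng.permutation(len(episodes))
n_val = int(0.15 * len(episodes))
val_eps = [episodes[i] for i in order[:n_val]]
train_eps = [episodes[i] for i in order[n_val:]]

# Normalization statistics come from the training episodes only; the identical
# mean/std must be applied again at deployment.
train_actions = np.concatenate(train_eps)
mean, std = train_actions.mean(axis=0), train_actions.std(axis=0)

def normalize(a):
    return (a - mean) / (std + 1e-8)
```

Save `mean` and `std` alongside the model checkpoint; a normalization mismatch at deployment is one of the failure modes covered below.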

3. Training

Train the policy on your dataset. Typical training configurations:

  • ACT: 2,000-5,000 epochs on a dataset of 100-200 episodes. Learning rate 1e-5 with cosine schedule. Batch size 8-16. Training time: 2-8 hours on a single RTX 4090.
  • Diffusion Policy: 500-2,000 epochs on 200-500 episodes. Learning rate 1e-4 with cosine schedule. DDPM with 100 diffusion steps for training, DDIM with 10-50 steps for inference. Training time: 4-16 hours on a single RTX 4090.
  • VLA fine-tuning: 20-100 epochs on 100-500 episodes. LoRA or full fine-tuning depending on model size. Learning rate 2e-5. Training time: 4-24 hours on 1-4 A100 GPUs.
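All three configurations use a cosine learning-rate schedule, which is easy to implement directly if your framework's built-in scheduler doesn't fit your loop. The epoch and batch counts below are illustrative:

```python
import numpy as np

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Cosine learning-rate schedule: decay lr_max -> lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + np.cos(np.pi * progress))

# e.g. 2,000 epochs at (hypothetically) 100 batches per epoch:
total = 2000 * 100
start, end = cosine_lr(0, total), cosine_lr(total, total)  # 1e-4 down to 1e-6
```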

4. Evaluation

Validation loss is a weak predictor of real-world performance. The only reliable evaluation is deploying the policy on the real robot and measuring task success rate. Run at least 20 evaluation trials to get a statistically meaningful success rate (95% CI is approximately +/-20% with 20 trials). Record evaluation episodes for failure analysis.
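The +/-20% figure comes from the normal approximation to the binomial confidence interval, which is worth computing for every evaluation run:

```python
import math

def success_ci(successes, trials, z=1.96):
    """Normal-approximation 95% confidence interval for a task success rate."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = success_ci(10, 20)   # 10/20 successes: roughly 0.28 to 0.72
```

The normal approximation is crude near 0% or 100% success; a Wilson interval behaves better there, but the takeaway is the same: with 20 trials, only large differences between policies are statistically meaningful.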

5. Deployment

Export the trained model to ONNX or TorchScript for deployment. Run inference on the robot's onboard GPU (NVIDIA Jetson Orin for embedded deployments) or a networked workstation. Monitor inference latency — if the policy cannot run at the robot's control frequency, use action chunking (predict multiple future actions per inference call) to reduce the required inference rate.
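The chunking trick amounts to refilling an action buffer every k control ticks. A sketch with a stand-in for the exported model (the dimensions and chunk size are illustrative):

```python
import numpy as np

CONTROL_HZ = 50
CHUNK = 10     # actions executed per inference call -> inference needs only 5 Hz

calls = {"n": 0}

def policy_inference(obs):
    """Stand-in for the exported (ONNX/TorchScript) model: one call returns
    a chunk of CHUNK future actions instead of a single action."""
    calls["n"] += 1
    return np.zeros((CHUNK, 7))

def control_loop(n_steps, get_obs, send_action):
    buffer = []
    for _ in range(n_steps):
        if not buffer:                     # refill every CHUNK control ticks
            buffer = list(policy_inference(get_obs()))
        send_action(buffer.pop(0))

sent = []
control_loop(100, get_obs=lambda: None, send_action=sent.append)
# 100 actions sent at the control rate, with only 100 / CHUNK inference calls
```

The trade-off is that actions within a chunk are executed open-loop, so larger chunks tolerate more latency but react more slowly to disturbances.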

Common Failure Modes and How to Avoid Them

1. Compounding Error (Distribution Shift)

Symptom: The policy works for the first few seconds, then gradually drifts and fails. Cause: Small prediction errors compound over time, pushing the robot into states not represented in the training data. Fix: Use action chunking (ACT), add data augmentation (random image crops, color jitter), or collect additional demonstrations that start from slightly off-nominal positions (a form of DAgger).

2. Mode Averaging

Symptom: The robot moves toward a position between two valid strategies, reaching neither. Cause: The training data contains two different approaches and the loss function averages them. Fix: Use Diffusion Policy (handles multi-modality natively) or standardize the demonstration strategy so all demonstrations follow the same approach.

3. Overfitting to Scene Layout

Symptom: High success rate in the exact training setup; near-zero success when anything changes (object position, lighting, background). Cause: Insufficient scene diversity in training data. Fix: Systematically randomize object positions, lighting, and background during data collection. Apply aggressive image augmentation during training (random crop, color jitter, cutout).
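A minimal numpy stand-in for the crop and color-jitter augmentations (real pipelines typically use torchvision transforms; the crop size and jitter ranges here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=200):
    """Random crop + brightness/contrast jitter for an HxWx3 uint8 frame.
    A numpy stand-in for torchvision-style RandomCrop / ColorJitter."""
    h, w, _ = img.shape
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    out = img[y:y + crop, x:x + crop].astype(np.float32)
    out = out * rng.uniform(0.8, 1.2) + rng.uniform(-20, 20)  # contrast, brightness
    return np.clip(out, 0, 255).astype(np.uint8)

frame = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
aug = augment(frame)    # a different crop and jitter on every call
```

Random crops in particular force the policy to rely on object appearance rather than absolute pixel position, which directly attacks layout overfitting.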

4. Grasp Failure from Camera Mismatch

Symptom: The robot reaches correctly but misses the grasp. Cause: Camera extrinsics shifted between data collection and deployment (camera was bumped, table moved). Even 5 mm of camera shift degrades grasping. Fix: Permanently mount cameras with rigid fixtures. Verify camera extrinsics before every deployment session using a calibration target.

5. Slow or Jerky Motion

Symptom: The policy produces correct trajectories but the robot moves slowly or with visible jitter. Cause: Action normalization mismatch, inference latency too high, or temporal ensembling parameters poorly tuned. Fix: Verify normalization statistics match between training and deployment. Profile inference latency. For ACT, tune the temporal ensembling weight (default 0.01 in the ACT codebase is often too low — try 0.05-0.1).

Sim-to-Real Transfer Considerations

Training in simulation and deploying on real robots (sim-to-real transfer) is appealing because simulation data is essentially free. But the reality gap makes pure sim-to-real unreliable for manipulation tasks.

What Works

  • Sim pre-training + real fine-tuning: Train on 100,000 simulated episodes, then fine-tune on 50-200 real episodes. The sim pre-training provides a strong initialization; the real fine-tuning bridges the reality gap. This consistently outperforms training on real data alone when real data is scarce.
  • Domain randomization: Randomize visual properties (textures, lighting, camera position) and physical properties (friction, mass, joint damping) during simulation training. Forces the policy to be robust to variation, some of which overlaps with real-world variation.
  • Simulation for locomotion: Sim-to-real works much better for locomotion (walking, balancing) than for manipulation. Unitree's G1 and Go2 controllers are trained almost entirely in simulation using Isaac Gym / Isaac Lab.

What Does Not Work (Yet)

  • Pure sim-to-real for contact-rich manipulation: Simulation cannot accurately model the friction, deformation, and contact dynamics of real-world objects. Policies trained purely in sim consistently fail on real grasping tasks involving soft objects, thin objects, or precise insertion.
  • Simulated camera images as a substitute for real images: Despite advances in photorealistic rendering, policies trained on simulated images alone show 30-50% success rate degradation when deployed with real cameras. Neural rendering (NeRF-based domain adaptation) is improving but not yet production-ready.

Resources for Getting Started

SVRC provides the tools and data to start training robot policies today:

  • SVRC Datasets: Browse our collection of open robot manipulation datasets in HDF5 and LeRobot format. Includes single-arm pick-place, bimanual coordination, and dexterous manipulation tasks.
  • Pre-trained Models: Download ACT and Diffusion Policy checkpoints pre-trained on SVRC hardware. Fine-tune on your data for faster convergence.
  • Robotics Library: Step-by-step tutorials covering data collection, preprocessing, training, and deployment for ACT, Diffusion Policy, and OpenVLA.
  • OpenArm Learning Path: A structured 5-day curriculum that takes you from unboxing an OpenArm to deploying a trained imitation learning policy.
  • Research: Publications, technical reports, and benchmarks from the SVRC research team on imitation learning methods and robot manipulation.