What Is Imitation Learning?

Imitation learning (IL) is a family of algorithms that teach a robot to act by learning from expert demonstrations — no reward function required. Formally, we have a dataset of demonstrations D = {(s_t, a_t)} where s_t is the state (sensor readings) at time t and a_t is the action the expert took. The goal is to learn a policy π(a | s) that maps states to actions.

The appeal of IL is practical: it is far easier for a human to show a robot what to do than to hand-engineer a reward function or write motion primitives. IL underpins many of the recent advances in dexterous manipulation.

Behavior Cloning: The Baseline

Behavior Cloning (BC) is the simplest IL algorithm. It treats the problem as supervised learning: given state s, predict action a. You train a neural network by minimizing the mean squared error between predicted and expert actions across all demonstration timesteps.
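As a concrete illustration, here is a minimal BC fit on a synthetic linear "expert" (the toy setup, dimensions, and noise level are all hypothetical). A real pipeline would train a neural network with SGD on the same MSE objective; with a linear policy the MSE minimizer has a closed form (least squares):

```python
import numpy as np

# Hypothetical toy setup: states are 4-D sensor vectors, actions are
# 2-D motor commands, and the "expert" is an unknown linear map + noise.
rng = np.random.default_rng(0)
W_expert = rng.normal(size=(4, 2))
states = rng.normal(size=(500, 4))                              # s_t
actions = states @ W_expert + 0.01 * rng.normal(size=(500, 2))  # a_t

# Behavior cloning: fit pi(a|s) by minimizing mean squared error
# between predicted and expert actions over all demonstration steps.
W_policy, *_ = np.linalg.lstsq(states, actions, rcond=None)

mse = np.mean((states @ W_policy - actions) ** 2)
print(f"training MSE: {mse:.5f}")
```

The point of the sketch is the problem framing, not the model class: BC reduces imitation to ordinary supervised regression on (state, action) pairs.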

BC is fast to train, easy to implement, and works surprisingly well for short-horizon tasks. A standard BC pipeline can be up and running in an afternoon: collect 50 demonstrations, train a ResNet+MLP for 100 epochs, deploy.

The critical limitation: compounding errors. During training, the network only sees states that appear in the demonstrations. At test time, even a tiny prediction error moves the robot to a slightly different state — one it has never seen. The next prediction is then less accurate, the error grows, and within a few seconds the robot is in a completely out-of-distribution state from which it cannot recover.

The Distribution Shift Problem

Distribution shift is the core challenge in imitation learning. The training distribution p_demo(s) (states seen during demonstrations) and the test distribution p_π(s) (states visited by the learned policy) are different. Because BC minimizes error on the training distribution, it gives no guarantees about behavior on the test distribution.

The mathematical insight: for a policy with per-step error ε (measured on the demonstration distribution) over a horizon of T steps, the expected total error grows as O(T²ε), a classic result due to Ross and Bagnell. Doubling the task length quadruples the worst-case compounding error. This is why BC works for 5-second tasks but fails for 30-second tasks without special handling.
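The quadratic growth can be seen by summing the worst-case per-step costs: once the policy has drifted off-distribution at some step, every later step may incur error, so the cost at step t can be as large as t·ε. A small sketch of that arithmetic:

```python
def worst_case_total_error(T: int, eps: float) -> float:
    # sum_{t=1}^{T} t * eps = eps * T * (T + 1) / 2, i.e. O(T^2 * eps)
    return sum(t * eps for t in range(1, T + 1))

eps = 0.01
err_T = worst_case_total_error(250, eps)
err_2T = worst_case_total_error(500, eps)
print(err_T, err_2T, err_2T / err_T)  # ratio approaches 4 as T grows
```

The ratio between the 500-step and 250-step bounds is just under 4, matching the "doubling the length quadruples the error" intuition.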

DAgger: Fixing Distribution Shift

Dataset Aggregation (DAgger) addresses distribution shift iteratively:

  • Step 1: Train an initial BC policy on the original demonstrations.
  • Step 2: Deploy the learned policy in the real environment. The policy will visit new states.
  • Step 3: Have a human expert provide the correct action at every state the policy visits — even the bad ones it reached by mistake.
  • Step 4: Add these (state, corrected action) pairs to the dataset.
  • Step 5: Retrain the policy on the aggregated dataset. Repeat from Step 2.
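The loop above can be sketched end to end. Everything here is a hypothetical stand-in: a 1-D linear "expert," toy dynamics, and a scalar policy refit by least squares each round. The mechanics (roll out, relabel, aggregate, retrain) are the point, not the model:

```python
import numpy as np

rng = np.random.default_rng(1)

def expert_action(s):
    return 2.0 * s                      # stand-in for the human expert

def rollout(w, s0=1.0, T=20):
    """Run the current policy and record every state it visits."""
    s, visited = s0, []
    for _ in range(T):
        visited.append(s)
        a = w * s                              # policy action
        s = s - 0.1 * a + 0.05 * rng.normal()  # toy dynamics + noise
    return visited

# Step 1: initial BC dataset from the original demonstrations.
S = list(rng.normal(size=50))
A = [expert_action(s) for s in S]

for _ in range(5):
    # Step 5 / 1: (re)fit the policy on the aggregated dataset.
    w = np.dot(S, A) / np.dot(S, S)     # 1-D least squares
    # Steps 2-3: roll out the policy; the expert labels visited states.
    new_states = rollout(w)
    # Step 4: aggregate the (state, corrected action) pairs.
    S += new_states
    A += [expert_action(s) for s in new_states]

print(f"learned gain after DAgger: {w:.3f}")
```

Note that the expensive part in practice is exactly Steps 2-3: a human must supply `expert_action` for every state the policy visits, in real time.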

After several rounds, the training distribution closely matches the states the policy actually visits, and compounding error is drastically reduced. DAgger is theoretically sound but expensive: it requires a human expert to be present during robot execution, watching and correcting actions in real time.

Modern Algorithms: ACT, Diffusion Policy, and IBC

Three algorithms dominate the current state of the art in robot imitation learning.

  • ACT — Action Chunking with Transformers: Instead of predicting a single action at each step, ACT predicts a chunk of K future actions at once (typically K=100 at 50Hz, so 2 seconds of motion). The robot executes the chunk, then re-plans. This breaks the compounding error cycle — errors only accumulate within a chunk, not across the whole episode. ACT uses a CVAE encoder to handle multi-modal demonstration data and a Transformer decoder to generate action sequences.
  • Diffusion Policy: Models the action distribution as a denoising diffusion process. Rather than predicting a single action, Diffusion Policy learns to iteratively denoise random noise into a plausible action trajectory. This naturally handles multi-modality — cases where multiple valid actions exist for a given state (e.g., grasping from left or right). A single BC network would average these solutions and predict something invalid; Diffusion Policy can represent both modes.
  • IBC — Implicit Behavioral Cloning: Trains an energy function E(s, a) rather than a direct action predictor. Inference finds the action that minimizes energy via gradient descent or MCMC sampling. Robust to multi-modal distributions but more computationally expensive at inference time.
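To make the chunking idea concrete, here is a minimal sketch of ACT-style execution, with `predict_chunk` as a hypothetical stand-in for the trained model (a real policy would run a CVAE/Transformer forward pass there). The key structural point is that the model is queried once per K steps, so prediction errors can only compound within a chunk:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 100  # chunk size (ACT typically uses K = 100 at 50 Hz, ~2 s)

def predict_chunk(state):
    """Stand-in for the trained policy: returns K future actions."""
    return np.tanh(state) + 0.01 * rng.normal(size=(K,))

def run_episode(s0=0.5, T=500):
    s, executed, replans = s0, 0, 0
    while executed < T:
        chunk = predict_chunk(s)         # plan ~2 s of motion at once
        replans += 1
        for a in chunk[: T - executed]:  # execute the chunk open-loop
            s = 0.99 * s + 0.01 * a      # toy dynamics
            executed += 1
    return replans

print(run_episode())  # 500 steps / 100-step chunks -> 5 model queries
```

For a 500-step episode, a single-step BC policy would be queried 500 times, with each query's error feeding the next; the chunked policy is queried only 5 times.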

Algorithm Comparison

| Algorithm        | Handles Multi-Modal  | Action Horizon    | Training Speed     | Inference Speed | Best For                                   |
|------------------|----------------------|-------------------|--------------------|-----------------|--------------------------------------------|
| Behavior Cloning | No (averages modes)  | Single step       | Very fast          | Very fast       | Short tasks, simple grasps                 |
| DAgger           | No                   | Single step       | Fast per iteration | Very fast       | Tasks where expert can correct live        |
| ACT              | Partial (CVAE)       | Chunk (K steps)   | Fast               | Fast            | Bimanual, contact-rich tasks               |
| Diffusion Policy | Yes                  | Horizon (T steps) | Slow               | Moderate        | Complex, multi-modal tasks                 |
| IBC              | Yes                  | Single step       | Moderate           | Slow            | Research; multi-modal with energy landscape |

For practitioners starting out: begin with BC to validate your data pipeline and hardware setup. Move to ACT or Diffusion Policy once you need to tackle longer-horizon or contact-rich tasks. The choice between ACT and Diffusion Policy depends more on your task structure than on raw performance — both are excellent baselines in 2025.