What Is Imitation Learning?

Imitation learning (IL) is a family of algorithms that teach a robot to act by learning from expert demonstrations — no reward function required. Formally, we have a dataset of demonstrations D = {(s_t, a_t)} where s_t is the state (sensor readings) at time t and a_t is the action the expert took. The goal is to learn a policy π(a | s) that maps states to actions.

The appeal of IL is practical: it is far easier for a human to show a robot what to do than to hand-engineer a reward function or write motion primitives. IL underpins many of the recent advances in dexterous manipulation.

Behavior Cloning: The Baseline

Behavior Cloning (BC) is the simplest IL algorithm. It treats the problem as supervised learning: given state s, predict action a. You train a neural network by minimizing the mean squared error between predicted and expert actions across all demonstration timesteps.
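As a concrete illustration, here is a minimal BC fit on a synthetic linear "expert" (the toy setup, dimensions, and noise level are all hypothetical). A real pipeline would train a neural network with SGD on the same MSE objective; with a linear policy the MSE minimizer has a closed form (least squares):

```python
import numpy as np

# Hypothetical toy setup: states are 4-D sensor vectors, actions are
# 2-D motor commands, and the "expert" is an unknown linear map + noise.
rng = np.random.default_rng(0)
W_expert = rng.normal(size=(4, 2))
states = rng.normal(size=(500, 4))                              # s_t
actions = states @ W_expert + 0.01 * rng.normal(size=(500, 2))  # a_t

# Behavior cloning: fit pi(a|s) by minimizing mean squared error
# between predicted and expert actions over all demonstration steps.
W_policy, *_ = np.linalg.lstsq(states, actions, rcond=None)

mse = np.mean((states @ W_policy - actions) ** 2)
print(f"training MSE: {mse:.5f}")
```

The point of the sketch is the problem framing, not the model class: BC reduces imitation to ordinary supervised regression on (state, action) pairs.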

BC is fast to train, easy to implement, and works surprisingly well for short-horizon tasks. A standard BC pipeline can be up and running in an afternoon: collect 50 demonstrations, train a ResNet+MLP for 100 epochs, deploy.

The critical limitation: compounding errors. During training, the network only sees states that appear in the demonstrations. At test time, even a tiny prediction error moves the robot to a slightly different state — one it has never seen. The next prediction is then less accurate, the error grows, and within a few seconds the robot is in a completely out-of-distribution state from which it cannot recover.

The Distribution Shift Problem

Distribution shift is the core challenge in imitation learning. The training distribution p_demo(s) (states seen during demonstrations) and the test distribution p_π(s) (states visited by the learned policy) are different. Because BC minimizes error on the training distribution, it gives no guarantees about behavior on the test distribution.

The mathematical insight: for a policy with per-step error ε (measured on the demonstration distribution) over a horizon of T steps, the expected total error grows as O(T²ε), a classic result due to Ross and Bagnell. Doubling the task length quadruples the worst-case compounding error. This is why BC works for 5-second tasks but fails for 30-second tasks without special handling.
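The quadratic growth can be seen by summing the worst-case per-step costs: once the policy has drifted off-distribution at some step, every later step may incur error, so the cost at step t can be as large as t·ε. A small sketch of that arithmetic:

```python
def worst_case_total_error(T: int, eps: float) -> float:
    # sum_{t=1}^{T} t * eps = eps * T * (T + 1) / 2, i.e. O(T^2 * eps)
    return sum(t * eps for t in range(1, T + 1))

eps = 0.01
err_T = worst_case_total_error(250, eps)
err_2T = worst_case_total_error(500, eps)
print(err_T, err_2T, err_2T / err_T)  # ratio approaches 4 as T grows
```

The ratio between the 500-step and 250-step bounds is just under 4, matching the "doubling the length quadruples the error" intuition.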

DAgger: Fixing Distribution Shift

Dataset Aggregation (DAgger) addresses distribution shift iteratively:

  • Step 1: Train an initial BC policy on the original demonstrations.
  • Step 2: Deploy the learned policy in the real environment. The policy will visit new states.
  • Step 3: Have a human expert provide the correct action at every state the policy visits — even the bad ones it reached by mistake.
  • Step 4: Add these (state, corrected action) pairs to the dataset.
  • Step 5: Retrain the policy on the aggregated dataset. Repeat from Step 2.
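The loop above can be sketched end to end. Everything here is a hypothetical stand-in: a 1-D linear "expert," toy dynamics, and a scalar policy refit by least squares each round. The mechanics (roll out, relabel, aggregate, retrain) are the point, not the model:

```python
import numpy as np

rng = np.random.default_rng(1)

def expert_action(s):
    return 2.0 * s                      # stand-in for the human expert

def rollout(w, s0=1.0, T=20):
    """Run the current policy and record every state it visits."""
    s, visited = s0, []
    for _ in range(T):
        visited.append(s)
        a = w * s                              # policy action
        s = s - 0.1 * a + 0.05 * rng.normal()  # toy dynamics + noise
    return visited

# Step 1: initial BC dataset from the original demonstrations.
S = list(rng.normal(size=50))
A = [expert_action(s) for s in S]

for _ in range(5):
    # Step 5 / 1: (re)fit the policy on the aggregated dataset.
    w = np.dot(S, A) / np.dot(S, S)     # 1-D least squares
    # Steps 2-3: roll out the policy; the expert labels visited states.
    new_states = rollout(w)
    # Step 4: aggregate the (state, corrected action) pairs.
    S += new_states
    A += [expert_action(s) for s in new_states]

print(f"learned gain after DAgger: {w:.3f}")
```

Note that the expensive part in practice is exactly Steps 2-3: a human must supply `expert_action` for every state the policy visits, in real time.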

After several rounds, the training distribution closely matches the states the policy actually visits, and compounding error is drastically reduced. DAgger is theoretically sound but expensive: it requires a human expert to be present during robot execution, watching and correcting actions in real time.

Modern Algorithms: ACT, Diffusion Policy, and IBC

Three algorithms dominate the current state of the art in robot imitation learning.

  • ACT — Action Chunking with Transformers: Instead of predicting a single action at each step, ACT predicts a chunk of K future actions at once (typically K=100 at 50Hz, so 2 seconds of motion). The robot executes the chunk, then re-plans. This breaks the compounding error cycle — errors only accumulate within a chunk, not across the whole episode. ACT uses a CVAE encoder to handle multi-modal demonstration data and a Transformer decoder to generate action sequences.
  • Diffusion Policy: Models the action distribution as a denoising diffusion process. Rather than predicting a single action, Diffusion Policy learns to iteratively denoise random noise into a plausible action trajectory. This naturally handles multi-modality — cases where multiple valid actions exist for a given state (e.g., grasping from left or right). A single BC network would average these solutions and predict something invalid; Diffusion Policy can represent both modes.
  • IBC — Implicit Behavioral Cloning: Trains an energy function E(s, a) rather than a direct action predictor. Inference finds the action that minimizes energy via gradient descent or MCMC sampling. Robust to multi-modal distributions but more computationally expensive at inference time.
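To make the chunking idea concrete, here is a minimal sketch of ACT-style execution, with `predict_chunk` as a hypothetical stand-in for the trained model (a real policy would run a CVAE/Transformer forward pass there). The key structural point is that the model is queried once per K steps, so prediction errors can only compound within a chunk:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 100  # chunk size (ACT typically uses K = 100 at 50 Hz, ~2 s)

def predict_chunk(state):
    """Stand-in for the trained policy: returns K future actions."""
    return np.tanh(state) + 0.01 * rng.normal(size=(K,))

def run_episode(s0=0.5, T=500):
    s, executed, replans = s0, 0, 0
    while executed < T:
        chunk = predict_chunk(s)         # plan ~2 s of motion at once
        replans += 1
        for a in chunk[: T - executed]:  # execute the chunk open-loop
            s = 0.99 * s + 0.01 * a      # toy dynamics
            executed += 1
    return replans

print(run_episode())  # 500 steps / 100-step chunks -> 5 model queries
```

For a 500-step episode, a single-step BC policy would be queried 500 times, with each query's error feeding the next; the chunked policy is queried only 5 times.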

Algorithm Comparison

| Algorithm        | Handles Multi-Modal  | Action Horizon    | Training Speed     | Inference Speed | Best For                                   |
|------------------|----------------------|-------------------|--------------------|-----------------|--------------------------------------------|
| Behavior Cloning | No (averages modes)  | Single step       | Very fast          | Very fast       | Short tasks, simple grasps                 |
| DAgger           | No                   | Single step       | Fast per iteration | Very fast       | Tasks where expert can correct live        |
| ACT              | Partial (CVAE)       | Chunk (K steps)   | Fast               | Fast            | Bimanual, contact-rich tasks               |
| Diffusion Policy | Yes                  | Horizon (T steps) | Slow               | Moderate        | Complex, multi-modal tasks                 |
| IBC              | Yes                  | Single step       | Moderate           | Slow            | Research; multi-modal with energy landscape |

For practitioners starting out: begin with BC to validate your data pipeline and hardware setup. Move to ACT or Diffusion Policy once you need to tackle longer-horizon or contact-rich tasks. The choice between ACT and Diffusion Policy depends more on your task structure than on raw performance — both are excellent baselines in 2025.