The Problem with Naive Behavior Cloning
Standard behavior cloning (BC) trains a policy π(a|s) to predict the next action given the current observation. This works well for short, low-variability tasks, but breaks down in two critical ways as task complexity increases.
First, compounding errors: small prediction errors push the robot into slightly unfamiliar states, which cause larger errors, which shift the state further, until the policy fails catastrophically well before the task is complete. This is the classic covariate-shift problem that motivated DAgger (Ross et al., 2011). In practice, BC policies for table-top tasks often succeed around 80% of the time during training rollouts (which stay on-distribution) but only 40–50% in deployment on the same task.
Second, mode averaging: when multiple valid strategies exist for the same task (e.g., grasping an object from the left or the right), a unimodal BC policy averages the demonstrations, producing an intermediate trajectory that executes neither strategy successfully.
What Is Action Chunking?
Action chunking, introduced in the ACT paper (Zhao et al., 2023), addresses compounding errors by predicting a sequence of H future actions from the current observation, then executing the first k actions before querying the policy again.
The policy π(a_{t:t+H} | s_t) outputs a chunk of H actions at once. With H=100 at 50 Hz control, a single query covers 2 seconds of motion. The key insight: by predicting a short-horizon plan rather than a single step, the policy leverages temporal context to commit to a coherent strategy, avoiding the averaging problem.
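The execute-k-then-re-query loop described above can be sketched as follows. This is a minimal illustration, not any specific library's API: `policy`, `get_obs`, and `send_action` are hypothetical stand-ins for a trained chunking policy and a robot interface.

```python
import numpy as np

H = 100          # chunk size: timesteps predicted per query
K = 25           # actions executed before re-planning
ACTION_DIM = 7   # e.g. a 7-DoF arm

def policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in for a trained chunking policy pi(a_{t:t+H} | s_t)."""
    return np.zeros((H, ACTION_DIM))  # placeholder chunk

def control_loop(get_obs, send_action, total_steps=500):
    """Execute the first K actions of each predicted chunk, then re-query."""
    t = 0
    while t < total_steps:
        chunk = policy(get_obs())      # one query covers H future steps
        for action in chunk[:K]:       # commit to the first K actions only
            send_action(action)
            t += 1
            if t >= total_steps:
                break
```

With H=100 and K=25 at 50 Hz, the policy is queried every 0.5 seconds while each query still plans 2 seconds ahead.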
- Chunk size H: The number of future timesteps predicted in a single query. For tabletop manipulation at 50 Hz, H=100 (2 seconds) is the standard ACT configuration. Larger H covers more of the task but gives the policy less opportunity to react to unexpected states.
- Re-planning frequency k: The number of actions executed before the policy is queried again. ACT by default executes the full chunk without re-planning (k=H). With temporal ensembling, the policy is instead queried at every timestep, and the overlapping chunk predictions for the current step are combined with an exponentially weighted average, which smooths execution.
- CVAE encoder: ACT trains a Conditional Variational Autoencoder whose encoder compresses each demonstration's action sequence into a latent style variable z; at test time z is set to the prior mean. Conditioning on z lets the policy commit to one strategy rather than averaging, and is ACT's primary mechanism for handling multi-modal demonstrations.
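The temporal ensemble mentioned above can be sketched as a weighted average over the overlapping predictions for the current timestep. The exponential weighting scheme below follows the form described in the ACT paper (oldest prediction weighted highest); the function name and signature are illustrative.

```python
import numpy as np

def temporal_ensemble(predictions, m=0.01):
    """
    Combine overlapping chunk predictions for the same timestep using
    exponential weights w_i = exp(-m * i), where i = 0 is the oldest
    prediction. Smaller m spreads weight more evenly across predictions.

    predictions: list of action vectors for the current timestep,
                 ordered oldest first.
    """
    preds = np.stack(predictions)                  # (n, action_dim)
    weights = np.exp(-m * np.arange(len(preds)))   # oldest gets weight 1.0
    weights /= weights.sum()                       # normalize to sum to 1
    return weights @ preds                         # weighted average action
```

In practice this buffer is refreshed every timestep: each new policy query contributes one more prediction for the current step.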
Diffusion Policy: Chunking via Denoising
Diffusion Policy (Chi et al., 2023) achieves similar multi-modality handling through a different mechanism: modeling the action distribution as a diffusion process over a chunk of actions. The policy denoises a noisy action sequence conditioned on the current observation.
- U-Net diffusion: A convolutional U-Net denoises action chunks. Faster inference (~10–50 ms on GPU) than transformer-based diffusion. Preferred for tasks with structured, repetitive motions.
- Transformer diffusion: Better at capturing long-range dependencies in action sequences. Slower (~100–300 ms). Better suited for complex bimanual or multi-step tasks.
- Typical chunk size: H=8–16 for 10 Hz control policies; H=32–64 for 50 Hz policies. Diffusion policy re-plans every chunk by default.
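Inference in a diffusion policy amounts to iteratively denoising an action chunk. The sketch below shows a bare-bones DDPM-style sampling loop over a chunk; `denoiser` is a placeholder for the trained U-Net or transformer noise predictor, and the schedule values are generic illustrative choices, not the paper's exact configuration.

```python
import numpy as np

H, ACTION_DIM, STEPS = 16, 7, 50     # chunk size, action dim, denoising steps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, STEPS)   # linear noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x, obs, t):
    """Placeholder for the learned noise predictor eps_theta(x_t, s, t)."""
    return np.zeros_like(x)

def sample_chunk(obs):
    """Sample one H-step action chunk by iterative denoising from pure noise."""
    x = rng.standard_normal((H, ACTION_DIM))
    for t in reversed(range(STEPS)):
        eps = denoiser(x, obs, t)
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

The number of denoising steps is the main inference-latency knob, which is why U-Net variants with fewer, cheaper steps land in the 10–50 ms range quoted above.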
Task Horizon Analysis
The right chunk size and re-planning frequency depend heavily on the task horizon:
- Short-horizon tasks (<3 seconds): Simple pick-and-place, cup stacking. BC often works adequately. If using ACT, H=50–100 at 50 Hz covers the whole task in one or two queries. Diffusion with H=16–32 is also effective.
- Medium-horizon tasks (3–15 seconds): Peg insertion, cable routing, food assembly. This is ACT's sweet spot. H=100 covers 2 seconds; re-planning every 25–50 steps (0.5–1 second) keeps the policy reactive. Temporal ensemble recommended.
- Long-horizon tasks (>30 seconds): Full cooking workflows, multi-drawer assembly, room organization. Single-level chunking fails here: no chunk can cover the full task, and very long chunks lose reactivity. Hierarchical approaches (a high-level task planner plus a low-level chunked executor) are necessary. Vision-language-action models like RT-2 or OpenVLA can handle high-level task decomposition, while ACT or a diffusion policy handles the low-level execution.
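The arithmetic behind these horizon bands is simple and worth making explicit. The helper below, purely illustrative, computes how many policy queries are needed to cover a task given the control rate and re-planning interval:

```python
import math

def queries_needed(task_seconds: float, control_hz: float, replan_every: int) -> int:
    """Number of policy queries to cover a task when re-planning every
    `replan_every` control steps at `control_hz`. Illustrative arithmetic."""
    total_steps = math.ceil(task_seconds * control_hz)
    return math.ceil(total_steps / replan_every)

# A 10 s task at 50 Hz, re-planning every 25 steps: 500 / 25 = 20 queries.
# A 2 s task at 50 Hz with full-chunk execution (H=100): a single query.
```

A task of over 30 seconds at 50 Hz is 1500+ steps; even with H=200 that is many chunks stitched end to end, which is why the hierarchical decomposition above becomes necessary.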
Training Data Requirements
Chunking changes your data requirements in subtle ways:
- Temporal consistency requirement: Action chunking assumes that consecutive timesteps in a demonstration form a coherent plan. Hesitations, corrections, and backtracking within a demonstration confuse the chunk predictor. For noisy tasks, action chunking can require roughly 2× more data than plain BC to reach the same performance, because a larger fraction of the demos must be high-quality.
- Multi-modal coverage: The CVAE in ACT needs roughly 20–30 demonstrations of each distinct strategy to learn a clean latent mode. If your task has 3 valid strategies, that is 60–90 demos for mode coverage alone, on top of whatever is needed for overall policy quality.
- Episode length normalization: Variable-length episodes create challenges for fixed-chunk prediction. Consider padding episodes to a standard length or using chunk masking during training.
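The padding-plus-masking idea from the last bullet can be sketched concretely. For each timestep of a variable-length episode, we build a fixed-length chunk target, zero-padded past the episode end, along with a mask so the padded steps can be excluded from the loss. The function name and return layout are this sketch's own conventions.

```python
import numpy as np

def make_chunk_targets(actions: np.ndarray, H: int):
    """
    Build per-timestep chunk targets from one episode of demonstrated actions.

    actions: (T, action_dim) array of actions from a single episode.
    Returns:
      chunks: (T, H, action_dim), where chunks[t] = actions[t:t+H],
              zero-padded once the episode runs out.
      mask:   (T, H) float array, 1.0 for real steps and 0.0 for padding,
              for masking the training loss.
    """
    T, adim = actions.shape
    chunks = np.zeros((T, H, adim), dtype=actions.dtype)
    mask = np.zeros((T, H), dtype=np.float32)
    for t in range(T):
        n = min(H, T - t)              # steps remaining in the episode
        chunks[t, :n] = actions[t:t + n]
        mask[t, :n] = 1.0
    return chunks, mask
```

Masking the loss rather than truncating episodes keeps the final timesteps (often the most delicate part of a manipulation, e.g. the final insertion) in the training set.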
Performance Benchmarks
| Algorithm | ALOHA Insertion | ALOHA Transfer Cube | ALOHA Slot Battery | Inference Latency |
|---|---|---|---|---|
| Behavior Cloning (BC) | 45% | 62% | 30% | 2–5 ms |
| ACT (H=100) | 80% | 92% | 65% | 15–30 ms |
| Diffusion Policy (U-Net) | 76% | 88% | 60% | 10–50 ms |
| Diffusion Policy (Transformer) | 82% | 91% | 68% | 100–300 ms |
| ACT + temporal ensemble | 83% | 94% | 70% | 20–40 ms |
Results above from the original ACT paper (Zhao et al., 2023) and Chi et al. (2023) on the ALOHA bimanual manipulation benchmark. Your results will vary based on task, data quality, and hardware.
For recommendations on which algorithm fits your task and data situation, see the SVRC training platform or the foundation models survey.
Recommendations by Task Type
| Task Type | Recommended Algorithm | Chunk Size | Notes |
|---|---|---|---|
| Simple pick-and-place | BC or ACT | N/A or H=50 | BC sufficient if single strategy |
| Precision insertion | ACT or Diffusion-T | H=100 | Multi-modality in approach angle |
| Bimanual coordination | ACT | H=100–200 | Joint state space required |
| Long-horizon (>30s) | Hierarchical (VLM + ACT) | H=50–100 per subtask | Subtask decomposition required |
| Deformable objects | Diffusion Policy | H=32–64 | Diffusion handles uncertainty better |
| High-speed (>5 Hz query) | BC or U-Net Diffusion | H=8–16 | Transformer diffusion too slow |