ACT Policy Explained: Action Chunking with Transformers for Robot Learning
ACT — Action Chunking with Transformers — became one of the most widely adopted imitation learning algorithms for dexterous manipulation after its publication by Tony Zhao and collaborators at Stanford. Here is a practical explanation of how it works and how to use it.
What Is ACT?
ACT is an imitation learning algorithm designed for fine-grained manipulation tasks where the robot must make smooth, coordinated movements based on visual observations. At inference time, ACT takes the current images from each of the robot's cameras and the current joint state, and outputs a chunk of future actions — a short sequence of joint position targets — rather than a single next action. The robot executes this chunk, then re-queries the policy for the next chunk. This predict-many-steps-ahead design is the defining feature of ACT and the source of most of its advantages over simpler behavior cloning.
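The query-execute-requery cycle can be sketched in a few lines. This is a minimal illustration, not the reference implementation: `predict`, `get_obs`, and `send_action` are hypothetical stand-ins for the policy network and the robot I/O layer, and `predict` is assumed to return a `(chunk_size, action_dim)` array of joint position targets.

```python
import numpy as np

def chunked_control_loop(predict, get_obs, send_action,
                         chunk_size=50, total_steps=200):
    """Sketch of ACT's chunked inference loop (without temporal ensembling).

    predict, get_obs, and send_action are placeholder callables; predict
    returns a (chunk_size, action_dim) array of joint position targets.
    """
    done = 0
    queries = 0
    while done < total_steps:
        # One policy query produces a whole chunk of future actions.
        chunk = predict(get_obs())[:chunk_size]
        queries += 1
        # Execute the chunk open-loop, stopping at the step budget.
        for action in chunk[: total_steps - done]:
            send_action(action)
            done += 1
    return queries
```

With a 50-step chunk, a 200-step episode needs only four policy queries instead of 200, which is also why chunking cuts inference cost.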
ACT was introduced in the context of the ALOHA bimanual manipulation system and demonstrated success on tasks previously considered out of reach for imitation learning: slotting a battery, opening a ziploc bag, threading a needle. Its core insight — that chunked action prediction reduces compounding errors and smooths trajectories — has since been adopted in numerous follow-on algorithms.
How Action Chunking Works
Standard behavior cloning (BC) trains a policy to predict the next single action given the current observation. At inference time, prediction errors accumulate: each small mistake shifts the robot's state slightly, putting it in a distribution the policy was not trained on, which causes the next prediction to be worse, and so on. This compounding error is the central failure mode of naive BC on fine manipulation tasks.
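A toy simulation makes the compounding-error argument concrete. This is an illustrative model only (the growth factor and noise scale are arbitrary assumptions, not measurements): each policy query adds a small error, and the further the state has drifted off-distribution, the larger the next query's error. Querying once per chunk instead of once per step means far fewer compounding opportunities.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_error(horizon, chunk_size, noise=0.05, drift_gain=0.3):
    """Toy compounding-error model: every policy query multiplies the
    accumulated drift (off-distribution states cause worse predictions)
    and adds fresh noise. Fewer queries -> fewer compounding steps."""
    error = 0.0
    for _ in range(0, horizon, chunk_size):
        error = error * (1 + drift_gain) + rng.normal(0, noise)
    return abs(error)

# Average drift over 200 toy episodes of 100 steps each.
single_step = np.mean([rollout_error(100, chunk_size=1) for _ in range(200)])
chunked = np.mean([rollout_error(100, chunk_size=50) for _ in range(200)])
```

In this toy model the single-step policy's drift grows geometrically with the number of queries, while the chunked policy compounds only twice per episode.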
Action chunking breaks this cycle by predicting a sequence of k future actions — typically 50–100 steps at 50 Hz, corresponding to 1–2 seconds of motion. In the simplest scheme, the policy commits to this plan and executes it fully before re-querying; because the plan was generated from a single consistent observation, the trajectory is smooth and internally consistent. Temporal ensembling refines this: the policy is queried at every timestep, and all past predictions that cover the current step are averaged with exponential weights, which further smooths execution and removes the jitter that would otherwise appear at chunk boundaries.
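The ensembling step itself is a weighted average. The sketch below follows the exponential weighting scheme from the ACT paper, w_i = exp(-m·i), where index 0 is the oldest prediction (which receives the largest weight) and smaller m incorporates newer observations faster; the function name and call shape are illustrative.

```python
import numpy as np

def ensemble_step(overlapping_actions, m=0.01):
    """Combine all past predictions for the current timestep.

    overlapping_actions: list of (action_dim,) arrays, ordered oldest
    prediction first. Weights w_i = exp(-m * i) give the oldest
    prediction the largest weight, per the ACT paper's scheme.
    """
    acts = np.stack(overlapping_actions)          # (n_predictions, action_dim)
    w = np.exp(-m * np.arange(len(acts)))
    return (w[:, None] * acts).sum(axis=0) / w.sum()
```

When all overlapping predictions agree, the ensemble returns that action unchanged; when they disagree, it blends them, favoring the older (more committed) plans.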
ACT Architecture
ACT uses a CVAE (Conditional Variational Autoencoder) architecture. During training, an encoder processes the entire demonstration trajectory — images, joint states, and actions — and produces a latent style variable z that captures the "style" of the demonstration (fast vs slow, left-leaning vs right-leaning approach, etc.). A transformer-based decoder then takes the current observation, the latent z, and positional encodings, and predicts the action chunk. At inference time, z is set to zero (the mean of the prior), making the policy deterministic given the observation.
The vision backbone is typically a ResNet-18 processing each camera view independently, with the resulting feature maps passed as tokens to the transformer decoder. Multiple camera views — wrist cameras plus overhead cameras — each contribute a token stream, giving the policy rich spatial information about the manipulation scene.
Data Requirements and What Constitutes Good Data
ACT works well with 50–200 demonstrations per task in most published results. However, data quality matters more than quantity. Demonstrations should be smooth and purposeful — the ACT policy will learn whatever motion pattern is in the data, including hesitations, corrections, and suboptimal approaches. SVRC's data collection standard requires operators to restart an episode rather than continue after a visible error, ensuring the training dataset contains only intentional, successful behaviors.
Camera consistency is also critical. If camera placement changes between recording sessions, the visual features the policy learned will no longer match the deployment setup. Use physical mounts rather than flexible arms, and log the camera calibration parameters with each dataset. SVRC's multi-camera recording pipeline enforces this automatically.
ACT vs Behavior Cloning: Results
On the original ALOHA tasks, ACT achieved success rates of 80–95% compared to 20–50% for standard BC on the same data. The improvement is most pronounced on tasks requiring precise timing, smooth coordination between two arms, and graceful recovery from small perturbations. On simpler pick-and-place tasks with forgiving tolerances, the gap between ACT and BC narrows. ACT also outperforms Diffusion Policy on tasks where execution speed matters, since diffusion-based policies require more computation per inference step.
Training ACT with SVRC Data
SVRC's data platform exports datasets in LeRobot-compatible HDF5 format, which is the standard input format for the open-source ACT training code. After downloading your dataset, training a baseline ACT policy requires a GPU with at least 16 GB VRAM and approximately 8 hours of training for a single task. SVRC engineering support is available to help teams configure training runs, tune chunk size and learning rate, and evaluate policy performance. For hardware to collect your own data, see our hardware catalog or explore robot leasing options.
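Before launching a training run it is worth sanity-checking the exported episodes. The snippet below lists every dataset in one episode file with its shape; an ALOHA-style HDF5 layout (e.g. `/observations/qpos`, `/observations/images/<camera>`, `/action`) is assumed for illustration, so adapt the keys to whatever your export actually contains.

```python
import h5py

def episode_summary(path):
    """Return {dataset_name: shape} for every dataset in one episode file.

    Assumes an ALOHA-style HDF5 episode layout; the exact keys depend on
    your export pipeline.
    """
    shapes = {}
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return shapes
```

Checking that every episode reports the same action dimension and camera set catches the most common training-time failures (mismatched cameras, truncated episodes) before you spend GPU hours.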