1. What Is Action Chunking?
Action chunking is a technique in imitation learning where the policy predicts a sequence of future actions — a "chunk" — rather than a single next action. The robot executes part of this chunk, then re-queries the policy for a new overlapping chunk. The result is smoother, more temporally consistent trajectories that resist the compounding error problem that plagues single-step behavioral cloning (BC).
In standard BC, the policy predicts one action per control step. Each prediction error shifts the robot into a state the policy has never seen, causing the next prediction to be worse, and so on. Over a 200-step trajectory at 50 Hz (4 seconds of manipulation), even a 1% per-step error compounds to an ~87% probability of drifting off the demonstration distribution. Action chunking reduces this by committing to plans of k steps (typically 50–100 at 50 Hz, corresponding to 1–2 seconds of motion), cutting the number of independent decision points from 200 down to 200/k — as few as 2–4.
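This compounding arithmetic is worth verifying directly: with per-step error probability p and T independent decision points, the chance of drifting off-distribution is 1 − (1 − p)^T. A quick sketch (the helper name is ours, and the independence assumption is a simplification that real rollouts only approximate):

```python
def drift_probability(per_step_error, num_decisions):
    # Chance of at least one error over `num_decisions` independent decisions
    return 1 - (1 - per_step_error) ** num_decisions

# Single-step BC: 200 decision points at 1% per-step error
print(round(drift_probability(0.01, 200), 3))  # 0.866

# Chunked policy with k=100: only 200/100 = 2 decision points
print(round(drift_probability(0.01, 200 // 100), 3))  # 0.02
```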
Chunk Size: The Most Important Hyperparameter
Chunk size k determines the trade-off between trajectory smoothness and environmental reactivity. Larger chunks produce smoother motions but cannot react to unexpected events (object shifted, obstacle appeared) until the current chunk finishes executing. Smaller chunks react faster but lose the smoothness advantage and re-introduce compounding error.
Recommended starting points by task type:
- Tabletop pick-and-place: k=50–100 (1–2 seconds). The environment is static; smoothness is more valuable than reactivity.
- Assembly and insertion: k=30–60. Contact-rich phases need more frequent re-planning to adjust for alignment errors.
- Bimanual coordination: k=50–80. Two arms need coordinated chunks; too-short chunks break coordination.
- Mobile manipulation: k=20–40 for the base, k=50–80 for the arm. The base needs faster obstacle reaction.
- Dynamic tasks (catching, dodging): k=10–20. Reactivity is critical; accept some smoothness loss.
Why Action Chunking Works: The Temporal Smoothness Argument
A single-step BC policy treats each time step independently. Even if the model perfectly captures the conditional distribution P(a_t | o_t), independence across steps means the trajectory lacks temporal coherence — successive actions can be individually correct but mutually inconsistent (e.g., accelerating and decelerating in alternating steps).
By predicting k steps simultaneously, the model must produce an internally coherent trajectory segment. The training loss penalizes the entire chunk jointly, forcing the network to learn smooth motion primitives rather than isolated frame-by-frame predictions. This is analogous to how language models produce more coherent text when generating multiple tokens at once versus sampling one token at a time.
2. ACT Architecture
ACT uses a Conditional Variational Autoencoder (CVAE) with a Transformer backbone. The architecture has three main components: a CVAE encoder (used only during training), a Transformer decoder (used during both training and inference), and a vision backbone.
CVAE Encoder (Training Only)
During training, the encoder processes the full demonstration — images, joint states, and ground-truth actions — and produces a latent style variable z (dimension 32) that captures demonstration-specific variation. Different operators may approach the same task from different angles or speeds; z encodes this variation.
The encoder is a Transformer that attends over the full action sequence and the current observation. It outputs a mean and log-variance parameterizing a Gaussian, from which z is sampled via the reparameterization trick. A KL divergence term (weighted by beta, typically 10–100) regularizes z toward a standard normal prior N(0, I).
At inference, z is set to zero (the prior mean), making the policy deterministic. The CVAE structure is purely a training aid: without it, the model averages across demonstration styles, producing motions that are the mean of all approaches and therefore none of them (mode averaging). With the CVAE, the model learns to decode each style separately, and the zero-mean z at inference produces the most "typical" execution.
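The two mechanisms described above, the reparameterization trick and the KL penalty toward N(0, I), are compact enough to sketch in numpy (function names are ours; the real model computes these on GPU tensors inside the training loop):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps, with eps ~ N(0, I): keeps sampling differentiable
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimension
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

rng = np.random.default_rng(0)
mu, logvar = np.zeros(32), np.zeros(32)  # latent_dim = 32, as in ACT

z_train = reparameterize(mu, logvar, rng)  # stochastic during training
z_infer = np.zeros(32)                     # prior mean at inference

# When the posterior equals the prior, the KL penalty vanishes
print(kl_to_standard_normal(mu, logvar))  # 0.0
```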
Transformer Decoder (Training + Inference)
The decoder takes three inputs: the latent z (or zero), observation tokens from the vision backbone + proprioception, and k learnable positional queries (one per action in the chunk). Standard Transformer cross-attention lets the positional queries attend to the observation tokens, producing k action predictions in parallel.
Architecture details from the original ACT implementation:
- Encoder layers: 4 (processes observation + z into context tokens)
- Decoder layers: 7 (deeper because it does more work: generating the full action chunk)
- Hidden dimension: 512
- Attention heads: 8
- Feedforward dimension: 2,048
- Total parameters: ~40M (excluding vision backbone)
- Action output: k joint position targets (e.g., k=100 steps x 14 joints for bimanual ALOHA)
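The ~40M figure can be sanity-checked with a back-of-envelope count of the weight matrices (biases, layer norms, embeddings, and the action head are ignored, so this is an approximation, not the exact count):

```python
d, d_ff = 512, 2048  # hidden and feedforward dimensions from the list above

attn = 4 * d * d       # Q, K, V, and output projection matrices
ffn = 2 * d * d_ff     # two linear layers

encoder_layer = attn + ffn        # self-attention + FFN
decoder_layer = 2 * attn + ffn    # self-attention + cross-attention + FFN

total = 4 * encoder_layer + 7 * decoder_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~42M, close to the quoted ~40M
```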
Vision Backbone
Each camera view is processed independently through a ResNet-18, producing a feature map (15x20x512 from a 480x640 input) that is flattened into 300 tokens per camera. With 2–4 cameras, the decoder receives 600–1,200 visual tokens.
The minimum viable camera setup is 2 views: one wrist-mounted camera (close-up manipulation detail) and one overhead camera (spatial context). A three-camera setup (1 wrist + 2 third-person at different angles) is the SVRC standard and consistently improves success rates by 5–15% on precision tasks over 2-camera setups.
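The token counts follow directly from ResNet-18's 32x spatial downsampling (the helper name is ours; this assumes the standard backbone with no extra pooling):

```python
def vision_tokens(height, width, n_cameras, stride=32):
    # ResNet-18 reduces spatial resolution by 32x; each feature-map
    # cell becomes one token for the Transformer.
    return (height // stride) * (width // stride) * n_cameras

print(vision_tokens(480, 640, 1))  # 15 * 20 = 300 tokens per camera
print(vision_tokens(480, 640, 4))  # 1200 tokens with four cameras
```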
3. Data Requirements
ACT learns from human teleoperation demonstrations stored as synchronized (observation, action) pairs. Data quality matters more than quantity — 50 clean demonstrations outperform 500 noisy ones.
Minimum Dataset Sizes by Task Complexity
| Task Type | Demonstrations Needed | Approx. Collection Time | Notes |
|---|---|---|---|
| Simple pick-and-place (fixed location) | 20–30 | 30 minutes | Lowest bar for proof-of-concept |
| Pick-and-place with pose variation | 50–80 | 1–2 hours | Must cover workspace uniformly |
| Multi-step manipulation | 100–150 | 3–4 hours | Each phase needs coverage |
| Bimanual coordination | 150–250 | 5–8 hours | Coordination patterns require more examples |
| Contact-rich assembly (insertion, screwing) | 200–400 | 8–16 hours | Force-sensitive phases need dense coverage |
Data Quality Checklist
- No pauses: Demonstrations should flow continuously. Pauses >0.5s teach the policy to predict "do nothing" chunks.
- Consistent speed: Mixing fast and slow demonstrations wastes CVAE latent capacity on speed variation.
- Clean start/end: Every episode starts from a similar neutral pose and ends clearly.
- Success only: Remove all failed demonstrations. ACT treats all training data as expert demonstrations.
- Camera consistency: Use rigid mounts. Even 5mm of camera shift between sessions degrades grasping accuracy.
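The "no pauses" rule can be checked automatically before training. A hedged sketch (the function and thresholds are ours, not a LeRobot utility): flag any stretch where every joint moves less than a small epsilon for longer than 0.5 seconds:

```python
import numpy as np

def find_pauses(qpos, control_freq=50, min_pause_s=0.5, motion_eps=1e-3):
    """Return (start, end) step indices where all joints are nearly still."""
    speed = np.abs(np.diff(qpos, axis=0)).max(axis=1)  # max joint delta per step
    still = speed < motion_eps
    min_len = int(min_pause_s * control_freq)
    pauses, run_start = [], None
    for i, s in enumerate(still):
        if s and run_start is None:
            run_start = i
        elif not s and run_start is not None:
            if i - run_start >= min_len:
                pauses.append((run_start, i))
            run_start = None
    if run_start is not None and len(still) - run_start >= min_len:
        pauses.append((run_start, len(still)))
    return pauses

# Synthetic episode: motion, a 1-second freeze at 50 Hz, then motion again
t = np.linspace(0.0, 1.0, 100)[:, None]   # (100, 1) smooth ramp
frozen = np.repeat(t[-1:], 50, axis=0)    # 50 steps of stillness
qpos = np.concatenate([t, frozen, t + 1.0])
print(find_pauses(qpos))  # [(99, 150)]
```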
Data Formats: HDF5 and RLDS
The standard format for ACT training data is HDF5, with each episode stored as a group containing one uint8 image dataset per camera (shape T x H x W x 3), joint positions (qpos, float64), joint velocities (qvel, float64), and actions (float64). Timestamps are stored at the episode level. LeRobot additionally supports conversion to Parquet format for efficient streaming from the Hugging Face Hub.
For cross-platform compatibility, RLDS (Reinforcement Learning Datasets) format — used by Google's Open X-Embodiment — stores episodes as TFRecord files with a standardized schema. LeRobot provides conversion utilities between HDF5, RLDS, and its native Parquet format. See our data format guide for detailed conversion instructions.
```python
# HDF5 episode structure for ACT training
import h5py

with h5py.File("episode_0042.hdf5", "r") as f:
    # Camera images: (T, H, W, 3) uint8
    cam_high = f["/observations/images/cam_high"][:]    # overhead camera
    cam_wrist = f["/observations/images/cam_wrist"][:]  # wrist camera

    # Joint state: (T, n_joints) float64
    qpos = f["/observations/qpos"][:]  # joint positions (radians)
    qvel = f["/observations/qvel"][:]  # joint velocities (rad/s)

    # Actions: (T, action_dim) float64
    actions = f["/action"][:]  # target joint positions

    print(f"Episode length: {len(qpos)} steps")
    print(f"Control frequency: {f.attrs['control_freq']} Hz")
    print(f"Camera resolution: {cam_high.shape[1]}x{cam_high.shape[2]}")

# Typical output:
# Episode length: 400 steps
# Control frequency: 50 Hz
# Camera resolution: 480x640
```
4. Training ACT with LeRobot
LeRobot (Hugging Face) is the most actively maintained open-source framework for training ACT policies. It provides standardized data loading, training loops, evaluation scripts, and pre-built configs for common robot platforms including ALOHA, Koch v1.1, and OpenArm.
Installation
```bash
# Clone and install LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[act]"

# Verify GPU is available
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
```
Training Command
```bash
# Full ACT training command
python lerobot/scripts/train.py \
    --policy.name=act \
    --dataset.repo_id=svrc/openarm_pick_place \
    --policy.chunk_size=50 \
    --policy.n_obs_steps=1 \
    --policy.dim_model=512 \
    --policy.n_heads=8 \
    --policy.n_encoder_layers=4 \
    --policy.n_decoder_layers=7 \
    --policy.latent_dim=32 \
    --policy.kl_weight=10.0 \
    --training.learning_rate=1e-5 \
    --training.batch_size=8 \
    --training.num_epochs=5000 \
    --training.eval_freq=500 \
    --training.save_freq=1000 \
    --device=cuda \
    --output_dir=outputs/act_openarm_pick_place/
```
Hyperparameter Reference
| Hyperparameter | Default | Search Range | Effect of Increasing | When to Increase |
|---|---|---|---|---|
| chunk_size (k) | 100 | 20–200 | Smoother trajectories, less reactive | Smooth tasks (wiping, pouring) |
| latent_dim (z) | 32 | 16–64 | More expressive style capture | High demonstration variance |
| kl_weight (beta) | 10 | 1–100 | Stronger regularization | Small datasets (<50 demos) |
| learning_rate | 1e-5 | 5e-6–5e-5 | Faster convergence, instability risk | Large datasets (>200 demos) |
| temporal_ensemble_temp | 10 | 5–50 | Slower blending, more inertia | Smoother tasks |
| n_cameras | 2 | 1–4 | Richer spatial info, more compute | 3D reasoning tasks |
| backbone | ResNet-18 | ResNet-18/34/50 | Better visual features | Cluttered or varying-light scenes |
| num_epochs | 3000 | 1000–8000 | Better fit, overfitting risk | More demos or complex tasks |
GPU Requirements
- Minimum: 16 GB VRAM (RTX 4060 Ti 16 GB). Batch size 4–8. Training time: 4–8 hours for 200 episodes.
- Recommended: 24 GB VRAM (RTX 4090). Batch size 16. Training time: 2–4 hours for 200 episodes.
- Fast iteration: 40–80 GB (A100, H100). Batch size 32–64. Training time: 30–90 minutes. Enables rapid hyperparameter sweeps.
5. Training ACT with ACT+ (Original Repository)
The original ACT implementation from Tony Zhao's repository (tonyzhaozh/act) remains a valid training path, especially for teams already using the ALOHA hardware stack. ACT+ extends the original with proprioceptive history (last 2–5 joint states as input) for better velocity estimation.
```bash
# Training with the original ACT repo
git clone https://github.com/tonyzhaozh/act.git
cd act
pip install -r requirements.txt

# Train on custom data
python train.py \
    --task_name openarm_pick_place \
    --ckpt_dir checkpoints/openarm_pp \
    --policy_class ACT \
    --kl_weight 10 \
    --chunk_size 50 \
    --hidden_dim 512 \
    --batch_size 8 \
    --dim_feedforward 2048 \
    --num_epochs 5000 \
    --lr 1e-5 \
    --seed 0 \
    --num_steps 400 \
    --camera_names cam_high cam_wrist
```
LeRobot vs. original repo: LeRobot is better maintained (active community, regular updates, multi-policy support) and handles data loading and evaluation automatically. The original repo gives more direct control and is simpler to modify for research experiments. For production deployments, we recommend LeRobot; for research forks, the original repo is easier to hack.
6. Deploying on Real Hardware
Deploying an ACT policy requires meeting real-time control constraints: the inference loop must run at the robot's control frequency (typically 50 Hz = 20ms per cycle) while processing camera images and computing action chunks.
Inference Loop Architecture
```python
import torch
import numpy as np
from collections import deque

class ACTInferenceLoop:
    def __init__(self, policy, chunk_size=50, temporal_temp=10.0, query_freq=5):
        self.policy = policy
        self.chunk_size = chunk_size
        self.temporal_temp = temporal_temp
        self.query_freq = query_freq  # re-query every N steps
        self.action_queue = deque(maxlen=chunk_size)
        self.step_count = 0

    def get_action(self, images, qpos):
        """Returns a single action with temporal ensembling."""
        if self.step_count % self.query_freq == 0:
            # Get a new action chunk from the policy
            with torch.no_grad():
                obs = {
                    "images": torch.tensor(images).unsqueeze(0).cuda(),
                    "qpos": torch.tensor(qpos).unsqueeze(0).cuda(),
                }
                chunk = self.policy(obs)  # (1, chunk_size, action_dim)
                chunk = chunk.squeeze(0).cpu().numpy()
            self.action_queue.append({
                "actions": chunk,
                "generated_at": self.step_count,
            })

        # Temporal ensembling: blend overlapping chunks with exponential decay
        action = np.zeros_like(self.action_queue[0]["actions"][0])
        total_weight = 0.0
        for entry in self.action_queue:
            offset = self.step_count - entry["generated_at"]
            if offset < self.chunk_size:
                weight = np.exp(-offset / self.temporal_temp)
                action += weight * entry["actions"][offset]
                total_weight += weight
        action /= total_weight
        self.step_count += 1
        return action
```
Latency Budget at 50 Hz
At 50 Hz control, each cycle has a 20ms budget. A typical breakdown:
- Camera capture + preprocessing: 2–5ms (USB3 cameras, resize to 480x640)
- ACT inference: 2–12ms (GPU-dependent; see benchmark table below)
- Action interpolation + command send: 1–2ms
- Safety checks: 1ms
- Margin: 0–14ms
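Summing the worst-case fixed costs shows how much of the 20ms cycle is actually left for inference (numbers from the breakdown above; variable names are ours):

```python
BUDGET_MS = 1000 / 50  # 20 ms per cycle at 50 Hz

fixed_costs_ms = {
    "camera_capture_preprocess": 5,  # worst case from the breakdown
    "interpolation_and_send": 2,
    "safety_checks": 1,
}
inference_budget = BUDGET_MS - sum(fixed_costs_ms.values())
print(f"Inference must finish in {inference_budget:.0f} ms")  # 12 ms

# From the benchmark table below: a Jetson AGX Orin at 12 ms just fits;
# a Jetson Orin NX at 22 ms does not, hence its 45 Hz maximum rate.
```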
Inference Benchmarks by Platform
| Hardware | ACT Inference (ms) | Max Control Rate (Hz) | Suitable For |
|---|---|---|---|
| NVIDIA A100 (80GB) | 1.8 | 555 | Cloud/datacenter inference |
| NVIDIA RTX 4090 | 2.5 | 400 | Desktop workstation |
| NVIDIA RTX 4070 | 4.2 | 238 | Budget desktop |
| Jetson AGX Orin 64GB | 12 | 83 | Onboard robot compute |
| Jetson Orin NX 16GB | 22 | 45 | Mid-range edge |
| Jetson Orin Nano 8GB | 40 | 25 | Minimum viable edge |
| Apple M3 Pro 18GB | 8 | 125 | Development/prototyping |
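The "Max Control Rate" column is simply the reciprocal of the inference latency, assuming the policy is queried every control step (with query_freq > 1, as in the inference loop above, slower hardware can still serve a faster control loop from buffered chunks):

```python
def max_control_rate_hz(inference_ms):
    # Upper bound when the policy must be queried every control cycle
    return 1000 / inference_ms

for name, ms in [("A100", 1.8), ("RTX 4090", 2.5), ("Jetson Orin Nano", 40)]:
    print(f"{name}: {int(max_control_rate_hz(ms))} Hz")
```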
Temporal Ensemble Implementation Details
The temporal ensemble uses an exponential decay weighting: for a prediction from a chunk generated t steps ago, the weight is w(t) = exp(-t / temperature). With the default temperature of 10 at 50 Hz, predictions from the current chunk are weighted ~1.0 while predictions from a chunk generated 20 steps ago are weighted ~0.14. This produces smooth blending between consecutive chunk predictions, eliminating the jitter that would occur at chunk boundaries with naive sequential execution.
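The quoted weights are easy to reproduce (function name is ours):

```python
import math

def ensemble_weight(steps_ago, temperature=10.0):
    # Exponential decay: newer chunks dominate, older chunks fade out
    return math.exp(-steps_ago / temperature)

print(round(ensemble_weight(0), 2))   # 1.0, the current chunk
print(round(ensemble_weight(20), 2))  # 0.14, a chunk from 20 steps ago
```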
7. OpenArm 101 + ACT: Specific Setup Steps
The OpenArm 101 ($4,500 single arm, $9,000 bimanual leader-follower station) is designed for imitation learning data collection and ACT deployment. Here is the complete setup path:
Hardware Setup
- Mount cameras: Attach the wrist camera (Intel RealSense D405) to the OpenArm wrist bracket. Mount the overhead camera (RealSense D435) on the provided gantry 60cm above the workspace center. Optional: mount a third camera at 45 degrees for improved spatial coverage.
- Connect servos: OpenArm uses Feetech STS3215 servos (same as ALOHA). Connect via the U2D2 serial interface. Verify all 6 joints respond: `python openarm/scan_servos.py`.
- Calibrate joint limits: Run `python openarm/calibrate.py` to record joint zero positions and soft limits. This writes `config/joint_calibration.json`.
- Test teleoperation: With the leader arm connected, run `python openarm/teleop_test.py`. The follower should mirror the leader's motion in real time at 50 Hz.
Data Collection for ACT
```bash
# Record demonstrations on OpenArm
python lerobot/scripts/control_robot.py record \
    --robot-path lerobot/configs/robot/openarm.yaml \
    --fps 50 \
    --repo-id svrc/openarm_pick_place \
    --num-episodes 50 \
    --warmup-time-s 3 \
    --episode-time-s 15 \
    --reset-time-s 10 \
    --push-to-hub 1
```
Deploy ACT on OpenArm
```bash
# Deploy trained policy
python lerobot/scripts/control_robot.py replay \
    --robot-path lerobot/configs/robot/openarm.yaml \
    --policy-path outputs/act_openarm_pick_place/checkpoints/best/ \
    --fps 50 \
    --num-episodes 20
```
8. Common Failures and Fixes
We have trained and deployed ACT policies on 20+ manipulation tasks at SVRC. Here are the failure patterns we see most often and their remedies.
| Symptom | Root Cause | Fix |
|---|---|---|
| Policy outputs zero or constant actions | KL weight (beta) too high. Model ignores observations and outputs the mean action. | Reduce beta by 5–10x. Monitor KL divergence: it should stabilize at 0.5–5.0 nats, not near zero. |
| Jerky execution at chunk boundaries | Temporal ensemble temperature too low, or ensemble not implemented | Increase temperature from 10 to 20–30. Verify overlapping chunk execution in your control loop. |
| Good training loss, poor real-world performance | Camera mismatch between training and deployment | Check image resize pipeline, camera FOV, crop regions. Use identical camera mount positions. |
| Works for 2 seconds then drifts | Chunk size too large for task dynamics | Reduce chunk size by 50%. The policy is committing to outdated plans. |
| Inconsistent between evaluation runs | Too few demonstrations (<30) | Collect more data. With small datasets, performance is highly sensitive to train/val split. |
| KL divergence grows without stabilizing | Beta too low. Encoder packs everything into z; decoder leans on z instead of the observations, which fails at inference when z=0. | Increase beta until KL stabilizes at 0.5–5.0 nats. |
| Mode averaging: robot moves to "average" of multiple strategies | Multi-modal demonstrations with single-mode CVAE | Standardize demonstration strategy, remove minority approaches, or switch to Diffusion Policy. |
| Sim-to-real gap: policy works in sim but fails on real robot | Contact dynamics, friction, and visual domain mismatch | Fine-tune on 50–100 real demonstrations. Apply domain randomization during sim training. |
9. Benchmarks: ACT vs Diffusion Policy vs Behavioral Cloning
The following benchmarks are from published results and our own evaluations at SVRC on standardized manipulation tasks using the ALOHA and OpenArm platforms.
| Dimension | ACT | Diffusion Policy | Behavioral Cloning (MLP) |
|---|---|---|---|
| Success rate (50 demos, pick-place) | 82–90% | 75–85% | 40–60% |
| Success rate (200 demos, pick-place) | 90–95% | 92–97% | 65–80% |
| Success rate (bimanual insertion) | 80–88% | 72–82% | 15–30% |
| Inference latency (RTX 4090) | 2.5ms | 15ms (DDIM) | 0.5ms |
| Inference latency (Jetson Orin Nano) | 40ms (25 Hz) | 180ms (5.5 Hz) | 5ms (200 Hz) |
| Training time (200 episodes, A100) | ~4 hours | ~8 hours | ~1 hour |
| Model size | ~40M params | ~80M params | ~5M params |
| Multi-modal handling | Limited (CVAE) | Excellent (diffusion) | None (mode averaging) |
| Trajectory smoothness | Excellent | Very good | Poor (jerky) |
| Data efficiency (<50 demos) | Good | Moderate | Poor |
| Edge deployment feasibility | Excellent | Challenging | Trivial |
10. When to Use ACT vs Other Methods
Use ACT When:
- You have 50–200 demonstrations and need a fast path to deployment.
- Your task has a relatively deterministic strategy (one clear way to do it).
- You are deploying on edge hardware (Jetson) where inference speed matters.
- You are doing bimanual manipulation (ACT was designed for ALOHA).
- You want the simplest possible training pipeline with minimal hyperparameter tuning.
- You need real-time control at 50 Hz on compute-constrained platforms.
Use Diffusion Policy When:
- Your demonstrations contain multiple valid strategies for the same task (different operators, ambiguous grasps).
- You need sub-millimeter precision on assembly or insertion tasks.
- You have abundant data (200+ demonstrations) and compute budget.
- You can tolerate higher inference latency (cloud inference or desktop GPU).
Use VLAs (OpenVLA, pi0) When:
- You need language-conditioned multi-task execution ("pick the red cup", "stack the blue blocks").
- You want to leverage pre-trained vision-language representations for generalization.
- You are doing mobile manipulation with large state spaces.
- You have access to multi-GPU training infrastructure (4–8 A100s).
Stay with Simple BC When:
- You are prototyping and want the fastest possible training iteration.
- Your task is very short (<20 steps) where compounding error is not a problem.
- You are deploying on extremely constrained hardware (microcontroller-class).