1. What Is Action Chunking?

Action chunking is a technique in imitation learning where the policy predicts a sequence of future actions — a "chunk" — rather than a single next action. The robot executes part of this chunk, then re-queries the policy for a new overlapping chunk. The result is smoother, more temporally consistent trajectories that resist the compounding error problem that plagues single-step behavioral cloning (BC).

In standard BC, the policy predicts one action per control step. Each prediction error shifts the robot into a state the policy has never seen, causing the next prediction to be worse, and so on. Over a 200-step trajectory at 50 Hz (4 seconds of manipulation), even a 1% per-step error compounds to an ~87% probability of drifting off the demonstration distribution. Action chunking reduces this by committing to plans of k steps (typically 50–100 at 50 Hz, corresponding to 1–2 seconds of motion), cutting the number of independent decision points from 200 down to 200/k — as few as 2–4.
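These figures follow from simple independence arithmetic, and the sketch below reproduces them (the 1% per-step error rate is the illustrative assumption used above, not a measured quantity):

```python
def drift_probability(per_step_error: float, decision_points: int) -> float:
    """Probability of at least one off-distribution decision, assuming
    independent errors at each decision point."""
    return 1.0 - (1.0 - per_step_error) ** decision_points

steps = 200  # 4 seconds at 50 Hz

# Single-step BC: every control step is an independent decision point.
print(f"single-step: {drift_probability(0.01, steps):.1%}")  # ~86.6%

# Chunked policy: one decision per k-step chunk.
for k in (50, 100):
    p = drift_probability(0.01, steps // k)
    print(f"k={k}: {steps // k} decision points, drift prob {p:.1%}")
```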

Chunk Size: The Most Important Hyperparameter

Chunk size k determines the trade-off between trajectory smoothness and environmental reactivity. Larger chunks produce smoother motions but cannot react to unexpected events (object shifted, obstacle appeared) until the current chunk finishes executing. Smaller chunks react faster but lose the smoothness advantage and re-introduce compounding error.

Recommended starting points by task type:

  • Tabletop pick-and-place: k=50–100 (1–2 seconds). The environment is static; smoothness is more valuable than reactivity.
  • Assembly and insertion: k=30–60. Contact-rich phases need more frequent re-planning to adjust for alignment errors.
  • Bimanual coordination: k=50–80. Two arms need coordinated chunks; too-short chunks break coordination.
  • Mobile manipulation: k=20–40 for the base, k=50–80 for the arm. The base needs faster obstacle reaction.
  • Dynamic tasks (catching, dodging): k=10–20. Reactivity is critical; accept some smoothness loss.
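The reactivity cost in this list can be made concrete: if the policy is only re-queried when a chunk finishes, the worst-case delay before a new plan can respond to an event is k divided by the control frequency. A quick sketch at 50 Hz:

```python
def worst_case_reaction_s(chunk_size: int, control_hz: float = 50.0) -> float:
    """Worst-case delay before re-planning can react to an event,
    assuming re-query only at chunk boundaries."""
    return chunk_size / control_hz

for task, k in [("pick-and-place", 100), ("assembly", 60),
                ("mobile base", 40), ("dynamic", 20)]:
    print(f"{task:>15}: k={k:>3} -> up to {worst_case_reaction_s(k):.1f} s")
```

In practice re-querying mid-chunk (as in the deployment loop later in this guide) shortens this delay, but the executing chunk still carries the stale plan until the blend shifts toward the new one.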

Why Action Chunking Works: The Temporal Smoothness Argument

A single-step BC policy treats each time step independently. Even if the model perfectly captures the conditional distribution P(a_t | o_t), independence across steps means the trajectory lacks temporal coherence — successive actions can be individually correct but mutually inconsistent (e.g., accelerating and decelerating in alternating steps).

By predicting k steps simultaneously, the model must produce an internally coherent trajectory segment. The training loss penalizes the entire chunk jointly, forcing the network to learn smooth motion primitives rather than isolated frame-by-frame predictions. This is analogous to how language models produce more coherent text when generating multiple tokens at once versus sampling one token at a time.
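A minimal PyTorch sketch of this joint penalty (ACT's reconstruction loss is L1 over the chunk; the shapes match the bimanual example used later in this guide):

```python
import torch
import torch.nn.functional as F

# The network predicts all k actions at once; one scalar loss covers
# the entire chunk, coupling every step's prediction.
batch, k, action_dim = 8, 50, 14
pred_chunk = torch.randn(batch, k, action_dim, requires_grad=True)
target_chunk = torch.randn(batch, k, action_dim)

loss = F.l1_loss(pred_chunk, target_chunk)  # mean over batch, steps, joints
loss.backward()

# Every step in the chunk receives gradient from the same objective,
# which is what couples the predictions into a coherent segment.
assert pred_chunk.grad.shape == (batch, k, action_dim)
```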

2. ACT Architecture

ACT uses a Conditional Variational Autoencoder (CVAE) with a Transformer backbone. The architecture has three main components: a CVAE encoder (used only during training), a Transformer decoder (used during both training and inference), and a vision backbone.

CVAE Encoder (Training Only)

During training, the encoder processes the demonstration's joint states and ground-truth action sequence (the original implementation omits images here for speed) and produces a latent style variable z (dimension 32) that captures demonstration-specific variation. Different operators may approach the same task from different angles or speeds; z encodes this variation.

The encoder is a Transformer that attends over the full action sequence and the current observation. It outputs a mean and log-variance parameterizing a Gaussian, from which z is sampled via the reparameterization trick. A KL divergence term (weighted by beta, typically 10–100) regularizes z toward a standard normal prior N(0, I).
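The sampling and regularization just described take only a few lines; this sketch uses the article's latent dimension of 32 (the beta weighting is applied in the total loss, not shown here):

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps stays differentiable
    with respect to the encoder outputs mu and logvar."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, batch-averaged."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(-1).mean()

mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)  # batch of 4, latent_dim=32
z = sample_latent(mu, logvar)
print(kl_to_standard_normal(mu, logvar).item())  # 0.0: posterior equals the prior
```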

At inference, z is set to zero (the prior mean), making the policy deterministic. The CVAE structure is purely a training aid: without it, the model averages across demonstration styles, producing motions that are the mean of all approaches and therefore none of them (mode averaging). With the CVAE, the model learns to decode each style separately, and the zero-mean z at inference produces the most "typical" execution.

Transformer Decoder (Training + Inference)

The decoder takes three inputs: the latent z (or zero), observation tokens from the vision backbone + proprioception, and k learnable positional queries (one per action in the chunk). Standard Transformer cross-attention lets the positional queries attend to the observation tokens, producing k action predictions in parallel.
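This query/cross-attention pattern can be sketched with stock PyTorch modules. Dimensions follow the implementation details in this section; the class below is illustrative, not the exact ACT repository code:

```python
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    """Illustrative ACT-style decoder: k learnable queries cross-attend to
    observation tokens and emit k actions in parallel."""
    def __init__(self, k=100, d_model=512, n_heads=8, n_layers=7, action_dim=14):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, d_model))  # one query per action
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens):  # (B, n_tokens, d_model)
        q = self.queries.unsqueeze(0).expand(obs_tokens.shape[0], -1, -1)
        return self.head(self.decoder(q, obs_tokens))  # (B, k, action_dim)

dec = ChunkDecoder()
obs = torch.randn(2, 601, 512)  # e.g. 600 visual tokens + 1 proprio/latent token
print(dec(obs).shape)           # torch.Size([2, 100, 14])
```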

Architecture details from the original ACT implementation:

  • Encoder layers: 4 (processes observation + z into context tokens)
  • Decoder layers: 7 (deeper because it does more work: generating the full action chunk)
  • Hidden dimension: 512
  • Attention heads: 8
  • Feedforward dimension: 2,048
  • Total parameters: ~40M (excluding vision backbone)
  • Action output: k joint position targets (e.g., k=100 steps x 14 joints for bimanual ALOHA)
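The ~40M figure is consistent with a rough weight-only count over these layer sizes (attention and feedforward matrices only; biases, norms, embeddings, and output heads ignored):

```python
d, ff = 512, 2048
attn = 4 * d * d              # Q, K, V, and output projections
ffn = 2 * d * ff              # two feedforward matrices
enc_layer = attn + ffn        # self-attention + feedforward
dec_layer = enc_layer + attn  # plus a cross-attention block
total = 4 * enc_layer + 7 * dec_layer
print(f"{total / 1e6:.1f}M")  # 41.9M, in line with the ~40M quoted above
```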

Vision Backbone

Each camera view is processed independently through a ResNet-18, producing a 15x20 feature grid with 512 channels (from a 480x640 input) that is flattened into 300 tokens per camera. With 2–4 cameras, the decoder receives 600–1,200 visual tokens.

The minimum viable camera setup is 2 views: one wrist-mounted camera (close-up manipulation detail) and one overhead camera (spatial context). A three-camera setup (1 wrist + 2 third-person at different angles) is the SVRC standard and consistently improves success rates by 5–15% on precision tasks compared to 2-camera setups.

3. Data Requirements

ACT learns from human teleoperation demonstrations stored as synchronized (observation, action) pairs. Data quality matters more than quantity — 50 clean demonstrations outperform 500 noisy ones.

Minimum Dataset Sizes by Task Complexity

| Task Type | Demonstrations Needed | Approx. Collection Time | Notes |
| --- | --- | --- | --- |
| Simple pick-and-place (fixed location) | 20–30 | 30 minutes | Lowest bar for proof-of-concept |
| Pick-and-place with pose variation | 50–80 | 1–2 hours | Must cover workspace uniformly |
| Multi-step manipulation | 100–150 | 3–4 hours | Each phase needs coverage |
| Bimanual coordination | 150–250 | 5–8 hours | Coordination patterns require more examples |
| Contact-rich assembly (insertion, screwing) | 200–400 | 8–16 hours | Force-sensitive phases need dense coverage |

Data Quality Checklist

  • No pauses: Demonstrations should flow continuously. Pauses >0.5s teach the policy to predict "do nothing" chunks.
  • Consistent speed: Mixing fast and slow demonstrations wastes CVAE latent capacity on speed variation.
  • Clean start/end: Every episode starts from a similar neutral pose and ends clearly.
  • Success only: Remove all failed demonstrations. ACT treats all training data as expert demonstrations.
  • Camera consistency: Use rigid mounts. Even 5mm of camera shift between sessions degrades grasping accuracy.
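The "no pauses" item is easy to audit offline. The sketch below flags static spans longer than 0.5 s from recorded joint velocities; the velocity threshold is an illustrative value, not from this guide:

```python
import numpy as np

def find_pauses(qvel, control_hz=50.0, vel_eps=0.01, min_pause_s=0.5):
    """Return (start_step, end_step) spans where every joint is ~static
    for longer than min_pause_s. Thresholds are illustrative defaults."""
    static = np.all(np.abs(qvel) < vel_eps, axis=1)  # (T,) bool
    min_len = int(min_pause_s * control_hz)
    pauses, start = [], None
    for t, s in enumerate(static):
        if s and start is None:
            start = t
        elif not s and start is not None:
            if t - start >= min_len:
                pauses.append((start, t))
            start = None
    if start is not None and len(static) - start >= min_len:
        pauses.append((start, len(static)))
    return pauses

# 100 moving steps, 40 static steps (0.8 s at 50 Hz), 60 moving steps
qvel = np.concatenate([np.ones((100, 6)), np.zeros((40, 6)), np.ones((60, 6))])
print(find_pauses(qvel))  # [(100, 140)]
```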

Data Formats: HDF5 and RLDS

The standard format for ACT training data is HDF5, with each episode stored as a group containing datasets for images (n_cameras x H x W x 3, uint8), joint positions (qpos, float64), joint velocities (qvel, float64), and actions (float64). Timestamps are stored at the episode level. LeRobot additionally supports conversion to Parquet format for efficient streaming from Hugging Face Hub.

For cross-platform compatibility, RLDS (Reinforcement Learning Datasets) format — used by Google's Open X-Embodiment — stores episodes as TFRecord files with a standardized schema. LeRobot provides conversion utilities between HDF5, RLDS, and its native Parquet format. See our data format guide for detailed conversion instructions.

# HDF5 episode structure for ACT training
import h5py
import numpy as np

with h5py.File("episode_0042.hdf5", "r") as f:
    # Camera images: (T, H, W, 3) uint8
    cam_high = f["/observations/images/cam_high"][:]   # overhead camera
    cam_wrist = f["/observations/images/cam_wrist"][:] # wrist camera

    # Joint state: (T, n_joints) float64
    qpos = f["/observations/qpos"][:]    # joint positions (radians)
    qvel = f["/observations/qvel"][:]    # joint velocities (rad/s)

    # Actions: (T, action_dim) float64
    actions = f["/action"][:]            # target joint positions

    print(f"Episode length: {len(qpos)} steps")
    print(f"Control frequency: {f.attrs['control_freq']} Hz")
    print(f"Camera resolution: {cam_high.shape[1]}x{cam_high.shape[2]}")
    # Typical output:
    # Episode length: 400 steps
    # Control frequency: 50 Hz
    # Camera resolution: 480x640

4. Training ACT with LeRobot

LeRobot (Hugging Face) is the most actively maintained open-source framework for training ACT policies. It provides standardized data loading, training loops, evaluation scripts, and pre-built configs for common robot platforms including ALOHA, Koch v1.1, and OpenArm.

Installation

# Clone and install LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[act]"

# Verify GPU is available
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"

Training Command

# Full ACT training command
python lerobot/scripts/train.py \
  --policy.name=act \
  --dataset.repo_id=svrc/openarm_pick_place \
  --policy.chunk_size=50 \
  --policy.n_obs_steps=1 \
  --policy.dim_model=512 \
  --policy.n_heads=8 \
  --policy.n_encoder_layers=4 \
  --policy.n_decoder_layers=7 \
  --policy.latent_dim=32 \
  --policy.kl_weight=10.0 \
  --training.learning_rate=1e-5 \
  --training.batch_size=8 \
  --training.num_epochs=5000 \
  --training.eval_freq=500 \
  --training.save_freq=1000 \
  --device=cuda \
  --output_dir=outputs/act_openarm_pick_place/

Hyperparameter Reference

| Hyperparameter | Default | Search Range | Effect of Increasing | When to Increase |
| --- | --- | --- | --- | --- |
| chunk_size (k) | 100 | 20–200 | Smoother trajectories, less reactive | Smooth tasks (wiping, pouring) |
| latent_dim (z) | 32 | 16–64 | More expressive style capture | High demonstration variance |
| kl_weight (beta) | 10 | 1–100 | Stronger regularization | Small datasets (<50 demos) |
| learning_rate | 1e-5 | 5e-6–5e-5 | Faster convergence, instability risk | Large datasets (>200 demos) |
| temporal_ensemble_temp | 10 | 5–50 | Slower blending, more inertia | Smoother tasks |
| n_cameras | 2 | 1–4 | Richer spatial info, more compute | 3D reasoning tasks |
| backbone | ResNet-18 | ResNet-18/34/50 | Better visual features | Cluttered or varying-light scenes |
| num_epochs | 3000 | 1000–8000 | Better fit, overfitting risk | More demos or complex tasks |

GPU Requirements

  • Minimum: 16 GB VRAM (RTX 4060 Ti, RTX 3090). Batch size 4–8. Training time: 4–8 hours for 200 episodes.
  • Recommended: 24 GB VRAM (RTX 4090). Batch size 16. Training time: 2–4 hours for 200 episodes.
  • Fast iteration: 40–80 GB (A100, H100). Batch size 32–64. Training time: 30–90 minutes. Enables rapid hyperparameter sweeps.

5. Training ACT with ACT+ (Original Repository)

The original ACT implementation from Tony Zhao's repository (tonyzhaozh/act) remains a valid training path, especially for teams already using the ALOHA hardware stack. ACT+ extends the original with proprioceptive history (last 2–5 joint states as input) for better velocity estimation.

# Training with the original ACT repo
git clone https://github.com/tonyzhaozh/act.git
cd act
pip install -r requirements.txt

# Train on custom data
python train.py \
  --task_name openarm_pick_place \
  --ckpt_dir checkpoints/openarm_pp \
  --policy_class ACT \
  --kl_weight 10 \
  --chunk_size 50 \
  --hidden_dim 512 \
  --batch_size 8 \
  --dim_feedforward 2048 \
  --num_epochs 5000 \
  --lr 1e-5 \
  --seed 0 \
  --num_steps 400 \
  --camera_names cam_high cam_wrist

LeRobot vs. original repo: LeRobot is better maintained (active community, regular updates, multi-policy support) and handles data loading and evaluation automatically. The original repo gives more direct control and is simpler to modify for research experiments. For production deployments, we recommend LeRobot; for research forks, the original repo is easier to hack.

6. Deploying on Real Hardware

Deploying an ACT policy requires meeting real-time control constraints: the inference loop must run at the robot's control frequency (typically 50 Hz = 20ms per cycle) while processing camera images and computing action chunks.

Inference Loop Architecture

import torch
import numpy as np
from collections import deque

class ACTInferenceLoop:
    def __init__(self, policy, chunk_size=50, temporal_temp=10.0, query_freq=5):
        self.policy = policy
        self.chunk_size = chunk_size
        self.temporal_temp = temporal_temp
        self.query_freq = query_freq  # re-query every N steps
        # At most chunk_size // query_freq chunks overlap at any time;
        # older (fully expired) chunks are evicted automatically.
        self.action_queue = deque(maxlen=chunk_size // query_freq)
        self.step_count = 0

    def get_action(self, images, qpos):
        """Returns a single action with temporal ensembling."""
        if self.step_count % self.query_freq == 0:
            # Get new action chunk from policy
            with torch.no_grad():
                obs = {
                    "images": torch.as_tensor(images, dtype=torch.float32).unsqueeze(0).cuda(),
                    "qpos": torch.as_tensor(qpos, dtype=torch.float32).unsqueeze(0).cuda(),
                }
                chunk = self.policy(obs)  # (1, chunk_size, action_dim)
                chunk = chunk.squeeze(0).cpu().numpy()
            self.action_queue.append({
                "actions": chunk,
                "generated_at": self.step_count
            })

        # Temporal ensembling: blend overlapping chunks
        action = np.zeros_like(self.action_queue[0]["actions"][0])
        total_weight = 0.0

        for entry in self.action_queue:
            offset = self.step_count - entry["generated_at"]
            if offset < self.chunk_size:
                weight = np.exp(-offset / self.temporal_temp)
                action += weight * entry["actions"][offset]
                total_weight += weight

        action /= total_weight
        self.step_count += 1
        return action

Latency Budget at 50 Hz

At 50 Hz control, each cycle has a 20ms budget. A typical breakdown:

  • Camera capture + preprocessing: 2–5ms (USB3 cameras, resize to 480x640)
  • ACT inference: 2–12ms (GPU-dependent; see benchmark table below)
  • Action interpolation + command send: 1–2ms
  • Safety checks: 1ms
  • Margin: 0–14ms

Inference Benchmarks by Platform

| Hardware | ACT Inference (ms) | Max Control Rate (Hz) | Suitable For |
| --- | --- | --- | --- |
| NVIDIA A100 (80GB) | 1.8 | 555 | Cloud/datacenter inference |
| NVIDIA RTX 4090 | 2.5 | 400 | Desktop workstation |
| NVIDIA RTX 4070 | 4.2 | 238 | Budget desktop |
| Jetson AGX Orin 64GB | 12 | 83 | Onboard robot compute |
| Jetson Orin NX 16GB | 22 | 45 | Mid-range edge |
| Jetson Orin Nano 8GB | 40 | 25 | Minimum viable edge |
| Apple M3 Pro 18GB | 8 | 125 | Development/prototyping |

Temporal Ensemble Implementation Details

The temporal ensemble uses an exponential decay weighting: for a prediction from a chunk generated t steps ago, the weight is w(t) = exp(-t / temperature). With the default temperature of 10 at 50 Hz, predictions from the current chunk are weighted ~1.0 while predictions from a chunk generated 20 steps ago are weighted ~0.14. This produces smooth blending between consecutive chunk predictions, eliminating the jitter that would occur at chunk boundaries with naive sequential execution.
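The weights quoted above are straightforward to verify:

```python
import math

temperature = 10.0  # default at 50 Hz
for age in (0, 5, 10, 20):
    w = math.exp(-age / temperature)
    print(f"chunk age {age:>2} steps: weight {w:.2f}")
# age 0 -> 1.00, age 20 -> 0.14, matching the values quoted above
```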

Critical deployment tip: Always use temporal ensembling in production. Without it, the robot exhibits visible jitter at chunk boundaries — every k/query_freq steps, the robot suddenly shifts to a new plan. With ensembling, transitions are smooth and imperceptible. The computational overhead is negligible (a few microseconds of weighted averaging).

7. OpenArm 101 + ACT: Specific Setup Steps

The OpenArm 101 ($4,500 single arm, $9,000 bimanual leader-follower station) is designed for imitation learning data collection and ACT deployment. Here is the complete setup path:

Hardware Setup

  1. Mount cameras: Attach the wrist camera (Intel RealSense D405) to the OpenArm wrist bracket. Mount the overhead camera (RealSense D435) on the provided gantry at 60cm above the workspace center. Optional: mount a third camera at 45 degrees for improved spatial coverage.
  2. Connect servos: OpenArm uses Feetech STS3215 servos (same as ALOHA). Connect via the U2D2 serial interface. Verify all 6 joints respond: python openarm/scan_servos.py.
  3. Calibrate joint limits: Run python openarm/calibrate.py to record joint zero positions and soft limits. This writes config/joint_calibration.json.
  4. Test teleoperation: With the leader arm connected, run python openarm/teleop_test.py. The follower should mirror the leader's motion in real time at 50 Hz.

Data Collection for ACT

# Record demonstrations on OpenArm
python lerobot/scripts/control_robot.py record \
  --robot-path lerobot/configs/robot/openarm.yaml \
  --fps 50 \
  --repo-id svrc/openarm_pick_place \
  --num-episodes 50 \
  --warmup-time-s 3 \
  --episode-time-s 15 \
  --reset-time-s 10 \
  --push-to-hub 1

Deploy ACT on OpenArm

# Deploy trained policy
python lerobot/scripts/control_robot.py replay \
  --robot-path lerobot/configs/robot/openarm.yaml \
  --policy-path outputs/act_openarm_pick_place/checkpoints/best/ \
  --fps 50 \
  --num-episodes 20

8. Common Failures and Fixes

We have trained and deployed ACT policies on 20+ manipulation tasks at SVRC. Here are the failure patterns we see most often and their remedies.

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Policy outputs zero or constant actions | KL weight (beta) too high; model ignores observations and outputs the mean action | Reduce beta by 5–10x. Monitor KL divergence: it should stabilize at 0.5–5.0 nats, not near zero |
| Jerky execution at chunk boundaries | Temporal ensemble temperature too low, or ensemble not implemented | Increase temperature from 10 to 20–30. Verify overlapping chunk execution in your control loop |
| Good training loss, poor real-world performance | Camera mismatch between training and deployment | Check image resize pipeline, camera FOV, crop regions. Use identical camera mount positions |
| Works for 2 seconds then drifts | Chunk size too large for task dynamics | Reduce chunk size by 50%. The policy is committing to outdated plans |
| Inconsistent between evaluation runs | Too few demonstrations (<30) | Collect more data. With small datasets, performance is highly sensitive to train/val split |
| KL divergence collapses to zero | Beta too high; the posterior collapses to the prior, so z carries no style information | Reduce beta until KL stabilizes at 0.5–5.0 nats |
| Mode averaging: robot moves to "average" of multiple strategies | Multi-modal demonstrations with single-mode CVAE | Standardize demonstration strategy, remove minority approaches, or switch to Diffusion Policy |
| Sim-to-real gap: policy works in sim but fails on real robot | Contact dynamics, friction, and visual domain mismatch | Fine-tune on 50–100 real demonstrations. Apply domain randomization during sim training |

9. Benchmarks: ACT vs Diffusion Policy vs Behavioral Cloning

The following benchmarks are from published results and our own evaluations at SVRC on standardized manipulation tasks using the ALOHA and OpenArm platforms.

| Dimension | ACT | Diffusion Policy | Behavioral Cloning (MLP) |
| --- | --- | --- | --- |
| Success rate (50 demos, pick-place) | 82–90% | 75–85% | 40–60% |
| Success rate (200 demos, pick-place) | 90–95% | 92–97% | 65–80% |
| Success rate (bimanual insertion) | 80–88% | 72–82% | 15–30% |
| Inference latency (RTX 4090) | 2.5ms | 15ms (DDIM) | 0.5ms |
| Inference latency (Jetson Orin Nano) | 40ms (25 Hz) | 180ms (5.5 Hz) | 5ms (200 Hz) |
| Training time (200 episodes, A100) | ~4 hours | ~8 hours | ~1 hour |
| Model size | ~40M params | ~80M params | ~5M params |
| Multi-modal handling | Limited (CVAE) | Excellent (diffusion) | None (mode averaging) |
| Trajectory smoothness | Excellent | Very good | Poor (jerky) |
| Data efficiency (<50 demos) | Good | Moderate | Poor |
| Edge deployment feasibility | Excellent | Challenging | Trivial |

10. When to Use ACT vs Other Methods

Use ACT When:

  • You have 50–200 demonstrations and need a fast path to deployment.
  • Your task has a relatively deterministic strategy (one clear way to do it).
  • You are deploying on edge hardware (Jetson) where inference speed matters.
  • You are doing bimanual manipulation (ACT was designed for ALOHA).
  • You want the simplest possible training pipeline with minimal hyperparameter tuning.
  • You need real-time control at 50 Hz on compute-constrained platforms.

Use Diffusion Policy When:

  • Your demonstrations contain multiple valid strategies for the same task (different operators, ambiguous grasps).
  • You need sub-millimeter precision on assembly or insertion tasks.
  • You have abundant data (200+ demonstrations) and compute budget.
  • You can tolerate higher inference latency (cloud inference or desktop GPU).

Use VLAs (OpenVLA, pi0) When:

  • You need language-conditioned multi-task execution ("pick the red cup", "stack the blue blocks").
  • You want to leverage pre-trained vision-language representations for generalization.
  • You are doing mobile manipulation with large state spaces.
  • You have access to multi-GPU training infrastructure (4–8 A100s).

Stay with Simple BC When:

  • You are prototyping and want the fastest possible training iteration.
  • Your task is very short (<20 steps) where compounding error is not a problem.
  • You are deploying on extremely constrained hardware (microcontroller-class).