1. What Is Action Chunking?

Action chunking is a technique in imitation learning where the policy predicts a sequence of future actions — a "chunk" — rather than a single next action. The robot executes part of this chunk, then re-queries the policy for a new overlapping chunk. The result is smoother, more temporally consistent trajectories that resist the compounding error problem that plagues single-step behavioral cloning (BC).

In standard BC, the policy predicts one action per control step. Each prediction error shifts the robot into a state the policy has never seen, causing the next prediction to be worse, and so on. Over a 200-step trajectory at 50 Hz (4 seconds of manipulation), even a 1% per-step error compounds to an ~87% probability of drifting off the demonstration distribution. Action chunking reduces this by committing to plans of k steps (typically 50–100 at 50 Hz, corresponding to 1–2 seconds of motion), cutting the number of independent decision points from 200 down to 200/k — as few as 2–4.
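These figures follow from simple independence arithmetic, and the sketch below reproduces them (the 1% per-step error rate is the illustrative assumption used above, not a measured quantity):

```python
def drift_probability(per_step_error: float, decision_points: int) -> float:
    """Probability of at least one off-distribution decision, assuming
    independent errors at each decision point."""
    return 1.0 - (1.0 - per_step_error) ** decision_points

steps = 200  # 4 seconds at 50 Hz

# Single-step BC: every control step is an independent decision point.
print(f"single-step: {drift_probability(0.01, steps):.1%}")  # ~86.6%

# Chunked policy: one decision per k-step chunk.
for k in (50, 100):
    p = drift_probability(0.01, steps // k)
    print(f"k={k}: {steps // k} decision points, drift prob {p:.1%}")
```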

Chunk Size: The Most Important Hyperparameter

Chunk size k determines the trade-off between trajectory smoothness and environmental reactivity. Larger chunks produce smoother motions but cannot react to unexpected events (object shifted, obstacle appeared) until the current chunk finishes executing. Smaller chunks react faster but lose the smoothness advantage and re-introduce compounding error.

Recommended starting points by task type:

  • Tabletop pick-and-place: k=50–100 (1–2 seconds). The environment is static; smoothness is more valuable than reactivity.
  • Assembly and insertion: k=30–60. Contact-rich phases need more frequent re-planning to adjust for alignment errors.
  • Bimanual coordination: k=50–80. Two arms need coordinated chunks; too-short chunks break coordination.
  • Mobile manipulation: k=20–40 for the base, k=50–80 for the arm. The base needs faster obstacle reaction.
  • Dynamic tasks (catching, dodging): k=10–20. Reactivity is critical; accept some smoothness loss.
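The reactivity cost in this list can be made concrete: if the policy is only re-queried when a chunk finishes, the worst-case delay before a new plan can respond to an event is k divided by the control frequency. A quick sketch at 50 Hz:

```python
def worst_case_reaction_s(chunk_size: int, control_hz: float = 50.0) -> float:
    """Worst-case delay before re-planning can react to an event,
    assuming re-query only at chunk boundaries."""
    return chunk_size / control_hz

for task, k in [("pick-and-place", 100), ("assembly", 60),
                ("mobile base", 40), ("dynamic", 20)]:
    print(f"{task:>15}: k={k:>3} -> up to {worst_case_reaction_s(k):.1f} s")
```

In practice re-querying mid-chunk (as in the deployment loop later in this guide) shortens this delay, but the executing chunk still carries the stale plan until the blend shifts toward the new one.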

Why Action Chunking Works: The Temporal Smoothness Argument

A single-step BC policy treats each time step independently. Even if the model perfectly captures the conditional distribution P(a_t | o_t), independence across steps means the trajectory lacks temporal coherence — successive actions can be individually correct but mutually inconsistent (e.g., accelerating and decelerating in alternating steps).

By predicting k steps simultaneously, the model must produce an internally coherent trajectory segment. The training loss penalizes the entire chunk jointly, forcing the network to learn smooth motion primitives rather than isolated frame-by-frame predictions. This is analogous to how language models produce more coherent text when generating multiple tokens at once versus sampling one token at a time.
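A minimal PyTorch sketch of this joint penalty (ACT's reconstruction loss is L1 over the chunk; the shapes match the bimanual example used later in this guide):

```python
import torch
import torch.nn.functional as F

# The network predicts all k actions at once; one scalar loss covers
# the entire chunk, coupling every step's prediction.
batch, k, action_dim = 8, 50, 14
pred_chunk = torch.randn(batch, k, action_dim, requires_grad=True)
target_chunk = torch.randn(batch, k, action_dim)

loss = F.l1_loss(pred_chunk, target_chunk)  # mean over batch, steps, joints
loss.backward()

# Every step in the chunk receives gradient from the same objective,
# which is what couples the predictions into a coherent segment.
assert pred_chunk.grad.shape == (batch, k, action_dim)
```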

2. ACT Architecture

ACT uses a Conditional Variational Autoencoder (CVAE) with a Transformer backbone. The architecture has three main components: a CVAE encoder (used only during training), a Transformer decoder (used during both training and inference), and a vision backbone.

CVAE Encoder (Training Only)

During training, the encoder processes the demonstration's joint states and ground-truth action sequence (the original implementation omits images here for speed) and produces a latent style variable z (dimension 32) that captures demonstration-specific variation. Different operators may approach the same task from different angles or speeds; z encodes this variation.

The encoder is a Transformer that attends over the full action sequence and the current observation. It outputs a mean and log-variance parameterizing a Gaussian, from which z is sampled via the reparameterization trick. A KL divergence term (weighted by beta, typically 10–100) regularizes z toward a standard normal prior N(0, I).
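The sampling and regularization just described take only a few lines; this sketch uses the article's latent dimension of 32 (the beta weighting is applied in the total loss, not shown here):

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps stays differentiable
    with respect to the encoder outputs mu and logvar."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, batch-averaged."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(-1).mean()

mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)  # batch of 4, latent_dim=32
z = sample_latent(mu, logvar)
print(kl_to_standard_normal(mu, logvar).item())  # 0.0: posterior equals the prior
```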

At inference, z is set to zero (the prior mean), making the policy deterministic. The CVAE structure is purely a training aid: without it, the model averages across demonstration styles, producing motions that are the mean of all approaches and therefore none of them (mode averaging). With the CVAE, the model learns to decode each style separately, and the zero-mean z at inference produces the most "typical" execution.

Transformer Decoder (Training + Inference)

The decoder takes three inputs: the latent z (or zero), observation tokens from the vision backbone + proprioception, and k learnable positional queries (one per action in the chunk). Standard Transformer cross-attention lets the positional queries attend to the observation tokens, producing k action predictions in parallel.
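This query/cross-attention pattern can be sketched with stock PyTorch modules. Dimensions follow the implementation details in this section; the class below is illustrative, not the exact ACT repository code:

```python
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    """Illustrative ACT-style decoder: k learnable queries cross-attend to
    observation tokens and emit k actions in parallel."""
    def __init__(self, k=100, d_model=512, n_heads=8, n_layers=7, action_dim=14):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, d_model))  # one query per action
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens):  # (B, n_tokens, d_model)
        q = self.queries.unsqueeze(0).expand(obs_tokens.shape[0], -1, -1)
        return self.head(self.decoder(q, obs_tokens))  # (B, k, action_dim)

dec = ChunkDecoder()
obs = torch.randn(2, 601, 512)  # e.g. 600 visual tokens + 1 proprio/latent token
print(dec(obs).shape)           # torch.Size([2, 100, 14])
```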

Architecture details from the original ACT implementation:

  • Encoder layers: 4 (processes observation + z into context tokens)
  • Decoder layers: 7 (deeper because it does more work: generating the full action chunk)
  • Hidden dimension: 512
  • Attention heads: 8
  • Feedforward dimension: 2,048
  • Total parameters: ~40M (excluding vision backbone)
  • Action output: k joint position targets (e.g., k=100 steps x 14 joints for bimanual ALOHA)
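The ~40M figure is consistent with a rough weight-only count over these layer sizes (attention and feedforward matrices only; biases, norms, embeddings, and output heads ignored):

```python
d, ff = 512, 2048
attn = 4 * d * d              # Q, K, V, and output projections
ffn = 2 * d * ff              # two feedforward matrices
enc_layer = attn + ffn        # self-attention + feedforward
dec_layer = enc_layer + attn  # plus a cross-attention block
total = 4 * enc_layer + 7 * dec_layer
print(f"{total / 1e6:.1f}M")  # 41.9M, in line with the ~40M quoted above
```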

Vision Backbone

Each camera view is processed independently through a ResNet-18, producing a 15x20 feature grid with 512 channels (from a 480x640 input) that is flattened into 300 tokens per camera. With 2–4 cameras, the decoder receives 600–1,200 visual tokens.

The minimum viable camera setup is 2 views: one wrist-mounted camera (close-up manipulation detail) and one overhead camera (spatial context). A three-camera setup (1 wrist + 2 third-person at different angles) is the SVRC standard and consistently improves success rates by 5–15% on precision tasks compared to 2-camera setups.

3. Data Requirements

ACT learns from human teleoperation demonstrations stored as synchronized (observation, action) pairs. Data quality matters more than quantity — 50 clean demonstrations outperform 500 noisy ones.

Minimum Dataset Sizes by Task Complexity

| Task Type | Demonstrations Needed | Approx. Collection Time | Notes |
| --- | --- | --- | --- |
| Simple pick-and-place (fixed location) | 20–30 | 30 minutes | Lowest bar for proof-of-concept |
| Pick-and-place with pose variation | 50–80 | 1–2 hours | Must cover workspace uniformly |
| Multi-step manipulation | 100–150 | 3–4 hours | Each phase needs coverage |
| Bimanual coordination | 150–250 | 5–8 hours | Coordination patterns require more examples |
| Contact-rich assembly (insertion, screwing) | 200–400 | 8–16 hours | Force-sensitive phases need dense coverage |

Data Quality Checklist

  • No pauses: Demonstrations should flow continuously. Pauses >0.5s teach the policy to predict "do nothing" chunks.
  • Consistent speed: Mixing fast and slow demonstrations wastes CVAE latent capacity on speed variation.
  • Clean start/end: Every episode starts from a similar neutral pose and ends clearly.
  • Success only: Remove all failed demonstrations. ACT treats all training data as expert demonstrations.
  • Camera consistency: Use rigid mounts. Even 5mm of camera shift between sessions degrades grasping accuracy.
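The "no pauses" item is easy to audit offline. The sketch below flags static spans longer than 0.5 s from recorded joint velocities; the velocity threshold is an illustrative value, not from this guide:

```python
import numpy as np

def find_pauses(qvel, control_hz=50.0, vel_eps=0.01, min_pause_s=0.5):
    """Return (start_step, end_step) spans where every joint is ~static
    for longer than min_pause_s. Thresholds are illustrative defaults."""
    static = np.all(np.abs(qvel) < vel_eps, axis=1)  # (T,) bool
    min_len = int(min_pause_s * control_hz)
    pauses, start = [], None
    for t, s in enumerate(static):
        if s and start is None:
            start = t
        elif not s and start is not None:
            if t - start >= min_len:
                pauses.append((start, t))
            start = None
    if start is not None and len(static) - start >= min_len:
        pauses.append((start, len(static)))
    return pauses

# 100 moving steps, 40 static steps (0.8 s at 50 Hz), 60 moving steps
qvel = np.concatenate([np.ones((100, 6)), np.zeros((40, 6)), np.ones((60, 6))])
print(find_pauses(qvel))  # [(100, 140)]
```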

Data Formats: HDF5 and RLDS

The standard format for ACT training data is HDF5, with each episode stored as a group containing datasets for images (n_cameras x H x W x 3, uint8), joint positions (qpos, float64), joint velocities (qvel, float64), and actions (float64). Timestamps are stored at the episode level. LeRobot additionally supports conversion to Parquet format for efficient streaming from Hugging Face Hub.

For cross-platform compatibility, RLDS (Reinforcement Learning Datasets) format — used by Google's Open X-Embodiment — stores episodes as TFRecord files with a standardized schema. LeRobot provides conversion utilities between HDF5, RLDS, and its native Parquet format. See our data format guide for detailed conversion instructions.

# HDF5 episode structure for ACT training
import h5py
import numpy as np

with h5py.File("episode_0042.hdf5", "r") as f:
    # Camera images: (T, H, W, 3) uint8
    cam_high = f["/observations/images/cam_high"][:]   # overhead camera
    cam_wrist = f["/observations/images/cam_wrist"][:] # wrist camera

    # Joint state: (T, n_joints) float64
    qpos = f["/observations/qpos"][:]    # joint positions (radians)
    qvel = f["/observations/qvel"][:]    # joint velocities (rad/s)

    # Actions: (T, action_dim) float64
    actions = f["/action"][:]            # target joint positions

    print(f"Episode length: {len(qpos)} steps")
    print(f"Control frequency: {f.attrs['control_freq']} Hz")
    print(f"Camera resolution: {cam_high.shape[1]}x{cam_high.shape[2]}")
    # Typical output:
    # Episode length: 400 steps
    # Control frequency: 50 Hz
    # Camera resolution: 480x640

4. Training ACT with LeRobot

LeRobot (Hugging Face) is the most actively maintained open-source framework for training ACT policies. It provides standardized data loading, training loops, evaluation scripts, and pre-built configs for common robot platforms including ALOHA, Koch v1.1, and OpenArm.

Installation

# Clone and install LeRobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[act]"

# Verify GPU is available
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"

Training Command

# Full ACT training command
python lerobot/scripts/train.py \
  --policy.name=act \
  --dataset.repo_id=svrc/openarm_pick_place \
  --policy.chunk_size=50 \
  --policy.n_obs_steps=1 \
  --policy.dim_model=512 \
  --policy.n_heads=8 \
  --policy.n_encoder_layers=4 \
  --policy.n_decoder_layers=7 \
  --policy.latent_dim=32 \
  --policy.kl_weight=10.0 \
  --training.learning_rate=1e-5 \
  --training.batch_size=8 \
  --training.num_epochs=5000 \
  --training.eval_freq=500 \
  --training.save_freq=1000 \
  --device=cuda \
  --output_dir=outputs/act_openarm_pick_place/

Hyperparameter Reference

| Hyperparameter | Default | Search Range | Effect of Increasing | When to Increase |
| --- | --- | --- | --- | --- |
| chunk_size (k) | 100 | 20–200 | Smoother trajectories, less reactive | Smooth tasks (wiping, pouring) |
| latent_dim (z) | 32 | 16–64 | More expressive style capture | High demonstration variance |
| kl_weight (beta) | 10 | 1–100 | Stronger regularization | Small datasets (<50 demos) |
| learning_rate | 1e-5 | 5e-6–5e-5 | Faster convergence, instability risk | Large datasets (>200 demos) |
| temporal_ensemble_temp | 10 | 5–50 | Slower blending, more inertia | Smoother tasks |
| n_cameras | 2 | 1–4 | Richer spatial info, more compute | 3D reasoning tasks |
| backbone | ResNet-18 | ResNet-18/34/50 | Better visual features | Cluttered or varying-light scenes |
| num_epochs | 3000 | 1000–8000 | Better fit, overfitting risk | More demos or complex tasks |

GPU Requirements

  • Minimum: 16 GB VRAM (RTX 4060 Ti, RTX 3090). Batch size 4–8. Training time: 4–8 hours for 200 episodes.
  • Recommended: 24 GB VRAM (RTX 4090). Batch size 16. Training time: 2–4 hours for 200 episodes.
  • Fast iteration: 40–80 GB (A100, H100). Batch size 32–64. Training time: 30–90 minutes. Enables rapid hyperparameter sweeps.

5. Training ACT with ACT+ (Original Repository)

The original ACT implementation from Tony Zhao's repository (tonyzhaozh/act) remains a valid training path, especially for teams already using the ALOHA hardware stack. ACT+ extends the original with proprioceptive history (last 2–5 joint states as input) for better velocity estimation.

# Training with the original ACT repo
git clone https://github.com/tonyzhaozh/act.git
cd act
pip install -r requirements.txt

# Train on custom data
python train.py \
  --task_name openarm_pick_place \
  --ckpt_dir checkpoints/openarm_pp \
  --policy_class ACT \
  --kl_weight 10 \
  --chunk_size 50 \
  --hidden_dim 512 \
  --batch_size 8 \
  --dim_feedforward 2048 \
  --num_epochs 5000 \
  --lr 1e-5 \
  --seed 0 \
  --num_steps 400 \
  --camera_names cam_high cam_wrist

LeRobot vs. original repo: LeRobot is better maintained (active community, regular updates, multi-policy support) and handles data loading and evaluation automatically. The original repo gives more direct control and is simpler to modify for research experiments. For production deployments, we recommend LeRobot; for research forks, the original repo is easier to hack.

6. Deploying on Real Hardware

Deploying an ACT policy requires meeting real-time control constraints: the inference loop must run at the robot's control frequency (typically 50 Hz = 20ms per cycle) while processing camera images and computing action chunks.

Inference Loop Architecture

import torch
import numpy as np
from collections import deque

class ACTInferenceLoop:
    def __init__(self, policy, chunk_size=50, temporal_temp=10.0, query_freq=5):
        self.policy = policy
        self.chunk_size = chunk_size
        self.temporal_temp = temporal_temp
        self.query_freq = query_freq  # re-query every N steps
        # At most chunk_size // query_freq chunks overlap at any time;
        # older (fully expired) chunks are evicted automatically.
        self.action_queue = deque(maxlen=chunk_size // query_freq)
        self.step_count = 0

    def get_action(self, images, qpos):
        """Returns a single action with temporal ensembling."""
        if self.step_count % self.query_freq == 0:
            # Get new action chunk from policy
            with torch.no_grad():
                obs = {
                    "images": torch.as_tensor(images, dtype=torch.float32).unsqueeze(0).cuda(),
                    "qpos": torch.as_tensor(qpos, dtype=torch.float32).unsqueeze(0).cuda(),
                }
                chunk = self.policy(obs)  # (1, chunk_size, action_dim)
                chunk = chunk.squeeze(0).cpu().numpy()
            self.action_queue.append({
                "actions": chunk,
                "generated_at": self.step_count
            })

        # Temporal ensembling: blend overlapping chunks
        action = np.zeros_like(self.action_queue[0]["actions"][0])
        total_weight = 0.0

        for entry in self.action_queue:
            offset = self.step_count - entry["generated_at"]
            if offset < self.chunk_size:
                weight = np.exp(-offset / self.temporal_temp)
                action += weight * entry["actions"][offset]
                total_weight += weight

        action /= total_weight
        self.step_count += 1
        return action

Latency Budget at 50 Hz

At 50 Hz control, each cycle has a 20ms budget. A typical breakdown:

  • Camera capture + preprocessing: 2–5ms (USB3 cameras, resize to 480x640)
  • ACT inference: 2–12ms (GPU-dependent; see benchmark table below)
  • Action interpolation + command send: 1–2ms
  • Safety checks: 1ms
  • Margin: 0–14ms

Inference Benchmarks by Platform

| Hardware | ACT Inference (ms) | Max Control Rate (Hz) | Suitable For |
| --- | --- | --- | --- |
| NVIDIA A100 (80GB) | 1.8 | 555 | Cloud/datacenter inference |
| NVIDIA RTX 4090 | 2.5 | 400 | Desktop workstation |
| NVIDIA RTX 4070 | 4.2 | 238 | Budget desktop |
| Jetson AGX Orin 64GB | 12 | 83 | Onboard robot compute |
| Jetson Orin NX 16GB | 22 | 45 | Mid-range edge |
| Jetson Orin Nano 8GB | 40 | 25 | Minimum viable edge |
| Apple M3 Pro 18GB | 8 | 125 | Development/prototyping |

Temporal Ensemble Implementation Details

The temporal ensemble uses an exponential decay weighting: for a prediction from a chunk generated t steps ago, the weight is w(t) = exp(-t / temperature). With the default temperature of 10 at 50 Hz, predictions from the current chunk are weighted ~1.0 while predictions from a chunk generated 20 steps ago are weighted ~0.14. This produces smooth blending between consecutive chunk predictions, eliminating the jitter that would occur at chunk boundaries with naive sequential execution.
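The weights quoted above are straightforward to verify:

```python
import math

temperature = 10.0  # default at 50 Hz
for age in (0, 5, 10, 20):
    w = math.exp(-age / temperature)
    print(f"chunk age {age:>2} steps: weight {w:.2f}")
# age 0 -> 1.00, age 20 -> 0.14, matching the values quoted above
```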

Critical deployment tip: Always use temporal ensembling in production. Without it, the robot exhibits visible jitter at chunk boundaries — every k/query_freq steps, the robot suddenly shifts to a new plan. With ensembling, transitions are smooth and imperceptible. The computational overhead is negligible (a few microseconds of weighted averaging).

7. OpenArm 101 + ACT: Specific Setup Steps

The OpenArm 101 ($4,500 single arm, $9,000 bimanual leader-follower station) is designed for imitation learning data collection and ACT deployment. Here is the complete setup path:

Hardware Setup

  1. Mount cameras: Attach the wrist camera (Intel RealSense D405) to the OpenArm wrist bracket. Mount the overhead camera (RealSense D435) on the provided gantry at 60cm above the workspace center. Optional: mount a third camera at 45 degrees for improved spatial coverage.
  2. Connect servos: OpenArm uses Feetech STS3215 servos (same as ALOHA). Connect via the U2D2 serial interface. Verify all 6 joints respond: python openarm/scan_servos.py.
  3. Calibrate joint limits: Run python openarm/calibrate.py to record joint zero positions and soft limits. This writes config/joint_calibration.json.
  4. Test teleoperation: With the leader arm connected, run python openarm/teleop_test.py. The follower should mirror the leader's motion in real time at 50 Hz.

Data Collection for ACT

# Record demonstrations on OpenArm
python lerobot/scripts/control_robot.py record \
  --robot-path lerobot/configs/robot/openarm.yaml \
  --fps 50 \
  --repo-id svrc/openarm_pick_place \
  --num-episodes 50 \
  --warmup-time-s 3 \
  --episode-time-s 15 \
  --reset-time-s 10 \
  --push-to-hub 1

Deploy ACT on OpenArm

# Deploy trained policy
python lerobot/scripts/control_robot.py replay \
  --robot-path lerobot/configs/robot/openarm.yaml \
  --policy-path outputs/act_openarm_pick_place/checkpoints/best/ \
  --fps 50 \
  --num-episodes 20

8. Common Failures and Fixes

We have trained and deployed ACT policies on 20+ manipulation tasks at SVRC. Here are the failure patterns we see most often and their remedies.

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Policy outputs zero or constant actions | KL weight (beta) too high; model ignores observations and outputs the mean action | Reduce beta by 5–10x. Monitor KL divergence: it should stabilize at 0.5–5.0 nats, not near zero |
| Jerky execution at chunk boundaries | Temporal ensemble temperature too low, or ensemble not implemented | Increase temperature from 10 to 20–30. Verify overlapping chunk execution in your control loop |
| Good training loss, poor real-world performance | Camera mismatch between training and deployment | Check image resize pipeline, camera FOV, crop regions. Use identical camera mount positions |
| Works for 2 seconds then drifts | Chunk size too large for task dynamics | Reduce chunk size by 50%. The policy is committing to outdated plans |
| Inconsistent between evaluation runs | Too few demonstrations (<30) | Collect more data. With small datasets, performance is highly sensitive to train/val split |
| KL divergence collapses to zero | Beta too high; the posterior collapses to the prior, so z carries no style information | Reduce beta until KL stabilizes at 0.5–5.0 nats |
| Mode averaging: robot moves to "average" of multiple strategies | Multi-modal demonstrations with single-mode CVAE | Standardize demonstration strategy, remove minority approaches, or switch to Diffusion Policy |
| Sim-to-real gap: policy works in sim but fails on real robot | Contact dynamics, friction, and visual domain mismatch | Fine-tune on 50–100 real demonstrations. Apply domain randomization during sim training |

9. Benchmarks: ACT vs Diffusion Policy vs Behavioral Cloning

The following benchmarks are from published results and our own evaluations at SVRC on standardized manipulation tasks using the ALOHA and OpenArm platforms.

| Dimension | ACT | Diffusion Policy | Behavioral Cloning (MLP) |
| --- | --- | --- | --- |
| Success rate (50 demos, pick-place) | 82–90% | 75–85% | 40–60% |
| Success rate (200 demos, pick-place) | 90–95% | 92–97% | 65–80% |
| Success rate (bimanual insertion) | 80–88% | 72–82% | 15–30% |
| Inference latency (RTX 4090) | 2.5ms | 15ms (DDIM) | 0.5ms |
| Inference latency (Jetson Orin Nano) | 40ms (25 Hz) | 180ms (5.5 Hz) | 5ms (200 Hz) |
| Training time (200 episodes, A100) | ~4 hours | ~8 hours | ~1 hour |
| Model size | ~40M params | ~80M params | ~5M params |
| Multi-modal handling | Limited (CVAE) | Excellent (diffusion) | None (mode averaging) |
| Trajectory smoothness | Excellent | Very good | Poor (jerky) |
| Data efficiency (<50 demos) | Good | Moderate | Poor |
| Edge deployment feasibility | Excellent | Challenging | Trivial |

10. When to Use ACT vs Other Methods

Use ACT When:

  • You have 50–200 demonstrations and need a fast path to deployment.
  • Your task has a relatively deterministic strategy (one clear way to do it).
  • You are deploying on edge hardware (Jetson) where inference speed matters.
  • You are doing bimanual manipulation (ACT was designed for ALOHA).
  • You want the simplest possible training pipeline with minimal hyperparameter tuning.
  • You need real-time control at 50 Hz on compute-constrained platforms.

Use Diffusion Policy When:

  • Your demonstrations contain multiple valid strategies for the same task (different operators, ambiguous grasps).
  • You need sub-millimeter precision on assembly or insertion tasks.
  • You have abundant data (200+ demonstrations) and compute budget.
  • You can tolerate higher inference latency (cloud inference or desktop GPU).

Use VLAs (OpenVLA, pi0) When:

  • You need language-conditioned multi-task execution ("pick the red cup", "stack the blue blocks").
  • You want to leverage pre-trained vision-language representations for generalization.
  • You are doing mobile manipulation with large state spaces.
  • You have access to multi-GPU training infrastructure (4–8 A100s).

Stay with Simple BC When:

  • You are prototyping and want the fastest possible training iteration.
  • Your task is very short (<20 steps) where compounding error is not a problem.
  • You are deploying on extremely constrained hardware (microcontroller-class).