What Is Action Chunking and Why It Matters

Standard behavioral cloning predicts a single next action from the current observation. This works in theory but breaks down in practice: small prediction errors accumulate over time, causing the robot to drift into states never seen during training. This compounding error problem is the fundamental limitation of naive behavioral cloning.

Action chunking solves this by predicting a sequence of future actions at once — typically 50–100 timesteps. Instead of asking "what should the robot do right now?", the model answers "what should the robot do for the next two seconds?" This forces the network to produce coherent, temporally consistent trajectories rather than noisy step-by-step predictions.

The intuition is straightforward: when you reach for a cup, you do not decide each millisecond independently. You plan a smooth reaching motion as a single unit. Action chunking gives the model the same structure — it predicts a complete motion primitive, then executes it. Crucially, overlapping chunks are blended through temporal ensembling, which averages predictions from adjacent chunks to further smooth the trajectory and reduce jitter at chunk boundaries.

ACT (Action Chunking with Transformers), introduced by Tony Zhao et al. at Stanford in 2023, combined this action chunking idea with a CVAE (Conditional Variational Autoencoder) Transformer architecture. The result was one of the most data-efficient imitation learning algorithms available: 50 demonstrations are often sufficient for simple pick-and-place tasks with 80%+ success rates.

Temporal Abstraction: Reducing the Effective Horizon

A typical manipulation task at 50 Hz control requires 250–500 individual action predictions for a 5–10 second task. With standard BC, each prediction compounds error. With a chunk size of 100, the same task requires only 3–5 chunk predictions. The number of sequential decisions shrinks by roughly a factor of the chunk size — nearly two orders of magnitude — making the learning problem dramatically easier.
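
The arithmetic above can be checked directly (a throwaway sketch using the numbers from this paragraph):

```python
# Decision count: single-step BC vs. chunked prediction at 50 Hz
control_hz = 50
task_seconds = 10          # upper end of the 5-10 s range above
chunk_size = 100

total_steps = control_hz * task_seconds        # 500 individual actions
bc_decisions = total_steps                     # one network query per action
act_decisions = -(-total_steps // chunk_size)  # ceil division: one query per chunk
print(bc_decisions, act_decisions)             # 500 5
```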

This is why ACT works with so few demonstrations: the learning problem is simpler. Instead of learning a mapping from observation to a single 7-DOF action (for a typical robot arm), the model learns a mapping from observation to a 100×7 action trajectory. The trajectory structure provides strong self-supervision — each action in the chunk must be consistent with the others.

ACT Architecture Deep-Dive

ACT uses a Conditional Variational Autoencoder (CVAE) with a Transformer encoder-decoder backbone. Understanding this architecture is essential for effective training and debugging.

The CVAE Framework

A CVAE adds a latent variable z to the standard encoder-decoder pipeline. During training, the encoder observes both the current state and the ground truth action sequence, compressing them into a distribution over z. The decoder takes a sample from this distribution plus the current observation and reconstructs the action sequence.

During inference, the encoder is discarded. The decoder receives a sample from the prior (a standard normal distribution) and the current observation, then generates the action chunk. The KL divergence term in the loss ensures the posterior stays close to the prior, so sampling from the prior at test time produces reasonable action sequences.

The CVAE structure serves a critical purpose: it captures inter-demonstration variability. Even for a single task, the exact trajectory varies between demonstrations (slightly different grasp angle, timing, approach path). The latent variable z captures this variation, allowing the decoder to produce diverse but valid trajectories.
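
The train/inference asymmetry can be sketched with a toy model — an illustrative MLP stand-in for ACT's transformer, with made-up names and dimensions, not the real architecture:

```python
# Toy CVAE: encoder sees (obs, actions) at train time; decoder samples z ~ N(0, I) at test time
import torch
import torch.nn as nn

class TinyCVAE(nn.Module):
    def __init__(self, obs_dim=7, act_dim=7, chunk=10, latent=4, hidden=64):
        super().__init__()
        self.chunk, self.act_dim, self.latent = chunk, act_dim, latent
        self.enc = nn.Linear(obs_dim + chunk * act_dim, 2 * latent)  # -> (mu, logvar)
        self.dec = nn.Sequential(
            nn.Linear(obs_dim + latent, hidden), nn.ReLU(),
            nn.Linear(hidden, chunk * act_dim))

    def forward(self, obs, actions=None):
        if actions is None:                          # inference: sample from the prior
            z = torch.randn(obs.shape[0], self.latent)
            mu = logvar = None
        else:                                        # training: reparameterized posterior sample
            stats = self.enc(torch.cat([obs, actions.flatten(1)], dim=-1))
            mu, logvar = stats.chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        out = self.dec(torch.cat([obs, z], dim=-1))
        return out.view(-1, self.chunk, self.act_dim), mu, logvar
```

Calling `model(obs, actions)` exercises the training path (posterior sample); calling `model(obs)` exercises the inference path (prior sample), mirroring how the real CVAE encoder is discarded at deployment.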

Encoder Architecture

The ACT encoder processes multi-modal inputs:

  • Camera images: Passed through a visual backbone (ResNet-18 by default, ViT optional). Multiple camera views (wrist + overhead) are processed independently and concatenated.
  • Joint positions (qpos): The current robot joint angles, typically 6–7 DOF per arm plus gripper state.
  • Action sequence (training only): The ground truth chunk of k future actions, used by the CVAE encoder to compute the posterior distribution.

These inputs are tokenized and fed into a Transformer encoder. The output is a latent representation that is projected to the mean and log-variance of the posterior distribution q(z|o, a). (In the original ACT implementation, the CVAE encoder conditions only on joint positions and the action sequence — images are processed by the policy's transformer encoder on the decoder side, which keeps the CVAE encoder lightweight.)

Decoder Architecture

The decoder is a standard Transformer decoder with learnable action queries. There are k queries (one per timestep in the chunk), each initialized as a learned embedding. The decoder cross-attends to the encoded observation tokens and the sampled latent z, then outputs k action vectors through a linear projection head.

Each predicted action is a vector containing joint position targets (6–7 DOF per arm) and gripper open/close commands. The loss is L1 (mean absolute error) between predicted and ground truth actions, plus a KL divergence term weighted by β.

Loss Function

# ACT loss computation (simplified, PyTorch-style)
import torch.nn.functional as F
l1 = F.l1_loss(predicted_actions, target_actions)  # inputs: [batch, chunk_size, action_dim]
kl = -0.5 * (1 + posterior_logvar - posterior_mean.pow(2) - posterior_logvar.exp()).mean()  # KL(q(z|o,a) || N(0, I))
loss = l1 + beta * kl                              # beta typically 10-20

The β weight on the KL term controls the trade-off between reconstruction accuracy and latent space regularity. Too low: the model ignores the prior and generates inconsistent actions at test time. Too high: the model collapses to mean behavior and loses expressiveness.

Mathematical Intuition: Chunk Size Selection

Chunk size k is the single most important hyperparameter in ACT. It determines how many future actions the model predicts at each step.

The Control Frequency Relationship

At 50 Hz control (the standard for most robot arms), chunk size maps directly to prediction horizon:

Chunk Size | Prediction Horizon | Best For
25         | 0.5 seconds        | Very fast reactive tasks, contact transitions
50         | 1.0 second         | Quick pick-and-place, single-arm tasks under 3 seconds
100        | 2.0 seconds        | Default — most tabletop manipulation tasks
150        | 3.0 seconds        | Slow precision tasks (insertion, assembly)
200        | 4.0 seconds        | Long-horizon smooth motions (pouring, wiping)

The ideal chunk size should cover roughly one complete sub-motion. For a pick-and-place task: the reach-to-grasp motion takes ~1.5 seconds, so chunk_size=75–100 captures the entire approach. If the chunk is too short, the model cannot plan a complete motion. If too long, the model wastes capacity predicting far-future actions that will be overridden by the next chunk anyway.

Temporal Ensembling

During deployment, ACT generates a new chunk at every timestep (or every few timesteps). At each control step, the executed action is a weighted average of overlapping chunk predictions:

# Temporal ensembling (exponential weighting over overlapping chunks)
import numpy as np
from collections import defaultdict

w = 0.01  # default temporal ensembling weight
all_predictions = defaultdict(list)  # timestep -> predictions, oldest chunk first

for t in range(episode_length):
    new_chunk = policy.predict(observation_t)  # shape: [chunk_size, action_dim]
    for i in range(chunk_size):
        all_predictions[t + i].append(new_chunk[i])

    # Exponentially decaying weights: index 0 (the oldest prediction) gets the highest weight
    weights = np.exp(-w * np.arange(len(all_predictions[t])))
    action_t = np.average(all_predictions[t], weights=weights, axis=0)

The temporal ensembling weight w (typically 0.01) sets how fast the weights decay across the buffer. Because the oldest prediction sits at index 0, larger w concentrates weight on older predictions — committing to stale plans and reducing the averaging effect — while smaller w weights all overlapping chunks more evenly, which both smooths the trajectory and incorporates new observations faster.
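
To see how w shapes the blend, here is a small numpy sketch of the normalized weights (the function name is ours, not from the ACT codebase):

```python
# Normalized ensembling weights for n overlapping predictions (oldest first)
import numpy as np

def ensemble_weights(n, w):
    raw = np.exp(-w * np.arange(n))
    return raw / raw.sum()

print(ensemble_weights(4, 0.01).round(3))  # nearly uniform: [0.254 0.251 0.249 0.246]
print(ensemble_weights(4, 1.0).round(3))   # dominated by the oldest prediction
```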

Data Collection Requirements for ACT

ACT is remarkably data-efficient, but data quality is paramount. Garbage in, garbage out applies strongly — 50 clean demonstrations outperform 500 noisy ones.

Minimum Episode Counts

Task Category                     | Episodes Needed | Expected Success Rate | Collection Time
Simple pick-and-place             | 50–100          | 85%+                  | 2–4 hours
Bimanual coordination             | 100–200         | 80%+                  | 4–8 hours
Precision insertion (peg-in-hole) | 200–400         | 70%+                  | 1–2 days
Complex multi-step tasks          | 500+            | 60%+                  | 2–5 days
Dexterous manipulation            | 500–1,000       | 50%+                  | 1–2 weeks

Data Quality Checklist

  • Consistent demonstrations: Use one operator (or operators with very similar styles). ACT assumes a unimodal action distribution — mixed styles confuse the model.
  • Fixed object placement zone: Objects should start in roughly the same area (within ~5 cm variation). ACT does not generalize well to large spatial shifts unless trained on them explicitly.
  • Clean start/end states: Every episode should start from the same home position and end cleanly (gripper closed on object, or object placed at target). Truncated or failed episodes should be excluded.
  • Consistent timing: Try to execute the task at a similar speed across demonstrations. Large timing variation creates ambiguity in the action distribution.
  • Camera stability: Cameras must be rigidly mounted. Any camera shift between episodes degrades performance significantly.

HDF5 Data Format

SVRC delivers all collected data in HDF5 format, which is the standard for ACT training. Each episode is stored as a single HDF5 file with the following structure:

# HDF5 episode structure for ACT training
episode_0.hdf5
├── observations/
│   ├── qpos          # [T, 7] — joint positions (6 DOF + gripper)
│   ├── qvel          # [T, 7] — joint velocities
│   ├── images/
│   │   ├── cam_high  # [T, 480, 640, 3] — overhead camera
│   │   └── cam_wrist # [T, 480, 640, 3] — wrist camera
│   └── effort        # [T, 7] — joint torques (optional)
├── action            # [T, 7] — target joint positions
└── attrs/
    ├── sim           # False (real robot data)
    ├── compress      # True
    └── num_timesteps # T

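Writing and reading an episode in this layout is straightforward with h5py (a sketch with tiny dummy arrays — real episodes use the full [T, 480, 640, 3] image shapes):

```python
# Write one episode in the layout above, then read it back
import h5py
import numpy as np

T, DOF = 5, 7  # tiny episode for illustration
with h5py.File("episode_0.hdf5", "w") as f:
    obs = f.create_group("observations")
    obs.create_dataset("qpos", data=np.zeros((T, DOF), dtype=np.float32))
    obs.create_dataset("qvel", data=np.zeros((T, DOF), dtype=np.float32))
    imgs = obs.create_group("images")
    imgs.create_dataset("cam_high", data=np.zeros((T, 480, 640, 3), dtype=np.uint8),
                        compression="gzip")
    f.create_dataset("action", data=np.zeros((T, DOF), dtype=np.float32))
    f.attrs["sim"] = False          # real-robot data
    f.attrs["num_timesteps"] = T

with h5py.File("episode_0.hdf5", "r") as f:
    print(f["observations/qpos"].shape, f["action"].shape)  # (5, 7) (5, 7)
```
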
For ACT-ready data collection services, see SVRC Data Services. We provide calibrated multi-camera setups, trained operators, quality control, and delivery in HDF5 or LeRobot Parquet format.

Camera Setup Recommendations

ACT performance depends heavily on camera configuration. The original ALOHA setup uses three cameras:

  • Overhead camera (required): Mounted 0.8–1.2 m above the workspace, pointing straight down. Provides a global view of the scene and object positions. Resolution: 640×480 minimum.
  • Wrist camera (strongly recommended): Mounted on the robot's wrist or forearm. Provides close-up view during grasping and insertion. Critical for precision tasks where the overhead camera cannot resolve fine details.
  • Side camera (optional): Provides depth information that overhead and wrist views may miss. Most useful for tasks involving vertical stacking or height-sensitive placement.

Use USB cameras with fixed exposure and white balance (e.g., Logitech C920 or Intel RealSense D405). Auto-exposure causes inconsistent image brightness between episodes, which degrades training.

Training Walkthrough

This section walks through training an ACT policy using both the LeRobot framework and a direct PyTorch implementation.

Training with LeRobot (Recommended)

Hugging Face's LeRobot is the most accessible framework for ACT training in 2026. It handles data loading, augmentation, and evaluation out of the box.

# Step 1: Install LeRobot
pip install lerobot

# Step 2: Convert HDF5 data to LeRobot format
python -m lerobot.scripts.convert_dataset \
  --raw-dir ./my_episodes/ \
  --repo-id my-org/pick-place-task \
  --raw-format hdf5_aloha

# Step 3: Train ACT policy
python -m lerobot.scripts.train \
  policy=act \
  dataset_repo_id=my-org/pick-place-task \
  training.num_epochs=2000 \
  training.batch_size=8 \
  policy.chunk_size=100 \
  policy.n_action_steps=100 \
  policy.dim_model=512

# Step 4: Evaluate
python -m lerobot.scripts.eval \
  --pretrained-policy-name-or-path outputs/train/act_pick-place-task/checkpoints/last/pretrained_model \
  --env-name real_world \
  --n-episodes 20

Training typically converges in 2–8 hours on a single RTX 4090 for 100–200 episodes. Monitor the L1 action loss — it should decrease steadily and plateau around epoch 1500–2000.

Direct PyTorch Training Configuration

For teams who need more control over the training loop or want to integrate ACT into a custom pipeline:

# ACT training configuration — direct PyTorch implementation
import torch
from act.policy import ACTPolicy
from act.train import train_bc

config = {
    # Architecture
    'chunk_size': 100,          # action chunk length (timesteps)
    'hidden_dim': 512,          # transformer hidden dimension
    'dim_feedforward': 3200,    # feedforward network dimension
    'num_encoder_layers': 4,    # transformer encoder layers
    'num_decoder_layers': 7,    # transformer decoder layers
    'nheads': 8,                # attention heads
    'latent_dim': 32,           # CVAE latent dimension

    # Training
    'lr': 1e-5,                 # learning rate (Adam)
    'weight_decay': 1e-4,       # L2 regularization
    'num_epochs': 2000,         # total training epochs
    'batch_size': 8,            # batch size (8 fits on 24 GB VRAM)
    'kl_weight': 10,            # beta for KL divergence term
    'seed': 42,

    # Data
    'camera_names': ['cam_high', 'cam_wrist'],
    'image_size': [480, 640],   # H, W
    'action_dim': 7,            # 6 DOF + gripper
    'state_dim': 7,             # joint positions

    # Augmentation
    'color_jitter': True,       # random brightness/contrast
    'random_crop': True,        # spatial augmentation
}

# Initialize and train
policy = ACTPolicy(config)
optimizer = torch.optim.AdamW(policy.parameters(), lr=config['lr'], weight_decay=config['weight_decay'])
train_bc(policy, optimizer, dataloader, config)

Training Tips

  • Learning rate: Start at 1e-5. ACT is sensitive to learning rate — too high (1e-4) causes training instability, too low (1e-6) converges too slowly.
  • Batch size: Use 8 for 24 GB VRAM (RTX 4090). Reduce to 4 for 12 GB VRAM (RTX 3060). Larger batches (16–32) on A100 improve stability.
  • Epochs: Train for 2000 epochs minimum. ACT often shows sudden improvement around epoch 1000–1500 after a long plateau.
  • Checkpointing: Save every 500 epochs. Evaluate all saved checkpoints — the last checkpoint is not always the best due to overfitting.
  • Action normalization: Normalize actions to [-1, 1] using the dataset statistics. Unnormalized actions with different scales per joint cause training failure.
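
The normalization step above can be sketched as a min-max mapping to [-1, 1] (helper names are ours; statistics must come from the full training dataset):

```python
# Min-max normalization of actions to [-1, 1] using dataset statistics
import numpy as np

def normalize(actions, a_min, a_max, eps=1e-8):
    return 2.0 * (actions - a_min) / (a_max - a_min + eps) - 1.0

def denormalize(norm, a_min, a_max, eps=1e-8):
    return (norm + 1.0) * (a_max - a_min + eps) / 2.0 + a_min

dataset_actions = np.array([[0.0, -1.5], [0.5, 0.0], [1.0, 1.5]])  # toy per-joint data
a_min, a_max = dataset_actions.min(axis=0), dataset_actions.max(axis=0)
normed = normalize(dataset_actions, a_min, a_max)
```

Remember to apply `denormalize` with the same statistics at deployment time, otherwise the policy outputs land in the wrong joint ranges.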

Hyperparameter Tuning Guide

ACT has fewer hyperparameters than Diffusion Policy, but three parameters require careful tuning for each new task.

Chunk Size (chunk_size)

The most impactful parameter. Follow this decision process:

  1. Measure the average duration of a single sub-motion in your task (e.g., reaching to object: 1.5 s).
  2. Multiply by control frequency: 1.5 s × 50 Hz = 75 timesteps.
  3. Add some margin and round up to a convenient value: chunk_size = 100.
  4. If the task has very distinct phases (reach, grasp, retract), the chunk should cover at least one full phase.
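
The decision process above can be wrapped in a small helper (hypothetical function; the rounding multiple is a convention, not a requirement):

```python
# Chunk-size heuristic: sub-motion duration x control frequency, rounded up
import math

def suggest_chunk_size(submotion_seconds, control_hz=50, multiple=25):
    raw = submotion_seconds * control_hz
    return math.ceil(raw / multiple) * multiple  # round up to the nearest multiple

print(suggest_chunk_size(1.5))  # 75 — pad up to 100 if you want extra margin
```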

Signs your chunk size is wrong:

  • Too small: robot hesitates at transitions, jerky motion, gets stuck mid-motion.
  • Too large: robot overpredicts past the end of the task, makes premature motions, wastes training capacity on unnecessary future predictions.

KL Weight (kl_weight / β)

Controls CVAE regularization. The default is 10.

  • β = 0: No CVAE regularization. The posterior can drift arbitrarily far from the prior, so sampling z from N(0, I) at test time produces out-of-distribution latents and inconsistent actions. Use this as a baseline but expect worse performance.
  • β = 10 (default): Good balance for most tasks. The latent space captures demonstration variability without collapsing.
  • β = 20–50: Stronger regularization, produces more consistent (less variable) trajectories. Useful when demonstrations are already very consistent.
  • β > 100: Too strong. The posterior collapses onto the prior, the decoder learns to ignore the latent variable, and the model regresses to average behavior. Avoid.
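
When tuning β, it helps to watch the KL term itself. For a diagonal Gaussian posterior, the KL to the standard normal prior has a closed form (numpy sketch; variable names are illustrative):

```python
# Closed-form KL(q(z) || N(0, I)) for a diagonal Gaussian — the quantity worth monitoring
import numpy as np

def kl_to_standard_normal(mu, logvar):
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

print(kl_to_standard_normal(np.zeros(32), np.zeros(32)))  # 0.0 — posterior equals the prior
```

A KL pinned near zero throughout training is a sign of posterior collapse (β too high); a KL that keeps growing suggests β is too low.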

Number of Queries (num_queries)

In the original ACT, num_queries equals chunk_size (one query per timestep). Some implementations allow decoupling these, using fewer queries with an upsampling layer. Unless you have a specific reason to change this, keep num_queries = chunk_size.

Visual Backbone

ResNet-18 is the default and works well for most tasks. Consider alternatives when:

  • ViT (DINOv2): Better for tasks requiring fine-grained visual understanding (reading labels, precise alignment). Increases training time 2–3x.
  • ResNet-50: Marginal improvement over ResNet-18 for most manipulation tasks, but doubles memory usage. Rarely worth it.
  • EfficientNet: Good compromise for mobile deployment with limited compute.

Hardware Recommendations

ACT has been validated on a range of robot platforms. Here are the best options available through SVRC:

Robot Platform | Price   | Configuration                           | Best ACT Use Case                                         | Lease from SVRC
OpenArm 101    | $4,500  | 6-DOF, leader-follower, open-source     | Single-arm ACT research, entry-level imitation learning   | from $800/mo
DK1 Bimanual   | $12,000 | Dual 6-DOF arms, ALOHA-compatible       | Bimanual ACT (original ALOHA workflow), folding, assembly | from $1,500/mo
Unitree G1     | $16,000 | 23-DOF humanoid, dual arms + locomotion | Whole-body ACT, mobile manipulation, loco-manipulation    | from $2,500/mo
Unitree Go2    | $2,800  | Quadruped, arm attachment option        | Locomotion + arm manipulation, outdoor tasks              | from $800/mo

Compute Requirements

GPU      | VRAM     | Max Batch Size | Training Time (200 eps) | Inference Latency
RTX 3060 | 12 GB    | 4              | 8–16 hours              | ~20 ms
RTX 4090 | 24 GB    | 8              | 2–8 hours               | ~12 ms
A100     | 40/80 GB | 16–32          | 1–3 hours               | ~8 ms
H100     | 80 GB    | 32–64          | <1 hour                 | ~5 ms

ACT is one of the most compute-efficient imitation learning algorithms. A consumer RTX 3060 is sufficient for training and real-time deployment, making it accessible to small labs and individual researchers.

Common Failure Modes and Fixes

ACT training is relatively straightforward, but deployment failures are common. Here are the most frequent issues and their solutions:

1. Robot drifts off trajectory mid-task

Cause: Insufficient demonstrations or inconsistent demonstration quality.

Fix: Collect 50 more demonstrations focusing on the exact region where drifting occurs. Ensure all demonstrations follow the same strategy. Check that camera positions have not shifted.

2. Jerky or oscillating motions

Cause: Temporal ensembling weight too high, or chunk size too small.

Fix: Reduce the temporal ensembling weight (try 0.005 instead of 0.01). Increase chunk_size by 25–50. Check that action normalization statistics are computed correctly.

3. Robot freezes at grasp/release transitions

Cause: Ambiguous gripper commands in the demonstration data. Some demos grasp early, others late, creating a bimodal distribution that ACT averages into "half open."

Fix: Re-collect demonstrations with consistent grasp timing. Alternatively, use a binary gripper command (thresholded) instead of continuous gripper position.
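
The thresholding fix can be as simple as the sketch below (the 0.5 cutoff is an assumption — pick it from your gripper's actual open/close range):

```python
# Threshold a continuous gripper command into a binary open/close signal
import numpy as np

def binarize_gripper(cmd, threshold=0.5):
    return np.where(cmd >= threshold, 1.0, 0.0)  # 1.0 = open, 0.0 = closed

print(binarize_gripper(np.array([0.1, 0.48, 0.7])))  # [0. 0. 1.]
```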

4. Training loss plateaus but performance is poor

Cause: KL weight too high (β > 50), causing posterior collapse. The model learns to ignore the latent variable and predicts average behavior.

Fix: Reduce kl_weight to 10 or lower. Monitor the KL loss separately — it should be non-zero (typically 1–10) during training.

5. Works on training objects but fails on new objects

Cause: ACT does not generalize to new objects by default. It is a single-task algorithm.

Fix: Collect demonstrations with diverse objects (different colors, sizes, shapes). Use image augmentation (color jitter, random crop). For true object generalization, consider switching to a VLA model (see our imitation learning guide).

6. Bimanual coordination breaks down

Cause: Arms are predicted independently, losing coordination timing.

Fix: Ensure both arms' actions are in the same action vector (14-DOF: 7 per arm). Do not train separate policies per arm. Increase chunk_size to cover the full coordinated motion.
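
Concretely, "same action vector" means both arms are concatenated into one prediction target rather than trained as two 7-DOF policies (sketch with dummy values):

```python
# Bimanual action: both arms in a single 14-DOF vector
import numpy as np

left_arm = np.zeros(7)   # 6 joints + gripper
right_arm = np.ones(7)
action = np.concatenate([left_arm, right_arm])  # one coordinated action
print(action.shape)  # (14,)
```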

ACT vs. Diffusion Policy: When to Use Which

Dimension                  | ACT                                                           | Diffusion Policy
Data efficiency            | 50–200 demos                                                  | 100–500 demos
Multi-modal actions        | Limited (CVAE helps but not fully multimodal)                 | Excellent (denoising naturally captures multiple modes)
Inference speed            | 50+ Hz (single forward pass)                                  | 10–30 Hz (iterative denoising)
Trajectory smoothness      | Good (temporal ensembling)                                    | Excellent (inherent to diffusion process)
Training time              | 2–8 hours (RTX 4090)                                          | 4–16 hours (RTX 4090)
Hyperparameter sensitivity | Low (mainly chunk_size and kl_weight)                         | Medium (noise schedule, denoising steps, network architecture)
Language conditioning      | Not supported natively                                        | Not supported natively (extensions exist)
Framework support          | LeRobot, robomimic                                            | LeRobot, robomimic, diffusion_policy repo
Best for                   | Fast prototyping, single-strategy tasks, limited data budgets | Multi-strategy tasks, contact-rich manipulation, diverse operator data

Rule of thumb: Start with ACT. It trains faster, requires less data, and lets you validate your data collection pipeline quickly. Switch to Diffusion Policy only if you observe multi-modal failure (robot averages between two strategies) or need smoother trajectories for contact-rich tasks.

Deployment on Real Robots

Once trained, deploying an ACT policy involves loading the model and running it in a control loop. Here is a minimal deployment script:

# ACT deployment on real robot (pseudocode)
import torch
import numpy as np
from act.policy import ACTPolicy

# Load trained policy
policy = ACTPolicy.load("checkpoints/act_epoch_2000.pt")
policy.eval()
policy.cuda()

# Initialize robot and cameras
robot = RobotInterface(port="/dev/ttyUSB0", hz=50)
cameras = CameraInterface(names=["cam_high", "cam_wrist"])

# Action buffer for temporal ensembling
all_actions = {}
temporal_weight = 0.01

for t in range(max_steps):
    # Get observation
    images = cameras.get_images()   # dict of [H, W, 3] arrays
    qpos = robot.get_joint_positions()  # [7]

    # Predict action chunk
    with torch.no_grad():
        obs = preprocess(images, qpos)   # normalize, resize, to_tensor
        action_chunk = policy(obs)       # [chunk_size, action_dim]
        action_chunk = action_chunk.cpu().numpy()

    # Temporal ensembling: accumulate overlapping predictions per timestep
    for i in range(len(action_chunk)):
        all_actions.setdefault(t + i, []).append(action_chunk[i])

    # Exponentially decaying weights: the oldest prediction (index 0) gets the highest weight
    preds = np.array(all_actions[t])
    weights = np.exp(-temporal_weight * np.arange(len(preds)))
    action = np.average(preds, weights=weights, axis=0)

    # Execute action
    robot.set_joint_positions(action[:6])
    robot.set_gripper(action[6])
    robot.step()

robot.go_home()

Advanced Topics

ACT with VLA Pre-Training

Recent work (2025–2026) combines ACT's action chunking with VLA pre-training. The idea: use a pre-trained vision-language model as the visual encoder, then attach ACT's chunked action decoder. This gives you ACT's data efficiency and inference speed with the generalization capabilities of foundation models. See our imitation learning guide for details on VLA models.

ACT for Dexterous Manipulation

Applying ACT to dexterous hands (like the Orca Hand with 17 DOF) requires adjustments:

  • Increase action_dim to match the hand DOF (e.g., 23 for 6-DOF arm + 17-DOF hand).
  • Increase chunk_size to 150–200 — dexterous manipulation motions are slower and more complex.
  • Use wrist camera as the primary visual input — the overhead camera cannot resolve finger configurations.
  • Collect 500+ demonstrations — the higher-dimensional action space requires more data.

Multi-Task ACT

Standard ACT is single-task. For multi-task operation, two approaches work in practice:

  1. Task-conditioned ACT: Add a task embedding (one-hot or learned) to the encoder input. Train on mixed data from multiple tasks. Works for 3–5 related tasks.
  2. Language-conditioned ACT: Replace the task embedding with a language instruction encoded by a frozen CLIP or SigLIP model. More flexible but requires more data per task. See tools like LeRobot and robomimic for implementations.
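
Approach 1 amounts to appending a task identifier to the policy input. A minimal sketch (helper and dimensions are ours, purely illustrative):

```python
# One-hot task conditioning appended to the proprioceptive state
import numpy as np

def condition_on_task(qpos, task_id, num_tasks=3):
    onehot = np.eye(num_tasks)[task_id]
    return np.concatenate([qpos, onehot])

obs = condition_on_task(np.zeros(7), task_id=1)
print(obs.shape)  # (10,)
```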

Frequently Asked Questions

What is Action Chunking with Transformers (ACT)?

ACT is a robot imitation learning algorithm introduced by Tony Zhao et al. at Stanford in 2023. It predicts a sequence (chunk) of future actions — typically 100 timesteps at 50 Hz, covering 2 seconds of motion — instead of a single next action. This reduces compounding errors and produces smooth, temporally consistent robot trajectories from as few as 50 human demonstrations.

How many demonstrations does ACT need?

50–100 for simple pick-and-place, 100–200 for bimanual coordination, 200–400 for precision insertion, and 500+ for complex multi-step or dexterous tasks. Quality matters more than quantity — use a single skilled operator and maintain consistent demonstration style.

What chunk size should I use?

Start with 100 at 50 Hz (2 seconds). For fast tasks under 3 seconds total, reduce to 50. For slow precision tasks, increase to 150–200. The chunk should cover roughly one complete sub-motion of your task.

Can ACT handle multiple valid strategies?

ACT's CVAE latent variable provides some multi-modality, but it is fundamentally designed for unimodal action distributions. If your task has genuinely different valid strategies (e.g., grasp from left vs. right), ACT will average them, producing a failed middle trajectory. Use Diffusion Policy for multi-modal tasks.

What data format does ACT expect?

HDF5 is the standard format. Each episode contains joint positions (qpos), joint velocities (qvel), camera images, and target actions. SVRC's data collection service delivers data in this format, ready for training. See our data format guide for detailed specs.