ACT vs Diffusion Policy: Which Robot Learning Algorithm Should You Use? (2025)
A deep-dive comparison for researchers and ML engineers choosing between Action Chunking with Transformers and Diffusion Policy for imitation learning.
Both algorithms learn from demonstrations — the key difference is how they represent and generate actions
ACT (Action Chunking with Transformers) and Diffusion Policy are the two most widely used imitation learning algorithms in robot manipulation research as of 2025. Both have shipped in production research labs and on real hardware. Choosing between them is not a matter of one being universally better — it depends on your task structure, data budget, inference constraints, and how much implementation complexity you can absorb. This article lays out the differences precisely enough to make that call.
What Is ACT (Action Chunking with Transformers)?
ACT was introduced by Tony Zhao et al. at Stanford in the paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (RSS 2023). The key insight is action chunking: instead of predicting one action at a time, the policy predicts a short sequence — a chunk — of future actions (typically 50–100 timesteps at 50 Hz). This reduces the effective decision frequency and mitigates compounding errors from single-step prediction.
ACT uses a CVAE (Conditional Variational Autoencoder) architecture. During training, an encoder maps the full demonstrated action sequence into a style latent z. A transformer decoder then predicts the action chunk conditioned on current image observations and z. At inference, z is sampled from the prior (a unit Gaussian). Temporal ensemble — averaging overlapping chunk predictions — further smooths execution.
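The training objective described above (L1 reconstruction on the chunk plus KL regularization on the style latent) can be sketched in a few lines of NumPy. This is a minimal sketch, not the reference implementation; the shapes and the `kl_weight` default of 10.0 follow the values quoted later in this article:

```python
import numpy as np

def act_loss(pred_chunk, target_chunk, mu, logvar, kl_weight=10.0):
    """ACT objective: L1 regression on the action chunk plus a KL term
    pulling the CVAE style latent q(z|...) = N(mu, exp(logvar)) toward
    a unit Gaussian prior. Shapes (illustrative, not the official API):
    chunks are (k, action_dim); mu and logvar are (latent_dim,)."""
    l1 = np.abs(pred_chunk - target_chunk).mean()
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return l1 + kl_weight * kl
```

At inference the encoder is dropped and z is set to the prior mean, which is why the KL term matters: it keeps the decoder usable without the encoder.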
ACT Architecture at a Glance
- Backbone: ResNet-18 or ResNet-50 for image encoding; standard transformer encoder-decoder
- Action representation: Continuous joint positions, predicted as a chunk of length k (default: 100 at 50 Hz = 2 seconds)
- Loss: L1 regression on action chunks + KL divergence on the style latent
- Inference speed: Fast — a single forward pass produces the full chunk; effective control at 50 Hz with chunk reuse
- Parameters: ~85M (ACT with ResNet-50 + transformer)
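Temporal ensembling, mentioned above, blends every outstanding chunk prediction that still covers the current timestep. A minimal NumPy sketch, assuming the exponential weighting scheme from the ACT paper (index 0 is the oldest prediction; the function name and data layout are illustrative, not LeRobot's API):

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.01):
    """Average all chunk predictions covering timestep t.

    chunks: list of (start_time, chunk) pairs sorted oldest first,
    where each chunk is a (k, action_dim) array predicted at start_time.
    Weights are w_i = exp(-m * i) with i = 0 for the oldest prediction,
    as in the ACT paper; m trades smoothness against reactivity.
    """
    actions, weights = [], []
    i = 0
    for start, chunk in chunks:
        offset = t - start
        if 0 <= offset < len(chunk):        # this chunk still covers t
            actions.append(chunk[offset])
            weights.append(np.exp(-m * i))  # i = 0 for the oldest prediction
            i += 1
    w = np.asarray(weights) / np.sum(weights)
    return (np.stack(actions) * w[:, None]).sum(axis=0)
```

Because every chunk covering timestep t contributes, a single bad prediction is smoothed out rather than executed verbatim.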
What Is Diffusion Policy?
Diffusion Policy was introduced by Chi et al. at Columbia and MIT in "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023). It reframes robot policy learning as a conditional denoising diffusion process. At training time, noise is progressively added to expert action sequences. The model — a U-Net or transformer — learns to reverse this process: given noisy actions and the current observation, predict the clean action.
At inference, actions are sampled by starting from Gaussian noise and iteratively denoising for T steps (typically 10–100 steps with DDPM, or 1–10 steps with DDIM acceleration). This produces a full action prediction horizon (typically 8–16 steps) at each call. The multimodal nature of the diffusion process means the policy can represent multiple valid ways to complete a task from the same observation.
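The forward/reverse process described above reduces to a simple training step: corrupt the expert action chunk with scheduled noise, then regress the noise. A NumPy sketch, where the linear beta schedule and 100 timesteps are illustrative defaults (not tuned values) and `predict_noise` stands in for the U-Net or transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                                   # training diffusion timesteps
betas = np.linspace(1e-4, 0.02, T)        # DDPM-style linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def diffusion_training_step(actions, predict_noise):
    """One denoising training step on a (horizon, action_dim) expert chunk."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(actions.shape)
    a_bar = alphas_cumprod[t]
    # forward process: interpolate between clean actions and pure noise
    noisy = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps
    pred = predict_noise(noisy, t)
    return np.mean((pred - eps) ** 2)     # denoising score-matching (MSE) loss
```

At inference this is run in reverse: start from Gaussian noise and repeatedly apply the trained noise predictor to recover a clean action chunk.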
Diffusion Policy Architecture at a Glance
- Backbone: CNN encoder for images; noise prediction network is either a U-Net over the action sequence or a transformer (DiT-style)
- Action representation: Continuous joint positions or end-effector poses, over a prediction horizon H (typically 8–16 steps)
- Loss: Denoising score matching (MSE on noise prediction)
- Inference speed: Slower than ACT — requires T denoising steps per action query; DDIM reduces this significantly (10-step inference practical)
- Parameters: ~70–300M depending on backbone choice
Head-to-Head Comparison
| Dimension | ACT | Diffusion Policy |
|---|---|---|
| Core mechanism | CVAE + transformer, predicts action chunk directly | Conditional denoising diffusion over action horizon |
| Multimodal action distributions | Limited — CVAE latent provides some coverage but collapses on complex distributions | Excellent — diffusion naturally represents multimodal distributions |
| Data efficiency | High — works well with 50–200 demos | Moderate — typically needs 100–500 demos; benefits more from scale |
| Inference speed | Fast — single forward pass, chunk reuse; ~5ms/query on A100 | Slower — 10–100 denoising steps; ~50–200ms with DDPM, ~15ms with DDIM on A100 |
| Real-time control | Excellent — designed for 50 Hz with temporal ensemble | Feasible with DDIM + receding-horizon execution; requires tuning |
| Task types where it wins | Bimanual manipulation, ALOHA-style tasks, precise assembly, long-horizon tasks with clear structure | Dexterous grasping, contact-rich tasks, tasks with multiple valid execution modes |
| Hardware requirements | Single GPU (RTX 3090 sufficient); ~8GB VRAM for inference | Single GPU; DDPM needs ~16GB, DDIM feasible on RTX 3090; more latency-sensitive |
| Implementation complexity | Moderate — well-documented, LeRobot reference implementation | Higher — denoising schedule tuning, inference step count tradeoffs |
| Training time (100 demos) | ~4–8 hours on a single RTX 4090 | ~6–12 hours on a single RTX 4090 |
| Hyperparameter sensitivity | Moderate — chunk size and KL weight matter | High — noise schedule, diffusion steps, and horizon are all critical |
| Open-source maturity | Very high — original codebase + LeRobot | High — original codebase + LeRobot + multiple reproductions |
Benchmark Numbers
The following numbers come from reported results in papers and reproducibility studies. Task success rates vary significantly with demo quality, camera setup, and robot calibration — treat these as directional, not absolute.
LIBERO Benchmark (Chi et al. reproductions, 2024)
| Suite | ACT Success Rate | Diffusion Policy Success Rate |
|---|---|---|
| LIBERO-Spatial (10 tasks) | 84.7% | 78.2% |
| LIBERO-Object (10 tasks) | 90.1% | 82.4% |
| LIBERO-Goal (10 tasks) | 71.6% | 68.9% |
| LIBERO-Long (10 tasks) | 53.8% | 48.1% |
ALOHA Bimanual Tasks (Original ACT paper, Zhao et al. 2023)
| Task | ACT (50 demos) | BC-Transformer (50 demos) | Diffusion Policy (50 demos) |
|---|---|---|---|
| Slot insertion | 62% | 24% | 38% |
| Cup unstacking | 98% | 72% | 84% |
| Bag transfer | 100% | 44% | 58% |
RLBench (Diffusion Policy paper, Chi et al. 2023)
On tasks with multimodal action distributions (e.g., picking from multiple valid approach directions), Diffusion Policy shows a strong advantage over regression-based policies, including standard ACT: +15–25% success rate on tasks like "push buttons" and "open drawer" where the approach angle is ambiguous. ACT's L1 objective pulls predictions toward the mean of the modes and fails on these; Diffusion Policy samples one mode and commits to it.
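The mode-averaging failure is easy to reproduce in a toy setting. Suppose the expert approaches from the left (-1) or the right (+1) with equal probability: the MSE-optimal point prediction sits near 0, which is not a valid action, while a sampler commits to one mode. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal expert data: approach direction is -1 or +1 with equal probability
demos = rng.choice([-1.0, 1.0], size=10_000)

regressed = demos.mean()     # what an MSE-trained policy converges to
sampled = rng.choice(demos)  # what a generative (e.g., diffusion) policy can emit

print(f"regression output ~ {regressed:+.3f} (invalid: between the modes)")
print(f"sampled output    = {sampled:+.1f} (a valid mode)")
```

ACT's L1 loss averages less aggressively than MSE (it converges to the conditional median rather than the mean), but on a symmetric bimodal distribution the failure is the same.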
When to Choose ACT
- Bimanual manipulation: ACT was designed for ALOHA bimanual setups and excels at synchronized two-arm coordination. The action chunk encodes the temporal correlation between both arms naturally.
- Precise assembly tasks: Peg insertion, lid placement, plug insertion. The chunking mechanism reduces compounding positioning errors.
- Low demo budgets: If you have fewer than 100 demonstrations, ACT typically converges more reliably. The CVAE latent provides sufficient expressivity without requiring thousands of examples to fill out a diffusion manifold.
- Speed-sensitive deployment: If you need to run at 50 Hz on modest hardware (e.g., a local workstation GPU), ACT's single-pass inference is significantly more tractable than DDPM diffusion.
- Long-horizon tasks with clear structure: ACT's 2-second chunks are well-suited to tasks that decompose into phases (reach, grasp, place) where each phase has low within-phase variance.
When to Choose Diffusion Policy
- Dexterous manipulation with contact uncertainty: When approach angle, grasp orientation, or finger placement has multiple valid options, diffusion naturally represents the full distribution.
- Tasks with ambiguous expert demonstrations: If your teleoperation data contains multiple distinct strategies for the same task start state, Diffusion Policy will not average them into a bad policy the way regression-based methods do.
- High-DoF dexterous hands: Finger trajectory planning has high intrinsic multimodality. Diffusion Policy performs better on DEXMV-style tasks than ACT.
- Larger data budgets at scale: Diffusion Policy's performance scales better with dataset size because the denoising objective does not collapse representations the way CVAE KL regularization can at scale.
- State-based (non-visual) policies: When operating from proprioceptive state alone, the U-Net backbone of Diffusion Policy is particularly efficient and well-studied.
The Bimanual Case: ACT Wins
For bimanual manipulation — the task class defined by ALOHA, DK1, and similar platforms — ACT has a systematic advantage. The action chunk encodes the temporal relationship between left and right arm actions directly in the prediction target. The robot executes the full chunk, which means coordination is "baked in" to the predicted trajectory rather than emerging from two separately-queried policies.
Diffusion Policy can be adapted for bimanual use by concatenating both arm action spaces, but the increased dimensionality makes the denoising problem harder and data requirements grow substantially. In practice, labs doing bimanual research (Stanford ALOHA group, UMass, CMU) default to ACT or ACT+ variants for this reason.
Implementation: LeRobot Code Snippets
Both algorithms are available in HuggingFace LeRobot. Below are the minimal training invocations.
Training ACT with LeRobot
```shell
python lerobot/scripts/train.py \
    policy=act \
    env=aloha \
    dataset_repo_id=lerobot/aloha_sim_insertion_human \
    hydra.run.dir=outputs/train/act_insertion \
    training.num_workers=4 \
    training.batch_size=8 \
    training.num_epochs=2000 \
    policy.chunk_size=100 \
    policy.n_action_steps=100 \
    policy.temporal_ensemble_momentum=0.01
```
Key ACT hyperparameters to tune:
- `chunk_size`: Number of timesteps predicted per forward pass. Default 100 at 50 Hz = 2 seconds. Reduce to 50 for faster tasks.
- `temporal_ensemble_momentum`: Controls how aggressively overlapping chunks are blended. Lower = smoother but more lagged.
- `kl_weight`: Weight on the CVAE KL term. Default 10.0. Increase if the policy mode-collapses on simple tasks.
Training Diffusion Policy with LeRobot
```shell
python lerobot/scripts/train.py \
    policy=diffusion \
    env=pusht \
    dataset_repo_id=lerobot/pusht \
    hydra.run.dir=outputs/train/diffusion_pusht \
    training.num_workers=4 \
    training.batch_size=64 \
    training.num_epochs=2000 \
    policy.n_action_steps=8 \
    policy.horizon=16 \
    policy.num_inference_steps=100 \
    policy.num_train_timesteps=100
```
Key Diffusion Policy hyperparameters to tune:
- `num_inference_steps`: DDIM denoising steps at inference. Default 100 (DDPM). Set to 10–20 for real-time use (DDIM scheduler).
- `horizon`: Full prediction window. Higher = smoother trajectories; lower = more reactive. 16 is a good starting point for table-top tasks.
- `n_action_steps`: How many actions from the horizon to actually execute before re-querying. Typically horizon/2.
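The relationship between `horizon` and `n_action_steps` is a receding-horizon loop: predict a full window, execute only its first few actions, then re-query with the fresh observation. A sketch with stub policy/environment interfaces (the method names here are placeholders, not the LeRobot API):

```python
import numpy as np

def receding_horizon_rollout(policy, env, n_action_steps=8, max_queries=200):
    """Execute n_action_steps actions from each predicted horizon,
    then re-query the policy with the fresh observation."""
    obs = env.reset()
    for _ in range(max_queries):
        plan = policy(obs)                  # (horizon, action_dim) prediction
        for action in plan[:n_action_steps]:
            obs, done = env.step(action)
            if done:
                return obs
    return obs
```

Executing only half the horizon keeps the policy reactive: the tail of each prediction is discarded before errors there can compound.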
Switching Between Schedulers (DDPM vs DDIM)
```shell
# In your config or Python override:
# use DDIM for fast inference (10 steps instead of 100)
policy.noise_scheduler._target_=diffusers.DDIMScheduler
policy.num_inference_steps=10

# DDIM produces near-identical quality at 10x faster inference
# Benchmark: 100-step DDPM ~180ms → 10-step DDIM ~18ms on RTX 3090
```
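Conceptually, DDIM's speedup comes from denoising on a strided subset of the training timesteps rather than all of them. An illustrative NumPy sketch for 100 → 10 steps; the even "leading" stride shown here is an assumption about the scheduler's default spacing, stated for intuition rather than as the exact diffusers behavior:

```python
import numpy as np

num_train_timesteps = 100  # DDPM training schedule length
num_inference_steps = 10   # DDIM inference budget

stride = num_train_timesteps // num_inference_steps
ddim_timesteps = np.arange(0, num_train_timesteps, stride)[::-1]

print(ddim_timesteps)  # denoise at t = 90, 80, ..., 10, 0 instead of 99..0
```

Each DDIM update jumps `stride` noise levels at once using a deterministic update rule, which is why quality degrades gracefully as the step count shrinks.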
ACT+ and Variants Worth Knowing
ACT+ (from the "ALOHA Unleashed" paper, 2024) extends ACT with additional training tricks: larger ResNet backbone, improved data augmentation, and joint position + velocity targets. It achieves ~15% higher success rates than vanilla ACT on the same ALOHA tasks. If you are starting a new project with ALOHA or DK1 hardware, prefer ACT+ over vanilla ACT.
Diffusion Policy Transformer (DP-T) replaces the U-Net noise predictor with a transformer, enabling better long-range temporal modeling and scaling. On complex tasks (30+ second episodes), DP-T outperforms the CNN-based variant by ~8% success rate but requires more compute and data.
The Bottom Line: Decision Tree
Follow this decision process:
- If bimanual manipulation → ACT (or ACT+)
- If under 100 demos → ACT
- If need <10ms inference → ACT
- If task has multimodal action distribution (multiple valid grasps, approach angles) → Diffusion Policy
- If dexterous hand control (5+ finger DoF) → Diffusion Policy
- If large dataset (>500 demos) and scaling → Diffusion Policy
- If unsure → train both on a 50-demo pilot split and compare; budget ~2 days of compute
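The decision tree above can be written down directly. The thresholds below mirror the bullets and are heuristics, not hard rules:

```python
def choose_policy(bimanual: bool, n_demos: int, latency_budget_ms: float,
                  multimodal: bool, dexterous_hand: bool) -> str:
    """Heuristic encoding of the decision tree; treat the thresholds
    as rules of thumb, not hard cutoffs."""
    if bimanual:
        return "ACT"  # or ACT+ on ALOHA/DK1-style hardware
    if n_demos < 100 or latency_budget_ms < 10:
        return "ACT"  # small data budget or tight inference latency
    if multimodal or dexterous_hand or n_demos > 500:
        return "Diffusion Policy"
    return "pilot both on a 50-demo split"
```

Note the ordering matters: the bimanual and data/latency checks come first because they dominate the other considerations in practice.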
Data Collection Considerations
For either algorithm, demonstration quality matters more than algorithm choice for small data regimes (<200 demos). Consistent teleoperation speed, smooth trajectories, and good camera coverage dominate. We offer data collection services — teleoperation, HDF5/LeRobot formatting, QA — for teams that want to iterate fast without building the full collection pipeline.
For hardware: ACT was designed around ALOHA (low-cost bimanual). Diffusion Policy is hardware-agnostic. Both run on OpenArm 101 and DK1 with the LeRobot SDK.