ACT vs Diffusion Policy: Which Robot Learning Algorithm Should You Use? (2025)
A deep-dive comparison for researchers and ML engineers choosing between Action Chunking with Transformers and Diffusion Policy for imitation learning.
Both algorithms learn from demonstrations — the key difference is how they represent and generate actions
ACT (Action Chunking with Transformers) and Diffusion Policy are the two most widely used imitation learning algorithms in robot manipulation research as of 2025. Both have shipped in production research labs and on real hardware. Choosing between them is not a matter of one being universally better — it depends on your task structure, data budget, inference constraints, and how much implementation complexity you can absorb. This article lays out the differences precisely enough to make that call.
What Is ACT (Action Chunking with Transformers)?
ACT was introduced by Tony Zhao et al. at Stanford in the paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (RSS 2023). The key insight is action chunking: instead of predicting one action at a time, the policy predicts a short sequence — a chunk — of future actions (typically 50–100 timesteps at 50 Hz). This reduces the effective decision frequency and mitigates compounding errors from single-step prediction.
ACT uses a CVAE (Conditional Variational Autoencoder) architecture. During training, an encoder maps the full demonstrated action sequence into a style latent z. A transformer decoder then predicts the action chunk conditioned on current image observations and z. At inference, z is sampled from the prior (a unit Gaussian). Temporal ensemble — averaging overlapping chunk predictions — further smooths execution.
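The training objective described above (L1 reconstruction on the chunk plus KL regularization on the style latent) can be sketched in a few lines of NumPy. This is a minimal sketch, not the reference implementation; the shapes and the `kl_weight` default of 10.0 follow the values quoted later in this article:

```python
import numpy as np

def act_loss(pred_chunk, target_chunk, mu, logvar, kl_weight=10.0):
    """ACT objective: L1 regression on the action chunk plus a KL term
    pulling the CVAE style latent q(z|...) = N(mu, exp(logvar)) toward
    a unit Gaussian prior. Shapes (illustrative, not the official API):
    chunks are (k, action_dim); mu and logvar are (latent_dim,)."""
    l1 = np.abs(pred_chunk - target_chunk).mean()
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return l1 + kl_weight * kl
```

At inference the encoder is dropped and z is set to the prior mean, which is why the KL term matters: it keeps the decoder usable without the encoder.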
ACT Architecture at a Glance
- Backbone: ResNet-18 or ResNet-50 for image encoding; standard transformer encoder-decoder
- Action representation: Continuous joint positions, predicted as a chunk of length k (default: 100 at 50 Hz = 2 seconds)
- Loss: L1 regression on action chunks + KL divergence on the style latent
- Inference speed: Fast — a single forward pass produces the full chunk; effective control at 50 Hz with chunk reuse
- Parameters: ~85M (ACT with ResNet-50 + transformer)
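Temporal ensembling, mentioned above, blends every outstanding chunk prediction that still covers the current timestep. A minimal NumPy sketch, assuming the exponential weighting scheme from the ACT paper (index 0 is the oldest prediction; the function name and data layout are illustrative, not LeRobot's API):

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.01):
    """Average all chunk predictions covering timestep t.

    chunks: list of (start_time, chunk) pairs sorted oldest first,
    where each chunk is a (k, action_dim) array predicted at start_time.
    Weights are w_i = exp(-m * i) with i = 0 for the oldest prediction,
    as in the ACT paper; m trades smoothness against reactivity.
    """
    actions, weights = [], []
    i = 0
    for start, chunk in chunks:
        offset = t - start
        if 0 <= offset < len(chunk):        # this chunk still covers t
            actions.append(chunk[offset])
            weights.append(np.exp(-m * i))  # i = 0 for the oldest prediction
            i += 1
    w = np.asarray(weights) / np.sum(weights)
    return (np.stack(actions) * w[:, None]).sum(axis=0)
```

Because every chunk covering timestep t contributes, a single bad prediction is smoothed out rather than executed verbatim.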
What Is Diffusion Policy?
Diffusion Policy was introduced by Chi et al. at Columbia and MIT in "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023). It reframes robot policy learning as a conditional denoising diffusion process. At training time, noise is progressively added to expert action sequences. The model — a U-Net or transformer — learns to reverse this process: given noisy actions and the current observation, predict the clean action.
At inference, actions are sampled by starting from Gaussian noise and iteratively denoising for T steps (typically 10–100 steps with DDPM, or 1–10 steps with DDIM acceleration). This produces a full action prediction horizon (typically 8–16 steps) at each call. The multimodal nature of the diffusion process means the policy can represent multiple valid ways to complete a task from the same observation.
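The forward/reverse process described above reduces to a simple training step: corrupt the expert action chunk with scheduled noise, then regress the noise. A NumPy sketch, where the linear beta schedule and 100 timesteps are illustrative defaults (not tuned values) and `predict_noise` stands in for the U-Net or transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                                   # training diffusion timesteps
betas = np.linspace(1e-4, 0.02, T)        # DDPM-style linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def diffusion_training_step(actions, predict_noise):
    """One denoising training step on a (horizon, action_dim) expert chunk."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(actions.shape)
    a_bar = alphas_cumprod[t]
    # forward process: interpolate between clean actions and pure noise
    noisy = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps
    pred = predict_noise(noisy, t)
    return np.mean((pred - eps) ** 2)     # denoising score-matching (MSE) loss
```

At inference this is run in reverse: start from Gaussian noise and repeatedly apply the trained noise predictor to recover a clean action chunk.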
Diffusion Policy Architecture at a Glance
- Backbone: CNN encoder for images; noise prediction network is either a U-Net over the action sequence or a transformer (DiT-style)
- Action representation: Continuous joint positions or end-effector poses, over a prediction horizon H (typically 8–16 steps)
- Loss: Denoising score matching (MSE on noise prediction)
- Inference speed: Slower than ACT — requires T denoising steps per action query; DDIM reduces this significantly (10-step inference practical)
- Parameters: ~70–300M depending on backbone choice
Head-to-Head Comparison
| Dimension | ACT | Diffusion Policy |
|---|---|---|
| Core mechanism | CVAE + transformer, predicts action chunk directly | Conditional denoising diffusion over action horizon |
| Multimodal action distributions | Limited — CVAE latent provides some coverage but collapses on complex distributions | Excellent — diffusion naturally represents multimodal distributions |
| Data efficiency | High — works well with 50–200 demos | Moderate — typically needs 100–500 demos; benefits more from scale |
| Inference speed | Fast — single forward pass, chunk reuse; ~5ms/query on A100 | Slower — 10–100 denoising steps; ~50–200ms with DDPM, ~15ms with DDIM on A100 |
| Real-time control | Excellent — designed for 50 Hz with temporal ensemble | Feasible with DDIM + receding-horizon execution; requires tuning |
| Task types where it wins | Bimanual manipulation, ALOHA-style tasks, precise assembly, long-horizon tasks with clear structure | Dexterous grasping, contact-rich tasks, tasks with multiple valid execution modes |
| Hardware requirements | Single GPU (RTX 3090 sufficient); ~8GB VRAM for inference | Single GPU; DDPM needs ~16GB, DDIM feasible on RTX 3090; more latency-sensitive |
| Implementation complexity | Moderate — well-documented, LeRobot reference implementation | Higher — denoising schedule tuning, inference step count tradeoffs |
| Training time (100 demos) | ~4–8 hours on a single RTX 4090 | ~6–12 hours on a single RTX 4090 |
| Hyperparameter sensitivity | Moderate — chunk size and KL weight matter | High — noise schedule, diffusion steps, and horizon are all critical |
| Open-source maturity | Very high — original codebase + LeRobot | High — original codebase + LeRobot + multiple reproductions |
Benchmark Numbers
The following numbers come from reported results in papers and reproducibility studies. Task success rates vary significantly with demo quality, camera setup, and robot calibration — treat these as directional, not absolute.
LIBERO Benchmark (Chi et al. reproductions, 2024)
| Suite | ACT Success Rate | Diffusion Policy Success Rate |
|---|---|---|
| LIBERO-Spatial (10 tasks) | 84.7% | 78.2% |
| LIBERO-Object (10 tasks) | 90.1% | 82.4% |
| LIBERO-Goal (10 tasks) | 71.6% | 68.9% |
| LIBERO-Long (10 tasks) | 53.8% | 48.1% |
ALOHA Bimanual Tasks (Original ACT paper, Zhao et al. 2023)
| Task | ACT (50 demos) | BC-Transformer (50 demos) | Diffusion Policy (50 demos) |
|---|---|---|---|
| Slot insertion | 62% | 24% | 38% |
| Cup unstacking | 98% | 72% | 84% |
| Bag transfer | 100% | 44% | 58% |
RLBench (Diffusion Policy paper, Chi et al. 2023)
On tasks with multimodal action distributions (e.g., picking from multiple valid approach directions), Diffusion Policy shows a strong advantage over regression-based policies, including standard ACT: +15–25% success rate on tasks like "push buttons" and "open drawer" where the approach angle is ambiguous. ACT's L1 objective pulls predictions toward the mean of the modes and fails on these; Diffusion Policy samples one mode and commits to it.
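The mode-averaging failure is easy to reproduce in a toy setting. Suppose the expert approaches from the left (-1) or the right (+1) with equal probability: the MSE-optimal point prediction sits near 0, which is not a valid action, while a sampler commits to one mode. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal expert data: approach direction is -1 or +1 with equal probability
demos = rng.choice([-1.0, 1.0], size=10_000)

regressed = demos.mean()     # what an MSE-trained policy converges to
sampled = rng.choice(demos)  # what a generative (e.g., diffusion) policy can emit

print(f"regression output ~ {regressed:+.3f} (invalid: between the modes)")
print(f"sampled output    = {sampled:+.1f} (a valid mode)")
```

ACT's L1 loss averages less aggressively than MSE (it converges to the conditional median rather than the mean), but on a symmetric bimodal distribution the failure is the same.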
When to Choose ACT
- Bimanual manipulation: ACT was designed for ALOHA bimanual setups and excels at synchronized two-arm coordination. The action chunk encodes the temporal correlation between both arms naturally.
- Precise assembly tasks: Peg insertion, lid placement, plug insertion. The chunking mechanism reduces compounding positioning errors.
- Low demo budgets: If you have fewer than 100 demonstrations, ACT typically converges more reliably. The CVAE latent provides sufficient expressivity without requiring thousands of examples to fill out a diffusion manifold.
- Speed-sensitive deployment: If you need to run at 50 Hz on modest hardware (e.g., a local workstation GPU), ACT's single-pass inference is significantly more tractable than DDPM diffusion.
- Long-horizon tasks with clear structure: ACT's 2-second chunks are well-suited to tasks that decompose into phases (reach, grasp, place) where each phase has low within-phase variance.
When to Choose Diffusion Policy
- Dexterous manipulation with contact uncertainty: When approach angle, grasp orientation, or finger placement has multiple valid options, diffusion naturally represents the full distribution.
- Tasks with ambiguous expert demonstrations: If your teleoperation data contains multiple distinct strategies for the same task start state, Diffusion Policy will not average them into a bad policy the way regression-based methods do.
- High-DoF dexterous hands: Finger trajectory planning has high intrinsic multimodality. Diffusion Policy performs better on DEXMV-style tasks than ACT.
- Larger data budgets at scale: Diffusion Policy's performance scales better with dataset size because the denoising objective does not collapse representations the way CVAE KL regularization can at scale.
- State-based (non-visual) policies: When operating from proprioceptive state alone, the U-Net backbone of Diffusion Policy is particularly efficient and well-studied.
The Bimanual Case: ACT Wins
For bimanual manipulation — the task class defined by ALOHA, DK1, and similar platforms — ACT has a systematic advantage. The action chunk encodes the temporal relationship between left and right arm actions directly in the prediction target. The robot executes the full chunk, which means coordination is "baked in" to the predicted trajectory rather than emerging from two separately-queried policies.
Diffusion Policy can be adapted for bimanual use by concatenating both arm action spaces, but the increased dimensionality makes the denoising problem harder and data requirements grow substantially. In practice, labs doing bimanual research (Stanford ALOHA group, UMass, CMU) default to ACT or ACT+ variants for this reason.
Implementation: LeRobot Code Snippets
Both algorithms are available in HuggingFace LeRobot. Below are the minimal training invocations.
Training ACT with LeRobot
```shell
python lerobot/scripts/train.py \
    policy=act \
    env=aloha \
    dataset_repo_id=lerobot/aloha_sim_insertion_human \
    hydra.run.dir=outputs/train/act_insertion \
    training.num_workers=4 \
    training.batch_size=8 \
    training.num_epochs=2000 \
    policy.chunk_size=100 \
    policy.n_action_steps=100 \
    policy.temporal_ensemble_momentum=0.01
```
Key ACT hyperparameters to tune:
- `chunk_size`: Number of timesteps predicted per forward pass. Default 100 at 50 Hz = 2 seconds. Reduce to 50 for faster tasks.
- `temporal_ensemble_momentum`: Controls how aggressively overlapping chunks are blended. Lower = smoother but more lagged.
- `kl_weight`: Weight on the CVAE KL term. Default 10.0. Increase if the policy mode-collapses on simple tasks.
Training Diffusion Policy with LeRobot
```shell
python lerobot/scripts/train.py \
    policy=diffusion \
    env=pusht \
    dataset_repo_id=lerobot/pusht \
    hydra.run.dir=outputs/train/diffusion_pusht \
    training.num_workers=4 \
    training.batch_size=64 \
    training.num_epochs=2000 \
    policy.n_action_steps=8 \
    policy.horizon=16 \
    policy.num_inference_steps=100 \
    policy.num_train_timesteps=100
```
Key Diffusion Policy hyperparameters to tune:
- `num_inference_steps`: DDIM denoising steps at inference. Default 100 (DDPM). Set to 10–20 for real-time use (DDIM scheduler).
- `horizon`: Full prediction window. Higher = smoother trajectories; lower = more reactive. 16 is a good starting point for table-top tasks.
- `n_action_steps`: How many actions from the horizon to actually execute before re-querying. Typically horizon/2.
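The relationship between `horizon` and `n_action_steps` is a receding-horizon loop: predict a full window, execute only its first few actions, then re-query with the fresh observation. A sketch with stub policy/environment interfaces (the method names here are placeholders, not the LeRobot API):

```python
import numpy as np

def receding_horizon_rollout(policy, env, n_action_steps=8, max_queries=200):
    """Execute n_action_steps actions from each predicted horizon,
    then re-query the policy with the fresh observation."""
    obs = env.reset()
    for _ in range(max_queries):
        plan = policy(obs)                  # (horizon, action_dim) prediction
        for action in plan[:n_action_steps]:
            obs, done = env.step(action)
            if done:
                return obs
    return obs
```

Executing only half the horizon keeps the policy reactive: the tail of each prediction is discarded before errors there can compound.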
Switching Between Schedulers (DDPM vs DDIM)
```shell
# In your config or Python override:
# use DDIM for fast inference (10 steps instead of 100)
policy.noise_scheduler._target_=diffusers.DDIMScheduler
policy.num_inference_steps=10

# DDIM produces near-identical quality at 10x faster inference
# Benchmark: 100-step DDPM ~180ms → 10-step DDIM ~18ms on RTX 3090
```
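Conceptually, DDIM's speedup comes from denoising on a strided subset of the training timesteps rather than all of them. An illustrative NumPy sketch for 100 → 10 steps; the even "leading" stride shown here is an assumption about the scheduler's default spacing, stated for intuition rather than as the exact diffusers behavior:

```python
import numpy as np

num_train_timesteps = 100  # DDPM training schedule length
num_inference_steps = 10   # DDIM inference budget

stride = num_train_timesteps // num_inference_steps
ddim_timesteps = np.arange(0, num_train_timesteps, stride)[::-1]

print(ddim_timesteps)  # denoise at t = 90, 80, ..., 10, 0 instead of 99..0
```

Each DDIM update jumps `stride` noise levels at once using a deterministic update rule, which is why quality degrades gracefully as the step count shrinks.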
ACT+ and Variants Worth Knowing
ACT+ (from the "ALOHA Unleashed" paper, 2024) extends ACT with additional training tricks: larger ResNet backbone, improved data augmentation, and joint position + velocity targets. It achieves ~15% higher success rates than vanilla ACT on the same ALOHA tasks. If you are starting a new project with ALOHA or DK1 hardware, prefer ACT+ over vanilla ACT.
Diffusion Policy Transformer (DP-T) replaces the U-Net noise predictor with a transformer, enabling better long-range temporal modeling and scaling. On complex tasks (30+ second episodes), DP-T outperforms the CNN-based variant by ~8% success rate but requires more compute and data.
The Bottom Line: Decision Tree
Follow this decision process:
- If bimanual manipulation → ACT (or ACT+)
- If under 100 demos → ACT
- If need <10ms inference → ACT
- If task has multimodal action distribution (multiple valid grasps, approach angles) → Diffusion Policy
- If dexterous hand control (5+ finger DoF) → Diffusion Policy
- If large dataset (>500 demos) and scaling → Diffusion Policy
- If unsure → train both on a 50-demo pilot split and compare; budget ~2 days of compute
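The decision tree above can be written down directly. The thresholds below mirror the bullets and are heuristics, not hard rules:

```python
def choose_policy(bimanual: bool, n_demos: int, latency_budget_ms: float,
                  multimodal: bool, dexterous_hand: bool) -> str:
    """Heuristic encoding of the decision tree; treat the thresholds
    as rules of thumb, not hard cutoffs."""
    if bimanual:
        return "ACT"  # or ACT+ on ALOHA/DK1-style hardware
    if n_demos < 100 or latency_budget_ms < 10:
        return "ACT"  # small data budget or tight inference latency
    if multimodal or dexterous_hand or n_demos > 500:
        return "Diffusion Policy"
    return "pilot both on a 50-demo split"
```

Note the ordering matters: the bimanual and data/latency checks come first because they dominate the other considerations in practice.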
Data Collection Considerations
For either algorithm, demonstration quality matters more than algorithm choice for small data regimes (<200 demos). Consistent teleoperation speed, smooth trajectories, and good camera coverage dominate. We offer data collection services — teleoperation, HDF5/LeRobot formatting, QA — for teams that want to iterate fast without building the full collection pipeline.
For hardware: ACT was designed around ALOHA (low-cost bimanual). Diffusion Policy is hardware-agnostic. Both run on OpenArm 101 and DK1 with the LeRobot SDK.