Diffusion Policy vs ACT: Which Imitation Learning Algorithm to Choose?

Two of the most widely deployed imitation learning algorithms in modern robotics — one built on denoising diffusion, the other a CVAE transformer from the ALOHA project. Which captures your teleop demonstrations best?

Updated April 2026 · Action chunking · MIT-licensed
TL;DR: Diffusion Policy models the multi-modal distribution of expert actions via DDPM denoising — the gold standard when demonstrations show multiple valid ways to perform a task. ACT (Action Chunking Transformer), introduced with the ALOHA hardware and later reused by Mobile ALOHA, is a CVAE transformer that predicts chunks of future actions conditioned on a learned latent — dead simple, trains fast, and shines on bimanual teleop data. Diffusion Policy wins on multi-modal robustness; ACT wins on simplicity, training speed, and bimanual fit.

Why this comparison matters

When you collect your own teleop demonstrations — on an OpenArm, ALOHA, or Unitree G1 — you will train either Diffusion Policy or ACT long before you touch a foundation VLA. These two architectures define the current default for pure imitation learning: both use action chunking, both are MIT-licensed, both have reference implementations in LeRobot, and both are routinely cited as baselines in 2024–2026 robot learning papers.

But they have very different inductive biases. If you pick the wrong one for your demonstration distribution, your policy will either collapse to the mean (bad) or memorize the demos (also bad). This page walks through when each shines, where each fails, and how to choose.

At-a-glance comparison

| Dimension | Diffusion Policy | ACT (Action Chunking Transformer) |
| --- | --- | --- |
| Origin | Chi et al., Columbia, RSS 2023 | Zhao et al., Stanford (ALOHA), RSS 2023 |
| Architecture | CNN or transformer backbone + conditional diffusion (DDPM) action head | ResNet visual encoder + transformer encoder-decoder with CVAE latent |
| Action representation | Chunk of future actions denoised from Gaussian noise | Chunk of future actions decoded in one parallel pass (DETR-style queries) |
| Training objective | Denoising score matching / DDPM loss | Reconstruction + KL regularizer (CVAE) |
| Typical chunk size | 8–16 steps | 50–100 steps (aggressive chunking) |
| Multi-modal capture | Strong — native to the diffusion formulation | Partial — latent captures some multi-modality but can mode-collapse |
| Data needed | ~50–200 demos for narrow tasks; more helps | ~50 demos often enough for bimanual teleop tasks |
| Training time | Slower (diffusion steps increase wall-clock) | Fast — minutes to hours on a single GPU |
| Inference speed | Needs iterative denoising; DDIM or consistency models help | Fast single forward pass; chunked execution amortizes |
| Best-known wins | Push-T, real-world manipulation with multi-modal demos (+46.9% vs prior methods) | ALOHA and Mobile ALOHA bimanual tasks, dexterous bimanual manipulation |
| License | MIT | MIT |
| Code | github.com/real-stanford/diffusion_policy | github.com/tonyzhaozh/act |

Diffusion Policy: modeling the distribution, not the mean

The Diffusion Policy insight is that expert demonstrations are almost always multi-modal. A human teleoperator grasping a mug might use a side grasp or a top grasp on different trials; a classical MSE or cross-entropy policy trained on that data will average the two and produce neither. Diffusion Policy sidesteps this by modeling the distribution of actions via a denoising diffusion probabilistic model (DDPM). At inference time, you sample a noise vector and iteratively denoise it into an action chunk conditioned on visual observations.
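The training side of this is compact enough to sketch. The following is an illustrative numpy toy of the DDPM noise-prediction objective on an action chunk, not the reference implementation: the linear beta schedule and the stand-in denoiser are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: an action "chunk" is T timesteps x D action dims.
T, D = 8, 2
K = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, K)      # illustrative linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddpm_training_loss(denoiser, clean_chunk, obs):
    """One DDPM training sample: corrupt the expert chunk at a random
    noise level k, ask the denoiser to predict the added noise, take MSE."""
    k = rng.integers(K)
    eps = rng.standard_normal(clean_chunk.shape)
    noisy = (np.sqrt(alpha_bars[k]) * clean_chunk
             + np.sqrt(1.0 - alpha_bars[k]) * eps)
    eps_hat = denoiser(noisy, k, obs)   # network conditioned on obs
    return float(np.mean((eps_hat - eps) ** 2))

# With a stand-in "network" that predicts random noise, the loss is
# positive and finite; a trained denoiser drives it toward zero.
chunk = rng.standard_normal((T, D))
loss_random = ddpm_training_loss(
    lambda x, k, o: rng.standard_normal(x.shape), chunk, obs=None)
```

The key point the sketch makes concrete: the network never regresses actions directly, so it never averages two valid grasps into an invalid one.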

The original Diffusion Policy paper reported a 46.9% average improvement over the previous best imitation learning methods across 15 tasks, and the architecture became a go-to backbone for downstream work like Octo and 3D Diffusion Policy. The cost is inference: a naive implementation runs K diffusion steps per chunk, which can push control loops below 10 Hz. Most production deployments use DDIM with 5–10 steps, or a consistency-distilled variant, to get real-time control.
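A few-step DDIM sampler of the kind used in those deployments can be sketched as follows. This is a toy numpy version with an illustrative schedule; `eps_model` stands in for the trained denoising network, and the deterministic (eta = 0) DDIM update is used.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 2
K = 100                                 # training-time diffusion steps
betas = np.linspace(1e-4, 0.02, K)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_sample(eps_model, obs, n_steps=10):
    """Deterministic DDIM: start from Gaussian noise and take n_steps
    large denoising jumps through a subsampled schedule, instead of
    all K steps. This is what makes real-time control feasible."""
    x = rng.standard_normal((T, D))
    ks = np.linspace(K - 1, 0, n_steps).round().astype(int)
    for i, k in enumerate(ks):
        ab_k = alpha_bars[k]
        ab_prev = alpha_bars[ks[i + 1]] if i + 1 < len(ks) else 1.0
        eps_hat = eps_model(x, k, obs)
        x0_hat = (x - np.sqrt(1.0 - ab_k) * eps_hat) / np.sqrt(ab_k)
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x                            # the denoised action chunk

# Stand-in model for shape-checking the sampler.
chunk = ddim_sample(lambda x, k, o: np.zeros_like(x), obs=None)
```

Ten jumps through a 100-step schedule is a 10x reduction in denoiser forward passes per chunk, which is the difference between sub-10 Hz and real-time control loops.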

When Diffusion Policy shines

- Demonstrations that contain multiple valid strategies for the same task (side grasp vs. top grasp)
- Contact-rich, single-arm manipulation
- Growing datasets: performance keeps scaling as demonstrations climb past 100–200

ACT: the simple transformer that stole the bimanual crown

ACT was introduced with the original ALOHA bimanual hardware as its reference imitation learning stack, and Mobile ALOHA later reused it largely unchanged. The architecture is deliberately simple: a ResNet encodes each camera view, a transformer encoder fuses them with proprioception, and a transformer decoder outputs a chunk of 50–100 future actions in a single parallel pass, conditioned on a learned CVAE latent. At inference time, you execute the chunk with temporal ensembling (averaging overlapping predictions), which smooths the output and reduces compounding error.

ACT's magic is the chunk length and the ensembling. By predicting far-future actions, it avoids the classical "compounding error" problem of step-by-step imitation learning. By ensembling overlapping chunks, it averages out high-frequency noise. The result: on bimanual teleop tasks (cup stacking, velcro strapping, battery insertion), ACT regularly succeeds with 50 demonstrations, whereas a standard behavior cloning baseline would need thousands. ACT is the default starting policy for ALOHA rigs, and the LeRobot framework ships a canonical ACT implementation.
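The ensembling step is small enough to write out. This is a minimal numpy sketch, following the ACT paper's convention that w_i = exp(-m * i) with w_0 on the oldest prediction, so older chunks get the largest weight and m controls how fast that decays; the function name is illustrative.

```python
import numpy as np

def ensembled_action(chunks, t, m=0.01):
    """chunks: list of (t_start, chunk) pairs, each chunk an array of
    shape (chunk_len, action_dim). Returns the ensembled action for
    control step t, averaging every live chunk's prediction for t."""
    # Keep only chunks whose prediction horizon covers step t.
    live = [(s, c) for s, c in chunks if 0 <= t - s < len(c)]
    acts = np.array([c[t - s] for s, c in live])
    # ACT convention: oldest prediction (smallest t_start) gets weight
    # exp(0) = 1; newer predictions decay as exp(-m * offset).
    s_min = min(s for s, _ in live)
    w = np.array([np.exp(-m * (s - s_min)) for s, _ in live])
    w /= w.sum()
    return (acts * w[:, None]).sum(axis=0)

# Two overlapping predictions for step t=2: with m=0 they are
# averaged with equal weight.
a = ensembled_action([(0, np.ones((4, 2))), (2, np.zeros((4, 2)))], t=2, m=0.0)
```

Because every executed action is a weighted average over several independently predicted chunks, a single bad prediction is diluted rather than executed verbatim.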

When ACT shines

- Bimanual teleop on ALOHA- or G1-style hardware
- Small, consistent datasets: roughly 50 demonstrations is often enough
- Edge deployment, where a fast single forward pass at 30–50 Hz matters
- Fast iteration: training converges in minutes to hours on a single GPU

Data requirements

Both architectures are data-efficient by modern VLA standards — you are training from scratch on a narrow task, not fine-tuning a 7B model. In practice, 50 demonstrations will get ACT to a reasonable success rate on a bimanual manipulation task, and Diffusion Policy typically needs 100–200 demonstrations to match ACT on multi-modal tasks (and then exceed it as demonstrations scale). If you have not collected data yet, SVRC's teleoperation services capture the 20–50 Hz bimanual recordings both architectures expect, and our ALOHA dataset page has ready-made starting corpora.

Inference and deployment

ACT's single forward pass makes it the easier deployment target. Chunk execution with temporal ensembling runs comfortably at 30–50 Hz on a laptop GPU. Diffusion Policy's iterative denoising is the harder case — but the community has converged on DDIM with 5–10 steps or consistency-distilled variants, which bring it to competitive frequencies. If you are deploying on an edge device, lean toward ACT; if you are running on a workstation with a 4090 or better, either is fine.

Honest tradeoffs

ACT's elegance is also its limit. On tasks with genuinely multi-modal demonstrations, ACT's CVAE can mode-collapse to the average strategy, producing timid or indecisive behavior. Diffusion Policy's distributional modeling handles this gracefully but adds inference complexity and training time. A useful heuristic: if your demonstrations are consistent, start with ACT; if your demonstrations are diverse, start with Diffusion Policy. You can always re-train with the other if the first pass disappoints.

Both algorithms are also "just" imitation learning — they do not generalize across embodiments the way a foundation VLA does. If you need cross-robot transfer, pair one of these with OpenVLA or Octo and use ACT/Diffusion Policy as the embodiment-specific action head.

Benchmarks to consult

Both architectures appear on LIBERO and in the Push-T / Robomimic suites. The ACT vs Diffusion Policy decision guide on our blog goes deeper into suite-level comparison.

Implementation details that actually matter

Both algorithms are simpler than most VLA papers would suggest, but the details that determine success or failure often get lost in the abstract. For Diffusion Policy, the three most common pitfalls are: (1) under-training the denoiser — teams stop at 50K steps when 200K is routinely needed, (2) mis-tuning the action normalization, especially when actions include both position and gripper bits, and (3) not using EMA weights at inference. Get those three right and Diffusion Policy "just works" on a surprising range of tasks.
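The second and third pitfalls are cheap to get right. Here is a minimal sketch of both, assuming per-dimension min-max statistics fit on the demo set; the class and function names are illustrative, not the reference implementation's.

```python
import numpy as np

class ActionNormalizer:
    """Per-dimension min-max normalization to [-1, 1], fit on the demo
    dataset. Mixing raw joint positions (radians) with 0/1 gripper bits
    without this skews the diffusion loss toward large-range dims.
    Assumes every action dimension actually varies in the dataset."""
    def fit(self, actions):                 # actions: (N, D)
        self.lo = actions.min(axis=0)
        self.hi = actions.max(axis=0)
        return self
    def normalize(self, a):
        return 2.0 * (a - self.lo) / (self.hi - self.lo) - 1.0
    def denormalize(self, a):
        return (a + 1.0) / 2.0 * (self.hi - self.lo) + self.lo

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of the network weights. Sample from
    the EMA copy at inference, not the raw training weights."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

The EMA copy is updated after every optimizer step and costs one extra set of weights in memory; forgetting to swap it in at inference is the most silent of the three failures, because the policy still runs, just worse.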

For ACT, the critical knobs are chunk length and temporal ensembling gain. Short chunks (under 20 steps) make ACT behave like vanilla behavior cloning and reintroduce compounding error. Very long chunks (over 200) make the policy brittle to perturbations. The ALOHA default of 100 steps with exponentially weighted ensembling is a strong starting point — tune down only if you see jitter or up only if you see long-horizon drift.

Combining the two with a foundation VLA

In 2026, a growing number of production stacks do not pick only ACT or only Diffusion Policy — they use one as the low-level action head behind a foundation VLA. The pattern: a VLA like OpenVLA parses the natural-language instruction into a scene-aware sub-goal, and an ACT or Diffusion Policy head executes the fine-grained motion. This is how most of the best performing CALVIN and LIBERO submissions in 2025 were structured, and it reflects a maturing division of labor in the field: foundation models for language and abstraction, imitation learning policies for dexterous execution.
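The division of labor can be sketched as an interface. Everything below is hypothetical: neither OpenVLA nor LeRobot exposes this exact API; the sketch only shows where the boundary between the two components sits.

```python
from dataclasses import dataclass
from typing import Protocol
import numpy as np

@dataclass
class SubGoal:
    """Scene-aware sub-goal emitted by the VLA (illustrative fields)."""
    description: str          # e.g. "grasp the mug from the side"
    target_pose: np.ndarray   # goal pose in the robot frame

class HighLevelPlanner(Protocol):
    """Foundation VLA: language + abstraction."""
    def plan(self, instruction: str, image: np.ndarray) -> SubGoal: ...

class ActionHead(Protocol):
    """Embodiment-specific policy: ACT or Diffusion Policy."""
    def act(self, goal: SubGoal, obs: np.ndarray) -> np.ndarray: ...

def control_step(planner: HighLevelPlanner, head: ActionHead,
                 instruction: str, image: np.ndarray,
                 obs: np.ndarray) -> np.ndarray:
    """One tick of the stacked controller: the VLA interprets the
    instruction, the imitation policy produces the action chunk."""
    goal = planner.plan(instruction, image)
    return head.act(goal, obs)
```

The design choice worth noting: the VLA runs at sub-goal frequency (a few Hz at most) while the action head runs at control frequency, so the slow foundation model never sits in the tight loop.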

Cost and engineering effort

Neither algorithm is expensive by modern standards. A single 24 GB GPU (RTX 4090 or A6000) will train either ACT or Diffusion Policy on a 50-demo dataset in under a day. ACT usually converges faster in wall-clock terms because each step is a single transformer forward-backward; Diffusion Policy's multi-step loss is heavier but converges in fewer epochs. Expect 4–12 hours of total training time for a decent policy on a narrow task — with the caveat that data collection and curation takes far longer than training itself.

Our recommendation

For bimanual teleop tasks — especially on ALOHA or G1 style hardware — start with ACT. You will be running in an afternoon and the baseline is hard to beat with 50 demos. For single-arm manipulation with varied demonstrations or contact-rich tasks, start with Diffusion Policy. When in doubt, train both — they are cheap enough to run in parallel, and seeing both succeed or both fail tells you something important about your dataset.