Why this comparison matters
When you collect your own teleop demonstrations — on an OpenArm, ALOHA, or Unitree G1 — you will train either Diffusion Policy or ACT long before you touch a foundation VLA. These two architectures define the current default for pure imitation learning: both use action chunking, both are MIT-licensed, both have reference implementations in LeRobot, and both are routinely cited as baselines in 2024–2026 robot learning papers.
But they have very different inductive biases. If you pick the wrong one for your demonstration distribution, your policy will either collapse to the mean (bad) or memorize the demos (also bad). This page walks through when each shines, where each fails, and how to choose.
At-a-glance comparison
| Dimension | Diffusion Policy | ACT (Action Chunking Transformer) |
|---|---|---|
| Origin | Chi et al., Columbia, RSS 2023 | Zhao et al., Stanford (ALOHA), RSS 2023 |
| Architecture | CNN or transformer backbone + conditional diffusion (DDPM) action head | ResNet visual encoder + transformer encoder-decoder with CVAE latent |
| Action representation | Chunk of future actions denoised from Gaussian noise | Chunk of future actions decoded in one parallel forward pass (DETR-style queries) |
| Training objective | Denoising score matching / DDPM loss | Reconstruction + KL regularizer (CVAE) |
| Typical chunk size | 8–16 steps | 50–100 steps (aggressive chunking) |
| Multi-modal capture | Strong — native to the diffusion formulation | Partial — latent captures some multi-modality but can mode-collapse |
| Data needed | ~50–200 demos for narrow tasks; more helps | ~50 demos often enough for bimanual teleop tasks |
| Training time | Slower (diffusion steps increase wall-clock) | Fast — minutes to hours on a single GPU |
| Inference speed | Needs iterative denoising; DDIM or consistency models help | Fast single forward pass; chunked execution amortizes |
| Best-known wins | Push-T, real-world manipulation with multi-modal demos (+46.9% vs prior methods) | Mobile ALOHA bimanual tasks, dexterous bimanual manipulation |
| License | MIT | MIT |
| Code | github.com/real-stanford/diffusion_policy | github.com/tonyzhaozh/act |
Diffusion Policy: modeling the distribution, not the mean
The Diffusion Policy insight is that expert demonstrations are almost always multi-modal. A human teleoperator grasping a mug might use a side grasp or a top grasp on different trials; a policy trained with a standard MSE regression loss will average the two and produce neither. Diffusion Policy sidesteps this by modeling the full distribution of actions via a denoising diffusion probabilistic model (DDPM). At inference time, you sample a noise vector and iteratively denoise it into an action chunk conditioned on visual observations.
The original Diffusion Policy paper reported a 46.9% average improvement over the previous best imitation learning methods across 15 tasks, and the architecture became a go-to backbone for downstream work like Octo and 3D Diffusion Policy. The cost is inference: a naive implementation runs K diffusion steps per chunk, which can push control loops below 10 Hz. Most production deployments use DDIM with 5–10 steps, or a consistency-distilled variant, to get real-time control.
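The sampling loop is easier to see in code than in prose. Below is a toy deterministic DDIM sampler with a stand-in denoiser that pretends the expert always outputs the constant action 0.5; a real policy replaces it with a trained, observation-conditioned network, and all names here are illustrative:

```python
import numpy as np

N_TRAIN_STEPS = 100
# Cosine noise schedule: ALPHAS_BAR[t] is the signal fraction left at step t
ALPHAS_BAR = np.cos(np.linspace(0.0, np.pi / 2, N_TRAIN_STEPS + 1)) ** 2

def ddim_sample(denoiser, obs, chunk_shape, n_steps=10, seed=0):
    """Deterministic DDIM sampling: iteratively denoise Gaussian noise
    into an action chunk, calling the denoiser n_steps times."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(chunk_shape)                  # start from pure noise
    ts = np.linspace(N_TRAIN_STEPS, 0, n_steps + 1).astype(int)
    for t, t_next in zip(ts[:-1], ts[1:]):
        eps = denoiser(x, t, obs)                         # predicted noise
        x0 = (x - np.sqrt(1 - ALPHAS_BAR[t]) * eps) / np.sqrt(ALPHAS_BAR[t])
        x = np.sqrt(ALPHAS_BAR[t_next]) * x0 + np.sqrt(1 - ALPHAS_BAR[t_next]) * eps
    return x

def toy_denoiser(x, t, obs):
    """Stand-in for the trained network: pretends the expert action chunk
    is the constant 0.5 everywhere and returns the implied noise."""
    return (x - np.sqrt(ALPHAS_BAR[t]) * 0.5) / max(np.sqrt(1 - ALPHAS_BAR[t]), 1e-8)

chunk = ddim_sample(toy_denoiser, obs=None, chunk_shape=(16, 7))  # 16 steps x 7 DoF
```

With a real policy, `denoiser` is the trained noise-prediction network conditioned on image and proprioception features, and `n_steps` is the knob that trades sample quality against control-loop latency.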
When Diffusion Policy shines
- Demonstrations contain multiple valid strategies (grasp types, approach directions, hand-off variations).
- Contact-rich manipulation where small action perturbations compound — see contact-rich manipulation guide.
- You care about robustness to perturbations and have budget for iterative denoising inference.
- You want the policy to scale with more data rather than plateauing.
ACT: the simple transformer that stole the bimanual crown
ACT was introduced alongside the ALOHA bimanual hardware (and later reused in Mobile ALOHA) as the reference imitation learning stack. The architecture is deliberately simple: a ResNet encodes each camera view, a transformer encoder fuses them with proprioception, and a transformer decoder outputs a chunk of 50–100 future actions in a single forward pass, conditioned on a learned CVAE latent. At inference time, you execute the chunk with temporal ensembling (averaging overlapping predictions), which smooths the output and reduces compounding error.
ACT's magic is the chunk length and the ensembling. By predicting far-future actions, it avoids the classical "compounding error" problem of step-by-step imitation learning. By ensembling overlapping chunks, it averages out high-frequency noise. The result: on bimanual teleop tasks (cup stacking, velcro strapping, battery insertion), ACT regularly succeeds with 50 demonstrations, whereas a standard behavior cloning baseline would need thousands. ACT is the default starting policy for ALOHA rigs, and the LeRobot framework ships a canonical ACT implementation.
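The ensembling step fits in a few lines. The sketch below follows the exponential weighting scheme described in the ACT paper, where the oldest prediction for a given timestep gets the highest weight; the helper names are mine, not the reference API:

```python
import numpy as np

def ensemble_action(chunk_buffer, t, m=0.01):
    """Temporal ensembling: average every stored chunk's prediction for
    timestep t, weighted by exp(-m * i) with i = 0 for the oldest chunk.

    chunk_buffer: list of (start_step, chunk) pairs, oldest first;
    each chunk has shape (chunk_len, dof).
    """
    preds = []
    for start, chunk in chunk_buffer:
        offset = t - start
        if 0 <= offset < len(chunk):          # this chunk covers timestep t
            preds.append(chunk[offset])
    preds = np.stack(preds)                   # (n_overlapping, dof)
    w = np.exp(-m * np.arange(len(preds)))    # oldest prediction weighted highest
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# Two overlapping 4-step chunks for a 1-DoF action
buf = [(0, np.full((4, 1), 1.0)),    # chunk predicted at step 0
       (2, np.full((4, 1), 3.0))]    # chunk predicted at step 2
a = ensemble_action(buf, t=2, m=0.0)  # equal weights: mean of 1.0 and 3.0
```

The gain `m` controls how quickly newer predictions take over: `m = 0` averages everything equally, larger `m` trusts the earliest prediction.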
When ACT shines
- Bimanual teleoperation data — ACT was designed for this and it shows.
- You need to train fast and iterate on task design rather than wait for diffusion to converge.
- Control frequency matters and you do not want iterative denoising in your loop.
- Demonstrations are relatively consistent (a single preferred strategy).
Data requirements
Both architectures are data-efficient by modern VLA standards — you are training from scratch on a narrow task, not fine-tuning a 7B model. In practice, 50 demonstrations will get ACT to a reasonable success rate on a bimanual manipulation task, and Diffusion Policy typically needs 100–200 demonstrations to match ACT on multi-modal tasks (and then exceed it as demonstrations scale). If you have not collected data yet, SVRC's teleoperation services capture the 20–50 Hz bimanual recordings both architectures expect, and our ALOHA dataset page has ready-made starting corpora.
Inference and deployment
ACT's single forward pass makes it the easier deployment target. Chunk execution with temporal ensembling runs comfortably at 30–50 Hz on a laptop GPU. Diffusion Policy's iterative denoising is the harder case — but the community has converged on DDIM with 5–10 steps or consistency-distilled variants, which bring it to competitive frequencies. If you are deploying on an edge device, lean toward ACT; if you are running on a workstation with a 4090 or better, either is fine.
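A quick back-of-envelope calculation, using illustrative latencies rather than measurements, shows why the denoising step count dominates the control rate:

```python
# Illustrative latencies; measure on your own hardware.
denoise_step_ms = 8.0  # one denoiser forward pass (assumed)
exec_horizon = 8       # actions executed per predicted chunk before replanning

def amortized_hz(n_denoise_steps):
    """Amortized control rate when chunk planning time dominates the loop."""
    plan_ms = n_denoise_steps * denoise_step_ms
    return exec_horizon / (plan_ms / 1000.0)

ddpm_hz = amortized_hz(100)  # naive DDPM, 100 steps: ~10 Hz
ddim_hz = amortized_hz(10)   # DDIM, 10 steps: ~100 Hz
```

Cutting denoising steps 10x buys a 10x higher amortized rate, which is exactly the DDPM-to-DDIM gap described above.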
Honest tradeoffs
ACT's elegance is also its limit. On tasks with genuinely multi-modal demonstrations, ACT's CVAE can mode-collapse to the average strategy, producing timid or indecisive behavior. Diffusion Policy's distributional modeling handles this gracefully but adds inference complexity and training time. A useful heuristic: if your demonstrations are consistent, start with ACT; if your demonstrations are diverse, start with Diffusion Policy. You can always re-train with the other if the first pass disappoints.
Both algorithms are also "just" imitation learning — they do not generalize across embodiments the way a foundation VLA does. If you need cross-robot transfer, pair one of these with OpenVLA or Octo and use ACT/Diffusion Policy as the embodiment-specific action head.
Benchmarks to consult
Both architectures appear on LIBERO and in the Push-T / Robomimic suites. The ACT vs Diffusion Policy decision guide on our blog goes deeper into suite-level comparison.
Implementation details that actually matter
Both algorithms are simpler than most VLA papers would suggest, but the details that determine success or failure often get lost in the abstract. For Diffusion Policy, the three most common pitfalls are: (1) under-training the denoiser — teams stop at 50K steps when 200K is routinely needed, (2) mis-tuning the action normalization, especially when actions include both position and gripper bits, and (3) not using EMA weights at inference. Get those three right and Diffusion Policy "just works" on a surprising range of tasks.
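For pitfall (2), the usual fix is per-dimension min-max scaling of actions to [-1, 1], fit once over the demo dataset. A minimal sketch (helper names are mine, not the Diffusion Policy API):

```python
import numpy as np

def fit_minmax(actions, eps=1e-8):
    """Per-dimension min/max statistics over the demo dataset (shape (N, dof))."""
    lo = actions.min(axis=0)
    span = np.maximum(actions.max(axis=0) - lo, eps)  # guard constant dimensions
    return lo, span

def normalize(a, lo, span):
    return 2.0 * (a - lo) / span - 1.0  # each dimension mapped to [-1, 1]

def denormalize(a, lo, span):
    return (a + 1.0) / 2.0 * span + lo  # invert before sending to the robot

# Joint positions (radians) alongside a near-binary gripper channel: without
# per-dimension scaling, one channel's range would dominate the loss.
demo = np.array([[0.1, 0.0],
                 [0.9, 1.0],
                 [0.5, 1.0]])
lo, span = fit_minmax(demo)
n = normalize(demo, lo, span)
```

Compute the statistics once over the whole dataset and ship them with the checkpoint; a train/deploy mismatch here silently degrades success rates.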
For ACT, the critical knobs are chunk length and temporal ensembling gain. Short chunks (under 20 steps) make ACT behave like vanilla behavior cloning and reintroduce compounding error. Very long chunks (over 200) make the policy brittle to perturbations. The ALOHA default of 100 steps with exponentially weighted ensembling is a strong starting point — tune down only if you see jitter or up only if you see long-horizon drift.
Combining the two with a foundation VLA
In 2026, a growing number of production stacks do not pick only ACT or only Diffusion Policy — they use one as the low-level action head behind a foundation VLA. The pattern: a VLA like OpenVLA parses the natural-language instruction into a scene-aware sub-goal, and an ACT or Diffusion Policy head executes the fine-grained motion. This is how most of the best performing CALVIN and LIBERO submissions in 2025 were structured, and it reflects a maturing division of labor in the field: foundation models for language and abstraction, imitation learning policies for dexterous execution.
Cost and engineering effort
Neither algorithm is expensive by modern standards. A single 24 GB GPU (RTX 4090 or A6000) will train either ACT or Diffusion Policy on a 50-demo dataset in under a day. ACT usually converges faster in wall-clock terms because each step is a single transformer forward-backward; Diffusion Policy's multi-step loss is heavier but converges in fewer epochs. Expect 4–12 hours of total training time for a decent policy on a narrow task — with the caveat that data collection and curation take far longer than training itself.
Our recommendation
For bimanual teleop tasks — especially on ALOHA or G1 style hardware — start with ACT. You will be running in an afternoon and the baseline is hard to beat with 50 demos. For single-arm manipulation with varied demonstrations or contact-rich tasks, start with Diffusion Policy. When in doubt, train both — they are cheap enough to run in parallel, and seeing both succeed or both fail tells you something important about your dataset.