Quick Summary

ACT (Action Chunking with Transformers, Zhao et al. 2023) uses a CVAE to produce action chunks: sequences of 20-100 future actions predicted at once. It is fast (50 ms inference on a single GPU), deterministic given a latent sample, and works well with 50-500 demonstrations. Diffusion Policy (Chi et al. 2023) uses a conditional denoising diffusion model over the action space. It is slower (200-500 ms with DDPM, 50 ms with consistency distillation), explicitly multi-modal, and, given more data, handles precision and multi-step tasks better.

Inference Latency: The Practical Constraint

For real-time robot control, inference latency is a hard constraint that often determines algorithm choice before any other consideration.

| Algorithm | Inference Time | Control Hz | Notes |
| --- | --- | --- | --- |
| ACT | 50 ms | 20 Hz | Executes 20-step chunks; effective inference rate is much lower |
| Diffusion Policy (DDPM) | 500 ms | 2 Hz | Too slow for reactive tasks without chunking |
| Diffusion Policy (DDIM, 10 steps) | 200 ms | 5 Hz | Acceptable for slow manipulation |
| Consistency Policy | 50 ms | 20 Hz | Single denoising step; matches ACT latency |
| ACT + temporal ensemble | 50 ms | 20 Hz | Smoothing across multiple chunk predictions |

If your task requires reactive high-frequency control (catching, pouring, high-speed assembly), standard Diffusion Policy DDPM is simply too slow. Use ACT or Consistency Policy. If your task involves slow, precision manipulation where 200ms latency is acceptable, DDIM Diffusion Policy is viable and may produce better results.
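The "ACT + temporal ensemble" row works by querying the policy every step and averaging all chunk predictions that cover the current timestep. A minimal NumPy sketch, assuming the exponential weighting w_i = exp(-m * i) described in the ACT paper (the function name and data layout here are ours, not the reference implementation's):

```python
import numpy as np

def temporal_ensemble(actions, m=0.01):
    """Average the overlapping chunk predictions for the current timestep.

    `actions` holds every chunk's prediction for this step, ordered
    oldest-first. Weights w_i = exp(-m * i), with i = 0 the oldest
    prediction, are normalized to sum to 1; larger m trusts older
    predictions more, smaller m incorporates new observations faster.
    """
    actions = np.stack(actions)                      # (N, action_dim)
    weights = np.exp(-m * np.arange(len(actions)))   # (N,)
    weights /= weights.sum()
    return weights @ actions                         # weighted mean action
```

Because every chunk overlapping the current step contributes, the executed trajectory is smoothed without lowering the 50 ms per-query latency.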

Task Type Recommendations

  • Fast reactive tasks (catching, stirring, high-speed pick-place): ACT or Consistency Policy. The 20Hz control rate allows real-time adaptation. Diffusion DDPM at 2Hz cannot respond to moving objects.
  • Precision tasks with multiple viable solutions (peg-in-hole, USB insertion, cloth folding): Diffusion Policy. Its explicit multi-modality lets it model the full distribution over successful strategies rather than their mean, which matters because the mean strategy often fails on precision tasks.
  • Long-horizon tasks (more than ~10 subtasks): neither algorithm alone is sufficient. Use a hierarchical policy: a high-level task planner selects the subtask sequence, and ACT or Diffusion Policy executes each subtask.
  • Tasks where object position varies significantly: Both work, but Diffusion Policy tends to generalize better to novel object positions when trained with sufficient data.
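The hierarchical scheme from the long-horizon bullet reduces to a simple outer loop. A skeletal sketch; every callable here (`plan`, `execute_subtask`, `observe`) is a hypothetical placeholder for your planner and low-level policy:

```python
from typing import Callable, List, Optional

def run_hierarchical_episode(plan: Callable, execute_subtask: Callable,
                             observe: Callable, max_subtasks: int = 10) -> List[str]:
    """Outer loop of a hierarchical policy: a high-level planner emits
    subtask names, and a low-level chunked policy (ACT or Diffusion
    Policy) executes each one until the planner signals completion."""
    completed: List[str] = []
    for _ in range(max_subtasks):
        obs = observe()
        subtask: Optional[str] = plan(obs, completed)  # e.g. "grasp_mug"
        if subtask is None:                            # planner says: done
            break
        execute_subtask(subtask, obs)                  # low-level policy rollout
        completed.append(subtask)
    return completed
```

The key design point is that the low-level policy only ever sees a single short-horizon subtask, which is the regime where both ACT and Diffusion Policy perform well.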

Training Data Requirements

| Algorithm | Minimum Demos | Recommended | Training Time (single GPU) |
| --- | --- | --- | --- |
| ACT | 50 | 100–500 | 2–4 hours |
| Diffusion Policy (DDPM) | 200 | 500–2000 | 6–12 hours |
| Consistency Policy | 100 | 300–1000 | 4–8 hours |

ACT's lower data requirement is a genuine advantage for new task exploration where collecting 500+ demos is expensive. However, Diffusion Policy often catches up and surpasses ACT when more data is available — particularly for precision tasks. If you have a data budget above 1,000 demonstrations, seriously evaluate Diffusion Policy.

Hyperparameter Sensitivity

ACT's critical hyperparameter is the KL divergence weight in the CVAE loss. Too high: the posterior collapses to the prior, the latent carries no information, and the policy regresses to the mean action. Too low: the decoder leans on the latent to memorize demonstrations, and test-time performance degrades because z is then sampled from the prior. Standard recommendation: start at 10 and sweep 1, 5, 10, 50 on a small dataset before full training.
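For reference, the objective being weighted looks like the NumPy sketch below: L1 reconstruction over the predicted action chunk plus the closed-form KL between the diagonal-Gaussian posterior and a standard normal. The function name and reductions are ours, not ACT's implementation:

```python
import numpy as np

def act_cvae_loss(pred_actions, target_actions, mu, logvar, kl_weight=10.0):
    """ACT-style CVAE objective: L1 chunk reconstruction plus a weighted
    KL(q(z|x) || N(0, I)) term. `kl_weight` is the hyperparameter to
    sweep (e.g. 1, 5, 10, 50); `mu`/`logvar` parameterize the posterior."""
    recon = np.abs(pred_actions - target_actions).mean()
    # Closed-form KL between a diagonal Gaussian and the standard normal
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + kl_weight * kl
```

Sweeping `kl_weight` just rescales the second term, so the cheap small-dataset sweep is usually enough to locate the stable region.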

Diffusion Policy is more sensitive to learning rate schedule and noise variance schedule. The original DDPM implementation uses a cosine noise schedule with linear warmup LR; these defaults work well in practice. The most common mistake is using too high a learning rate (>3e-4 with Adam), which causes training instability on contact-rich data.
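As a sketch of those defaults: the squared-cosine cumulative noise schedule from Nichol & Dhariwal (2021) that DDPM implementations commonly use, plus a generic linear LR warmup. The specific constants and function names here are illustrative, not taken from the Diffusion Policy repo:

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Squared-cosine cumulative schedule: alpha_bar(t) =
    cos^2(((t/T + s) / (1 + s)) * pi/2), normalized so alpha_bar(0) = 1.
    It decays smoothly from 1 (no noise) to ~0 (pure noise) at t = T."""
    f = lambda u: math.cos(((u / T + s) / (1 + s)) * math.pi / 2) ** 2
    return f(t) / f(0)

def warmup_lr(step, base_lr=1e-4, warmup_steps=500):
    """Linear warmup to base_lr, then constant. Staying well below the
    ~3e-4 instability threshold mentioned above is the safe default."""
    return base_lr * min(1.0, step / warmup_steps)
```

The warmup matters because early denoising gradients are noisy; ramping the LR avoids the instability that a cold-start 3e-4 Adam run exhibits on contact-rich data.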

Decision Matrix

| Condition | Recommended Algorithm |
| --- | --- |
| < 200 demos available | ACT |
| Reactive / high-speed task | ACT or Consistency Policy |
| Precision with multi-modal solutions | Diffusion Policy |
| > 500 demos, slow task | Diffusion Policy |
| Deployment on low-compute hardware | ACT (lighter model) |
| Want to use HuggingFace LeRobot | Both supported |

Both algorithms are available as reference implementations in the SVRC data platform, with pre-built training configs for OpenArm 101 and common camera setups.