Diffusion Policy for Robot Learning: How It Works, When to Use It (2026)

Diffusion policy is the leading imitation learning algorithm for contact-rich and dexterous robot manipulation tasks. It consistently outperforms standard behavioral cloning — and often beats ACT — on tasks with multimodal demonstrations. This guide explains how it works, how it compares to ACT and BC, which hyperparameters matter most, and what hardware to pair it with at SVRC.

What Is Diffusion Policy?

Diffusion policy, introduced by Chi et al. at Columbia University in 2023, applies the denoising framework of image-generation diffusion models (DDPM, DDIM) to robot action prediction. Instead of regressing an action or action sequence directly from an observation in one forward pass, the policy learns to iteratively denoise a random action trajectory into a coherent, high-quality motion plan. In practice, the resulting policy handles the multimodal nature of human demonstration data significantly better than classical behavioral cloning.

The core insight is simple: when you collect robot demonstrations from human operators, different operators often complete the same task in qualitatively different ways — they might approach an object from the left side or the right, push before pulling, or grip from the top versus the side. These are all valid, task-completing behaviors. A standard behavioral cloning model trained with mean squared error will average across these modes and produce an action that is "in between" two valid behaviors — which is typically invalid itself. Diffusion policy, because it models the full conditional distribution of actions given observations, can represent and sample from each mode independently.

The paper demonstrated diffusion policy on 12 simulated and real-robot manipulation tasks, achieving state-of-the-art results on 11 of them. The real-robot tasks included pushing a T-shaped block, a bimanual sauce-spreading task on a tortilla, and a precision tool-pick task with 1mm placement tolerance.

How Diffusion Policy Works

Training and inference for diffusion policy follow the standard DDPM denoising framework, adapted to action sequences instead of images.

Training

During training, a clean action trajectory a_0 (sampled from the demonstration dataset) is progressively corrupted by adding Gaussian noise across T timesteps (typically T=100) according to a noise schedule. The neural network — either a 1D temporal CNN or a transformer — is trained to predict the noise that was added at each step, given the noisy action, the denoising step index, and the current robot observation (image + proprioception). This is the standard score-matching objective. The network learns to map from (noisy action, timestep, observation) → predicted noise, and the MSE between predicted and true noise is minimized across all training pairs.
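The training step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the linear noise schedule, the helper names (`add_noise`, `noise_prediction_loss`), and the `zero_predictor` stand-in for the network are all assumptions made for the example.

```python
import math
import random

T = 100  # number of diffusion timesteps, as in the text

# Linear beta schedule (an assumption; real implementations often use other schedules).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bars[t] = product of (1 - beta_s) for s <= t: the share of signal variance left at step t.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def add_noise(clean_actions, t, rng):
    """Corrupt a flat action trajectory a_0 to noise level t; return (noisy, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean_actions]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noisy = [signal_scale * a + noise_scale * e for a, e in zip(clean_actions, noise)]
    return noisy, noise


def noise_prediction_loss(predict_noise, clean_actions, observation, rng):
    """One training sample: MSE between the predicted and the true noise."""
    t = rng.randrange(T)  # random denoising step index
    noisy, noise = add_noise(clean_actions, t, rng)
    predicted = predict_noise(noisy, t, observation)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)


def zero_predictor(noisy_actions, t, observation):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return [0.0 for _ in noisy_actions]


rng = random.Random(0)
a0 = [0.1 * i for i in range(16)]  # toy 16-step, one-dimensional action trajectory
loss = noise_prediction_loss(zero_predictor, a0, observation=None, rng=rng)
```

In a real training loop `predict_noise` would be the CNN or transformer, and the loss would be averaged over a batch and backpropagated.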

Inference

At inference time, the model starts from a pure noise vector of the same dimensionality as an action sequence. It then runs K denoising steps (typically K=10 for DDIM, K=100 for DDPM) — each step calling the neural network to predict the noise component and subtracting it, guided by the current observation. After K steps, the result is a complete action sequence (typically 16–32 steps ahead at 10Hz, or 50–100 steps at 50Hz) that represents a plausible, smooth continuation of the current state.
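The denoising loop can be sketched as follows. This is a hedged illustration of a deterministic DDIM-style sampler, assuming a linear noise schedule and evenly spaced timesteps; `ddim_sample` and `oracle_predictor` are hypothetical names. The oracle stands in for a trained network by returning the exact noise for a known target, which makes the loop easy to check.

```python
import math
import random

T = 100  # diffusion steps used in training
K = 10   # denoising steps used at inference

betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed linear schedule
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def ddim_sample(predict_noise, observation, action_dim, rng, num_steps=K):
    """Denoise Gaussian noise into an action sequence in num_steps network calls."""
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    stride = T // num_steps
    timesteps = list(range(T - stride, -1, -stride))  # 90, 80, ..., 0 for T=100, K=10
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t, observation)
        ab_t = alpha_bars[t]
        # Clean action sequence implied by the predicted noise.
        x0 = [(xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t) for xi, ei in zip(x, eps)]
        # Re-noise to the next, lower noise level; the last step returns x0 itself.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei for x0i, ei in zip(x0, eps)]
    return x


target = [0.5, -0.25, 1.0, 0.0]  # toy "demonstrated" action sequence


def oracle_predictor(noisy_actions, t, observation):
    """Hypothetical perfect network: returns the exact noise relative to `target`."""
    ab_t = alpha_bars[t]
    return [(xi - math.sqrt(ab_t) * ai) / math.sqrt(1.0 - ab_t)
            for xi, ai in zip(noisy_actions, target)]


actions = ddim_sample(oracle_predictor, observation=None,
                      action_dim=len(target), rng=random.Random(0))
```

With a perfect noise predictor the sampler recovers `target` up to floating-point error; a learned network only approximates this, which is why more denoising steps tend to help.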

The observation conditioning is the key architectural choice. Diffusion policy supports two encoder options:

CNN-based (DP-C): A ResNet-18 or ResNet-34 backbone encodes image observations into a feature vector, which is then fused with proprioceptive state and used to condition the denoising CNN. Faster to train and infer; better for single-camera setups.
Transformer-based (DP-T): A ViT-style transformer encodes image tokens and fuses them with proprioception tokens. More expressive for multi-camera setups, but slower at inference (120ms vs 40ms for DP-C on an RTX 3090).

Receding Horizon Control

A critical implementation detail: diffusion policy uses a receding horizon approach. The policy predicts an action sequence of length Tp (e.g., 16 steps), but only executes the first Ta steps (e.g., 8 steps) before re-planning. This creates a balance between temporal consistency (longer Tp = smoother motions) and reactivity (shorter Ta = faster response to unexpected changes). The ratio Tp/Ta = 2 is the standard default and works well for most manipulation tasks.
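The plan-execute-replan cycle can be sketched as a small control loop. The function and the toy planner and environment below are hypothetical stand-ins, assuming the policy is exposed as a callable that returns Tp actions per call.

```python
def run_receding_horizon(plan, step_env, first_obs, total_steps,
                         pred_horizon=16, action_horizon=8):
    """Plan pred_horizon actions, execute the first action_horizon, then re-plan."""
    obs = first_obs
    executed = []
    num_plans = 0
    while len(executed) < total_steps:
        action_seq = plan(obs, pred_horizon)  # one full denoising run
        num_plans += 1
        for action in action_seq[:action_horizon]:
            obs = step_env(obs, action)
            executed.append(action)
            if len(executed) == total_steps:
                break
    return executed, num_plans


# Toy stand-ins (hypothetical): a one-dimensional state nudged toward zero.
def toy_plan(obs, horizon):
    return [-0.1 * obs for _ in range(horizon)]


def toy_step(obs, action):
    return obs + action


executed, num_plans = run_receding_horizon(toy_plan, toy_step,
                                           first_obs=1.0, total_steps=40)
```

With Tp=16 and Ta=8, a 40-step episode needs five planning calls; the last eight actions of every plan are discarded in favor of a fresh plan from the newer observation.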

Diffusion Policy vs ACT: A Practical Comparison

Property	Diffusion Policy	ACT (Action Chunking with Transformers)
Multimodal action distributions	Excellent — models full distribution	Good — CVAE captures variance but can mode-average
Inference latency (RTX 3090)	40–120ms (DDIM K=10 to K=100)	~10ms (single forward pass)
Training time (50 demos, RTX 3090)	4–8 hours (CNN), 8–16 hours (Transformer)	2–4 hours
Temporal consistency	High — denoising produces smooth trajectories	High — action chunking provides consistency
Bimanual coordination	Good with transformer variant	Excellent — natively designed for ALOHA bimanual
Dexterous hand control (20+ DOF)	Excellent — handles high-dim action spaces well	Moderate — chunk size tuning needed for high DOF
Open-source reference implementation	Yes — diffusion_policy repo (Columbia)	Yes — act repo (Stanford)
LeRobot integration	Yes — native support	Yes — native support

The headline comparison: ACT is faster at inference and easier to train, which makes it better for time-critical systems and teams with limited compute. Diffusion policy handles multimodal demonstrations more naturally and scales better to high-dimensional action spaces like dexterous hands. In practice, for most tabletop manipulation tasks with 50–200 clean demonstrations, performance is comparable. Choose diffusion policy when your demonstrations are naturally varied or your action space exceeds 14 DOF. See our ACT policy guide for a deeper dive into ACT-specific tuning.

Diffusion Policy vs Standard Behavioral Cloning

Standard behavioral cloning (BC) minimizes the MSE between predicted and demonstrated actions. This works when the demonstration dataset is unimodal — when every operator does the task the same way. The problem is that human demonstrations almost never are. When a BC policy sees an observation that appeared in demonstrations with two different subsequent actions, it averages them, producing a motion that is wrong in both cases. This is called the mode-averaging problem.
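A toy calculation makes the mode-averaging problem concrete. The numbers are invented for illustration: half of the demonstrations steer left (-1.0) around an obstacle and half steer right (+1.0) for the same observation.

```python
# Two equally common demonstrated actions for one observation.
demo_actions = [-1.0] * 50 + [1.0] * 50


def mse(prediction, targets):
    """Mean squared error of a single predicted action against all demonstrations."""
    return sum((prediction - a) ** 2 for a in targets) / len(targets)


# The single prediction that minimizes MSE is the mean of the targets.
mse_optimal = sum(demo_actions) / len(demo_actions)

mse_of_mean = mse(mse_optimal, demo_actions)  # predicting "straight ahead"
mse_of_left = mse(-1.0, demo_actions)         # committing to one valid mode
```

Here the MSE-optimal action is 0.0 (straight into the obstacle) with a loss of 1.0, while committing to the valid "left" mode scores a worse loss of 2.0. The objective itself pushes the model toward the invalid average.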

Diffusion policy avoids mode-averaging because it models the entire conditional distribution p(action | observation) rather than just the conditional mean E[action | observation]. Given the same ambiguous observation, a diffusion policy samples coherently from one mode or the other — never averaging them. This is why the original paper showed the largest improvements over BC on tasks that had the highest demonstration variance: the pushing task (operators pushed the block from different angles) and the bimanual spreading task (operators used different coordination strategies).

Practically, if demonstrated trajectories vary by less than about 5% (i.e., all operators do essentially the same thing), BC with a good backbone can match diffusion policy. If your task has natural variance, and most real tasks do, diffusion policy is worth the additional training time.

When to Choose Diffusion Policy

Choose Diffusion Policy When:

Your demonstrations are multimodal. If different operators complete the task via meaningfully different motion strategies, diffusion policy handles this better than ACT or BC.
You have a high-DOF action space. Dexterous hands (16–20+ DOF) benefit more from diffusion policy's ability to model joint correlations across the full action dimension than ACT's fixed chunk-based approach.
You need precise end-point positioning. The original paper showed diffusion policy achieves sub-2mm placement accuracy on precision insertion tasks due to the iterative refinement during denoising.
You are benchmarking against published results. Many recent papers (2024–2026) report diffusion policy as a baseline. Reproducing those numbers is easier with diffusion policy than converting to ACT.
You have more GPU budget for training. If you have multiple GPUs, diffusion policy training parallelizes well and the transformer variant achieves better results with more compute.

Choose ACT When:

Inference latency is critical. At 10ms per inference call, ACT can re-plan at up to 100Hz on modest hardware. Diffusion policy at 40–120ms per call tops out at roughly 8–25Hz for K=10–100.
You are working with ALOHA-format bimanual data. ACT was designed for ALOHA and the reference implementation requires no format conversion.
Training compute is limited. A 2-hour ACT training run versus 4–8 hours for diffusion policy matters when you are iterating quickly on task design.
You are matching to a specific prior result. If you need to reproduce Stanford's ACT numbers exactly, use ACT.
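The latency figures in this comparison translate into a simple budget check. The sketch below is plain arithmetic under stated assumptions: inference calls run back to back, and the 10Hz control rate with Ta=8 is just the example configuration used earlier in this guide. The helper names are hypothetical.

```python
def max_inference_rate_hz(latency_ms):
    """Upper bound on policy calls per second if calls run back to back."""
    return 1000.0 / latency_ms


def chunk_duration_ms(action_horizon, control_hz):
    """Wall-clock time covered by the actions executed from one plan."""
    return 1000.0 * action_horizon / control_hz


rates = {ms: max_inference_rate_hz(ms) for ms in (10, 40, 120)}
chunk_ms = chunk_duration_ms(action_horizon=8, control_hz=10)
```

The rates come out to 100Hz, 25Hz, and about 8.3Hz. A chunk of 8 actions at 10Hz covers 800ms, so even a 120ms inference call fits inside one chunk; latency becomes the binding constraint mainly when the control rate is high or the action horizon is short.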

Key Hyperparameters for Diffusion Policy

These are the parameters that actually move the needle in practice, in roughly descending order of importance:

Parameter	Default	Effect of Increasing
`pred_horizon` (Tp)	16	Smoother trajectories; slower re-planning; more memory
`action_horizon` (Ta)	8	Fewer inference calls; less reactive to perturbations
`obs_horizon` (`n_obs_steps` in the reference repo configs)	2	More temporal context; helps tasks with motion-based cues (sliding, rotating) or that need memory of recent motion
`num_diffusion_iters` (K)	10 (DDIM), 100 (DDPM)	Higher quality actions; slower inference
`noise_scheduler`	DDIM	Not a numeric setting: DDPM gives higher quality but is about 10x slower (100 vs 10 steps); DDIM is preferred for real-time use
`vision_encoder`	ResNet-18	ResNet-34 / ViT improves performance; requires more compute

The most common mistake: setting action_horizon too close to a long pred_horizon. This produces policies that over-commit to a planned trajectory and fail to react to contact events. Going too far the other way, with a very short action_horizon, costs temporal consistency and multiplies inference calls. For dexterous manipulation, the recommended starting point is Tp=16, Ta=8; tune from there based on the failure modes you observe.
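One way to keep the horizon settings honest is to validate them before launching a run. The `HorizonConfig` class below is a hypothetical helper, not part of any library; it only encodes the constraint that the executed chunk cannot be longer than the predicted one, and reports how many planning calls an episode will need.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HorizonConfig:
    pred_horizon: int = 16   # Tp: actions predicted per plan
    action_horizon: int = 8  # Ta: actions executed before re-planning
    obs_horizon: int = 2     # observation frames fed to the policy

    def __post_init__(self):
        if not 1 <= self.action_horizon <= self.pred_horizon:
            raise ValueError("action_horizon must be between 1 and pred_horizon")
        if self.obs_horizon < 1:
            raise ValueError("obs_horizon must be at least 1")

    def inference_calls(self, episode_steps):
        """Number of planning calls needed to cover an episode."""
        return math.ceil(episode_steps / self.action_horizon)


default_calls = HorizonConfig().inference_calls(200)
reactive_calls = HorizonConfig(action_horizon=4).inference_calls(200)
```

Halving `action_horizon` from 8 to 4 doubles the planning calls for a 200-step episode from 25 to 50, which is the cost side of the extra reactivity.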

# Install diffusion policy (Columbia reference implementation)
git clone https://github.com/real-stanford/diffusion_policy.git
cd diffusion_policy
conda env create -f conda_environment.yaml
conda activate robodiff

# Train the CNN (U-Net) variant on low-dimensional Push-T
# (config and task names assumed from the repo's diffusion_policy/config/ directory)
python train.py --config-name=train_diffusion_unet_lowdim_workspace task=pusht_lowdim

# Train on your own dataset (LeRobot format)
# Repo config keys: horizon = pred_horizon (Tp), n_action_steps = action_horizon (Ta),
# n_obs_steps = obs_horizon
python train.py --config-name=train_diffusion_unet_hybrid_workspace \
  task.dataset_path=/path/to/your/lerobot_dataset \
  horizon=16 \
  n_action_steps=8 \
  n_obs_steps=2 \
  training.num_epochs=3000 \
  dataloader.batch_size=256

Datasets That Work Well with Diffusion Policy

Diffusion policy performs best on datasets with these characteristics:

Smooth, continuous trajectories: Because the denoising process produces smooth outputs, demonstrations with jerky or discontinuous motions create a mismatch between the model's inductive bias and the training data. Pre-process demonstrations to remove sudden velocity jumps.
Consistent end-effector state recording: Diffusion policy in the standard CNN variant uses end-effector position+orientation+gripper as the action, not joint angles. If your dataset records joint angles, you need either a forward kinematics layer to convert them or the joint-space variant.
Multiple views when available: The transformer variant (DP-T) can use 2–3 camera views as tokens. Multi-view setups consistently outperform single-view on tasks with occlusion or depth ambiguity.
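For the smoothness requirement above, a first pass is to flag where a demonstration's velocity changes abruptly. The sketch below is one simple way to do that for a single position channel; `find_velocity_jumps` and the `max_accel` threshold are hypothetical and would need tuning per robot and per task.

```python
def find_velocity_jumps(positions, dt, max_accel):
    """Return velocity indices i where velocity i differs from velocity i-1
    by more than max_accel * dt. Velocity i spans positions[i] to positions[i + 1]."""
    velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    jumps = []
    for i in range(1, len(velocities)):
        if abs(velocities[i] - velocities[i - 1]) > max_accel * dt:
            jumps.append(i)
    return jumps


# Toy trajectory sampled at 10Hz with one sudden 7cm step in the middle.
positions = [0.0, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12]
jumps = find_velocity_jumps(positions, dt=0.1, max_accel=2.0)
```

The spike is flagged twice, once on entry and once on exit. Flagged segments can then be smoothed, re-recorded, or dropped, depending on how much of the episode they affect.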

Published Benchmark Datasets Compatible with Diffusion Policy

Dataset	Tasks	Notes
Push-T (Columbia)	T-block pushing	Reference benchmark; highly multimodal; download from diffusion_policy repo
RoboMimic	Can, Lift, Square, Transport	Standard benchmark; MH (multi-human) subset tests multimodal handling best
ALOHA / ACT datasets	Bimanual tabletop tasks	Requires joint-space conversion; DP-T performs comparably to ACT
DROID (multi-institution collaboration)	76,000 diverse robot episodes	Large-scale; good for fine-tuning pretrained diffusion models
SVRC custom datasets	Varies by project	Delivered in LeRobot format; compatible with diffusion_policy repo out of the box

SVRC Hardware Recommendations for Diffusion Policy

Dexterous Tasks: Wuji Hand

For tasks that require finger-level dexterity (in-hand manipulation, grasp adjustment, object reorientation), the Wuji Hand is the strongest platform for diffusion policy at SVRC. With 20 active DOF and 768-point tactile sensing per fingertip, it provides both a rich action space and a rich observation space that diffusion policy's transformer variant can exploit fully. In our internal benchmarks, DP-T on the Wuji Hand outperforms ACT on tasks with contact-rich finger interactions by 18–25 percentage points in success rate.

The 768-point tactile array produces a 768-dimensional observation vector per hand, which we flatten and tokenize for the DP-T transformer. Training time increases by roughly 40% versus a pure visual policy, but success rate on contact-sensitive tasks like pin insertion, cable routing, and surface inspection improves substantially.

Tabletop Manipulation: OpenArm Base

For standard tabletop pick-and-place, assembly, and tool-use tasks, the OpenArm Base is the most cost-effective platform for diffusion policy work. It runs at 50Hz joint-position control, supports the standard 6-DOF + gripper action space that diffusion policy's reference implementation targets, and exports data natively in LeRobot format. The CNN variant of diffusion policy trains in 4–6 hours on a single RTX 4090 using OpenArm datasets of 100 demonstrations.

For bimanual OpenArm setups (two arms on a shared mounting frame), the transformer variant is recommended — the additional cross-attention across the joint state vectors from both arms captures bimanual coordination patterns that the CNN variant misses.

Compute Requirements

Configuration	Minimum	Recommended
DP-C (CNN, single camera)	RTX 3060 12GB, 32GB RAM	RTX 3090 24GB, 64GB RAM
DP-T (Transformer, 3 cameras)	RTX 4080 16GB, 64GB RAM	RTX 4090 24GB or A100 40GB
DP-T + Wuji tactile (768D obs)	RTX 4090 24GB, 128GB RAM	A100 80GB or 2x RTX 4090

SVRC's data services include access to A100 training nodes for diffusion policy runs that exceed local compute capacity. Managed training jobs are scheduled on-demand — contact us to discuss GPU allocation for large dataset runs.

Frequently Asked Questions

What is diffusion policy in robotics?

Diffusion policy is an imitation learning algorithm that treats action prediction as a denoising diffusion process. At inference time, it starts from random Gaussian noise and iteratively refines it into a coherent action sequence conditioned on the current robot observation. This approach can represent multimodal action distributions — multiple valid ways to complete a task — which standard behavioral cloning and even ACT struggle with. It was introduced by Chi et al. at Columbia University in 2023.

When should I use diffusion policy instead of ACT?

Diffusion policy is the better choice when your task has multimodal demonstrations (operators completing the same task in meaningfully different ways), when you need precise end-point control, or when your demonstrations have high variance. ACT is faster at inference and requires less compute for training, making it better for latency-constrained applications and teams with limited GPU resources. For most tabletop manipulation tasks with 50–200 demonstrations, ACT and diffusion policy perform comparably. See our ACT vs diffusion policy guide for a detailed breakdown.

How long does it take to train diffusion policy?

Training CNN-based diffusion policy on a standard tabletop dataset (50–200 episodes) takes 4–8 hours on a single RTX 3090 or RTX 4090. The transformer-based variant takes 8–16 hours for the same dataset size. Training time scales roughly linearly with dataset size. SVRC's platform provides pre-configured training pipelines that reduce setup time to under 30 minutes.

What hardware works best with diffusion policy?

Diffusion policy was originally benchmarked on tasks requiring precise, smooth trajectories: pushing, insertion, and pick-and-place. For dexterous finger-level tasks, the Wuji Hand (20 DOF, 768-point tactile) paired with diffusion policy outperforms ACT on tasks with contact-rich manipulation. For tabletop arm tasks, OpenArm Base is the most cost-effective platform. SVRC supports both configurations with managed data collection and training pipelines.