Diffusion Policy for Robot Learning: What It Is and How to Use It
Diffusion Policy, introduced by Chi et al. in 2023, brought the generative modeling revolution to robot control. By treating action generation as a denoising problem, it handles the multimodal, high-dimensional nature of manipulation behavior in ways that simpler behavioral cloning algorithms cannot. Here is what you need to know to apply it to your own robotics project.
What Is Diffusion Policy?
Diffusion Policy is a class of robot control policies based on denoising diffusion probabilistic models (DDPMs) — the same mathematical framework that underlies text-to-image models like Stable Diffusion. In the robot context, the "image" being generated is a sequence of robot actions (a trajectory). Starting from pure Gaussian noise in action space, the model iteratively denoises it conditioned on the current visual observation and robot state, producing a coherent, high-quality action sequence after 10–100 denoising steps.
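The denoising process described above can be sketched in a few lines. This is a toy, self-contained illustration of the DDPM reverse process on a 1-D "trajectory" — the noise schedule is a generic linear one, and `denoise_net` is a hypothetical stand-in for the trained noise-prediction network (a real Diffusion Policy conditions on camera images and proprioception, not a single scalar):

```python
import math
import random

T = 50  # number of denoising steps in this toy schedule
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear beta schedule
alphas = [1.0 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def denoise_net(x, t, obs):
    # Placeholder for the learned noise predictor epsilon_theta(x, t, obs).
    # Here we pretend the "clean" trajectory is the observation repeated,
    # which lets the oracle recover the exact noise at each step.
    sqrt_ab = math.sqrt(alpha_bars[t])
    sqrt_1mab = math.sqrt(1.0 - alpha_bars[t])
    return [(xi - sqrt_ab * obs) / sqrt_1mab for xi in x]

def sample_actions(obs, horizon=16):
    # Start from pure Gaussian noise in action space...
    x = [random.gauss(0.0, 1.0) for _ in range(horizon)]
    # ...and iteratively denoise, conditioned on the observation.
    for t in reversed(range(T)):
        eps = denoise_net(x, t, obs)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:  # inject noise at every step except the last (standard DDPM sampling)
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * random.gauss(0.0, 1.0) for xi in x]
    return x
```

With a real learned network in place of the oracle, the same loop turns noise into a coherent action sequence conditioned on the robot's observation.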
The key insight is that diffusion models learn a full probability distribution over actions rather than predicting a single best action. For robotics, this is critical. Human demonstrations of the same task are naturally multimodal: a person might grasp a cup from the left side or the right side depending on subtle contextual cues. A model that must collapse this distribution to a single prediction will either commit to one mode and fail the other half of the time, or average the modes and produce a bizarre in-between trajectory that always fails. Diffusion Policy avoids this by modeling the distribution explicitly and sampling from it at inference time.
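The mode-averaging failure is easy to see with numbers. In this invented example, half the demonstrations grasp from the left (action -1.0) and half from the right (+1.0); a regression-trained BC policy minimizing mean-squared error converges to the mean, while a generative policy samples one of the demonstrated modes:

```python
import random

# Invented demonstration set: two valid grasp directions, equally represented.
demos = [-1.0] * 50 + [1.0] * 50

# An MSE-regression BC policy predicts the mean of the demonstrations:
bc_action = sum(demos) / len(demos)   # 0.0 -> an in-between motion that hits the cup

# A distribution-modeling policy samples a demonstrated mode instead:
sampled_action = random.choice(demos)  # -1.0 or +1.0, either one a valid grasp
```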
Why Diffusion Policy Outperforms Standard Behavioral Cloning
Standard behavioral cloning (BC) trains a policy as a supervised regression problem: given observation, predict action. This works when the mapping from observations to actions is deterministic and unimodal. In practice, manipulation tasks rarely are. Even "simple" tasks like picking a block off a table involve multiple valid approach angles, grasp poses, and pre-grasp configurations. Naive BC produces policies that hesitate at decision points, make compromised motion choices, or fail outright when the test distribution differs slightly from training.
Diffusion Policy consistently outperforms BC baselines on benchmark manipulation suites. In the original paper, it achieved state-of-the-art results on 11 of 12 tasks in the Robomimic benchmark, with particularly large margins on tasks with high action multimodality. On real-robot evaluations, Diffusion Policy demonstrated more robust recovery behavior — when the robot reached a slightly wrong intermediate state, the policy could recover because it was sampling from a broad distribution rather than following a deterministic path.
Compared to ACT (Action Chunking with Transformers), Diffusion Policy generally performs better on tasks with strong multimodality and worse on tasks with long-horizon dependencies where ACT's chunk prediction shines. In practice, both algorithms are competitive enough that dataset quality and quantity matter more than the policy architecture choice. If you are unsure which to use, try ACT first for speed of iteration, then Diffusion Policy if you observe mode-averaging failures.
Data Requirements for Diffusion Policy
Diffusion Policy benefits from more data than ACT, primarily because the denoising network has more parameters and a richer modeling objective. A practical minimum is 100–200 demonstrations for a single task under controlled conditions. To achieve robust deployment performance — handling object position variation, lighting changes, and occasional sensor noise — budget 300–500 demonstrations per task. Unlike ACT, Diffusion Policy tends to continue improving with additional data up to quite large dataset sizes, making it the better choice if you plan to invest in a large-scale data collection effort.
Data diversity is as important as volume. Demonstrations should span the range of object positions, orientations, and scene configurations you expect in deployment. A tight cluster of demonstrations with objects always in exactly the same place will produce a policy that fails the moment an object is moved by a few centimeters. SVRC's managed data collection service follows structured variation protocols — systematically randomizing object positions, lighting conditions, and operator grip styles — to ensure datasets that produce generalizable policies.
The observation representation also matters significantly. Diffusion Policy with a ResNet image encoder trained end-to-end generally outperforms policies using frozen pre-trained encoders on narrow task distributions, but pre-trained encoders (R3M, MVP, DINO) produce better generalization when test conditions differ from training. For most practical projects, start with a pre-trained encoder to maximize the value of your dataset, and switch to end-to-end training only if you have 500+ demonstrations and a stable environment.
Training Setup and Compute Requirements
The reference implementation of Diffusion Policy (available at the Columbia Robotics Lab GitHub) trains with either a UNet backbone (faster inference, lower capacity) or a Transformer backbone (slower inference, higher capacity). For most single-task projects, the UNet variant is the right starting point. Training on a single RTX 3090 or 4090 takes 4–12 hours for a 200-episode dataset, depending on observation resolution and action horizon length.
Key hyperparameters to set correctly: the action horizon (how many future steps to predict — typically 16–32 for tabletop tasks), the number of diffusion steps (100 for DDPM, 10–25 for DDIM with minimal quality loss), and the observation window (how many past frames to include — typically 2). Do not change all three at once; fix the others when tuning one. The most impactful change for improving policy performance is usually increasing the dataset size, not tuning architecture hyperparameters.
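As a starting point, those hyperparameters might be collected in a config like the sketch below. The key names are illustrative, not the reference implementation's exact schema:

```python
# Hypothetical training config reflecting the defaults discussed above.
config = {
    "action_horizon": 16,        # future steps predicted per pass (16-32 typical for tabletop)
    "num_train_timesteps": 100,  # DDPM diffusion steps used during training
    "num_inference_steps": 16,   # DDIM steps at deployment (10-25 with minimal quality loss)
    "obs_window": 2,             # past observation frames fed to the policy
}
```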
For inference on a real robot, DDPM at 100 steps is typically too slow for high-frequency control. Use the DDIM scheduler with 10–25 steps, which runs at ~20Hz on an RTX 3090 — adequate for 10Hz control with a buffer. Alternatively, consistency policy distillation can achieve 1–3 step inference with minimal performance degradation for simpler tasks.
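The speedup from DDIM comes from visiting only an evenly spaced subset of the training timesteps rather than all of them. A minimal sketch of that subsampling (function name and spacing choice are ours, not a particular library's API):

```python
# Pick an evenly spaced, descending subset of the DDPM training timesteps.
# Denoising only at these steps cuts inference cost roughly proportionally,
# e.g. 100 training steps -> 10 inference steps is a ~10x speedup.
def ddim_timesteps(num_train_steps=100, num_inference_steps=10):
    stride = num_train_steps // num_inference_steps
    # e.g. [90, 80, ..., 10, 0] for 100 train steps and 10 inference steps
    return list(range(num_train_steps - stride, -1, -stride))
```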
Using SVRC Data Services for Diffusion Policy
SVRC's data services pipeline produces datasets formatted for direct use with the Diffusion Policy reference implementation and the HuggingFace LeRobot framework. Episodes are stored as Zarr archives with synchronized image streams, proprioceptive state, and actions at 50Hz. Quality filtering removes episodes where the task was not completed successfully, the robot collided with the environment, or operator hesitation produced non-representative trajectories.
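To make the episode structure concrete, here is a hypothetical sketch of one episode, shown as plain nested dicts. A real delivery would be a Zarr store; the group and key names below are invented for illustration only:

```python
# Invented episode layout: synchronized streams keyed by modality,
# with per-episode metadata. Names here do not reflect the actual schema.
episode = {
    "observations": {
        "wrist_image": [],    # per-step camera frames, e.g. HxWx3 uint8 arrays
        "overhead_image": [],
        "proprio": [],        # joint positions / gripper state, 50Hz
    },
    "actions": [],            # commanded actions, time-aligned with observations
    "meta": {"task": "pick_block", "success": True, "fps": 50},
}
```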
Our collection service uses the SVRC teleoperation platform with dual-arm capable leader-follower control, wrist-mounted and overhead cameras, and optional force-torque logging. For multi-task Diffusion Policy training — where a single policy learns multiple tasks conditioned on task ID or language — we can collect across task variants within the same campaign and deliver a unified dataset. Teams working with the OpenArm or ALOHA hardware platforms get native hardware support; custom hardware integration is available on request. Contact our team to discuss your data requirements and timeline.