Why Failure Mode Analysis Matters
A policy that achieves 82% success in lab evaluation may achieve only 40–50% in deployment. The gap is not random: it is almost always caused by specific, identifiable failure modes that could have been diagnosed and mitigated before launch. Systematic failure mode analysis before deployment is the most cost-effective quality investment a robotics team can make.
The failure modes below are organized by root cause and include both diagnostic signals (how to identify if this mode is causing failures) and mitigation strategies (how to fix it with data, architecture changes, or deployment modifications).
Failure Mode 1: Out-of-Distribution Visual Inputs
Symptoms: Policy performs well in the lab environment but fails immediately when deployed in a new location, under different lighting, or with slightly different object appearance (new color variant, worn label, different surface finish).
- Root cause: The training data was collected in a narrow set of visual conditions. The policy's visual encoder has overfit to specific textures, lighting patterns, or backgrounds that are absent in the deployment environment.
- Diagnostic test: Collect 20 test episodes under 3 different lighting conditions (overhead fluorescent vs. natural light vs. dim), 2 table heights (±10 cm), and with 3 distractor objects placed on the workspace. If success rate drops >20% in any condition, OOD visual inputs are the likely cause.
- Mitigation — data augmentation: Augment training images with random color jitter (brightness ±30%, contrast ±30%, hue ±20%), random background substitution (paste the workspace onto random ImageNet backgrounds), and simulated lighting variation. Apply augmentation on the fly at training time from the raw images rather than baking augmented copies into the dataset.
- Mitigation — diverse data collection: Collect demonstrations across multiple lighting conditions, table surfaces, and background configurations during the initial data collection phase. This costs roughly 5× as much as single-condition collection but yields dramatically more robust policies.
- Mitigation — domain adaptation: After deployment, collect 50–100 demonstration episodes in the deployment environment and fine-tune the policy. Even a small domain-specific fine-tuning set substantially recovers performance.
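The color-jitter part of the augmentation mitigation can be sketched in a few lines. This is a minimal illustration, not the article's implementation: it applies only the brightness and contrast components (±30%, matching the ranges above) to a float image in [0, 1]; the `jitter` function name and the use of NumPy are assumptions for the example.

```python
import numpy as np

def jitter(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random brightness/contrast jitter on a float image in [0, 1]."""
    brightness = rng.uniform(-0.3, 0.3)   # ±30% brightness shift
    contrast = rng.uniform(0.7, 1.3)      # ±30% contrast scale
    out = (img - 0.5) * contrast + 0.5 + brightness  # scale about mid-gray
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
aug = jitter(img, rng)   # drawn fresh per sample, per epoch
```

Because the random parameters are drawn per call, each epoch sees a different jittered version of each raw image, which is what "apply at training time, not baked in" means in practice.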
Failure Mode 2: Compounding Errors (BC Covariate Shift)
Symptoms: Policy starts the task correctly but gradually drifts into unusual states it has not seen in training, then fails catastrophically. Failures occur earlier in the task as conditions become less familiar. This is the classic pattern of behavior cloning policies.
- Root cause: Behavior cloning (and its variants) trains only on in-distribution states (states that appeared in training trajectories). Any small policy error shifts the robot to a state not in the training distribution, causing larger errors, leading to states even further from training, and eventually catastrophic failure.
- Diagnostic test: Plot the distribution of states at which failures occur along the task timeline. If failures cluster in the second half of the task and the policy succeeds more on shorter tasks, compounding errors are the cause.
- Mitigation — action chunking (ACT): Predicting H future steps forces the policy to plan coherent trajectories rather than reacting step-by-step. Substantially reduces compounding. See the action chunking article.
- Mitigation — DAgger: Dataset Aggregation. After training on an initial demonstration set, deploy the policy and have a human expert label the correct action at the states the policy actually visits, especially where it errs. Add these labeled states to the training set and retrain. Most effective when human experts can observe and correct policy rollouts efficiently.
- Mitigation — diffusion policy: Diffusion's iterative denoising of action sequences is more robust to early execution errors than single-step prediction because the denoising process implicitly re-centers on plausible trajectories.
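The DAgger loop above can be sketched end-to-end in a toy 1-D setting. Everything here is a stand-in: the expert is a hand-coded damping controller, `fit` is a least-squares line fit, and `rollout` injects a small execution error so the policy visits states outside the initial demonstrations — the mechanism that DAgger's aggregation step corrects.

```python
import numpy as np

def dagger(policy_fit, expert_label, rollout, states, actions, iters=3):
    """Schematic DAgger: roll out the current policy, have the expert
    label the states it visited, aggregate, and retrain."""
    states, actions = list(states), list(actions)
    policy = policy_fit(states, actions)
    for _ in range(iters):
        visited = rollout(policy)                       # states the policy reaches
        actions += [expert_label(s) for s in visited]   # expert corrections
        states += visited
        policy = policy_fit(states, actions)            # retrain on aggregated set
    return policy

expert = lambda s: -0.5 * s                   # toy expert: damp the state toward 0

def fit(states, actions):
    k = np.polyfit(states, actions, 1)[0]     # fit linear gain a = k * s
    return lambda s: k * s

def rollout(policy, s=2.0, steps=5):
    visited = []
    for _ in range(steps):
        s = s + policy(s) + 0.1               # small execution error causes drift
        visited.append(s)
    return visited

policy = dagger(fit, expert, rollout, [0.0, 1.0], [expert(0.0), expert(1.0)])
```

The key design point is that labels are requested on the policy's own state distribution, not the expert's, which is exactly what breaks the compounding-error cycle.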
Failure Mode 3: Mode Collapse and Mode Averaging
Symptoms: Multiple valid strategies exist in the training data (e.g., grasp from left or right), but the policy executes a confused intermediate trajectory that is neither strategy and fails both. Or: the policy always executes the same single strategy regardless of task state, failing when that strategy is blocked.
- Root cause: Standard mean-squared-error training of a deterministic policy minimizes average prediction error across demonstrations. When multiple modes exist, the minimum average-error prediction is the mean of all modes — which is not a valid trajectory.
- Diagnostic test: Review all training demonstrations and manually identify whether multiple distinct strategies are present. If operators used qualitatively different approaches (different approach angle, different grasp type), mode averaging is likely.
- Mitigation — CVAE (ACT): ACT's Conditional VAE encodes the demonstration's "style" into a latent variable, allowing the policy to commit to one mode at inference time. Requires sufficient demonstrations of each mode (20–50 per mode minimum).
- Mitigation — diffusion policy: Diffusion policies learn the full multi-modal action distribution and sample a single coherent mode at inference time rather than averaging the modes together.
- Mitigation — data curation: If one strategy is strongly preferred (e.g., always approach from the left in the deployment environment), curate the training set to over-represent that strategy. A 70/30 mix rather than 50/50 will bias the policy toward the preferred mode.
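The mode-averaging root cause is easy to demonstrate numerically. The scenario below is illustrative (the ±0.2 m grasp positions are made-up numbers): with two equally represented strategies, the MSE-optimal deterministic output is their mean, which matches neither demonstrated behavior.

```python
import numpy as np

# Two valid strategies for the same observation: grasp approach from the
# left (x = -0.2 m) or from the right (x = +0.2 m), equally represented.
demo_actions = np.array([-0.2, -0.2, 0.2, 0.2])

# A deterministic policy trained with MSE converges to the prediction that
# minimizes mean squared error over the demos: the mean of all modes.
mse_prediction = demo_actions.mean()   # 0.0: straight down the middle

# Verify the mean really is the MSE minimizer over a grid of candidates.
candidates = np.linspace(-0.3, 0.3, 601)
errors = ((candidates[:, None] - demo_actions) ** 2).mean(axis=1)
best = candidates[errors.argmin()]
```

The 70/30 curation mitigation shifts this minimizer toward the over-represented mode, but only a genuinely multi-modal model (CVAE, diffusion) can commit to one strategy outright.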
Failure Mode 4: Precision Degradation Near Contact
Symptoms: Policy performs well during approach and coarse positioning but consistently fails at the moment of contact — misses by 2–5 mm on insertion tasks, drops objects during transfer, or fails to seat connectors.
- Root cause: Camera-based perception has finite resolution. At a typical fixed-camera distance of 60–80 cm from the workspace, 1 camera pixel corresponds to approximately 0.5–1 mm of workspace position. For tasks requiring <3 mm precision, pixel noise is comparable to the required accuracy.
- Mitigation — wrist camera: A wrist-mounted camera (5–15 cm from the contact point) provides 10–20× higher effective resolution at the critical precision phase. See the camera placement article for a full analysis.
- Mitigation — force/torque feedback: Adding a wrist-mounted F/T sensor (ATI Mini45, OnRobot HEX) provides contact detection that does not depend on camera resolution. Policies can be trained to use force signals for the final contact phase.
- Mitigation — two-phase policy: Split the task into a coarse-positioning phase (camera-based, standard resolution) and a precision-contact phase (wrist camera or F/T feedback). Train separate policies or use a hierarchical approach that switches to the precision policy near contact.
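The pixel-resolution arithmetic behind this failure mode can be checked with a pinhole-camera back-of-envelope calculation. The distances, field of view, and sensor width below are illustrative assumptions, chosen to fall inside the ranges the text quotes.

```python
import math

def mm_per_pixel(distance_mm: float, hfov_deg: float, width_px: int) -> float:
    """Workspace millimetres spanned by one pixel for an ideal pinhole camera."""
    span_mm = 2 * distance_mm * math.tan(math.radians(hfov_deg) / 2)
    return span_mm / width_px

# Fixed camera: 70 cm from the workspace, 70 deg horizontal FOV, 1280 px wide.
fixed = mm_per_pixel(700, 70, 1280)   # roughly 0.77 mm per pixel
# Wrist camera at 10 cm with the same optics.
wrist = mm_per_pixel(100, 70, 1280)   # roughly 0.11 mm per pixel
```

With these numbers the fixed camera lands in the 0.5–1 mm/px band quoted above, and the wrist camera is about 7× finer; at 5 cm standoff the ratio reaches roughly 14×, consistent with the 10–20× claim.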
Failure Mode 5: Occlusion Handling
Symptoms: Policy performs well when the workspace is unoccluded but fails when the robot arm or gripper comes between the camera and the object of interest. Failures concentrated in the final approach phase where the arm blocks the camera view.
- Root cause: Fixed cameras cannot see through the robot arm. When the robot approaches an object, the arm occludes the fixed camera's view. If the training data was not collected under these same occlusion conditions, the policy receives out-of-distribution images precisely when it most needs accurate visual feedback.
- Mitigation — wrist camera: A wrist-mounted camera maintains a consistent view of the gripper-object interface regardless of arm configuration. Because the camera travels with the arm, the arm itself cannot block its view.
- Mitigation — multi-view setup: Adding a second fixed camera at a different angle (e.g., overhead + side) ensures that at least one camera always has a clear view of the workspace, even during arm occlusion.
- Mitigation — consistent collection conditions: If occlusion is unavoidable in deployment, ensure that training data was collected with the arm in the same occlusion configurations. Collecting training data on a different-shaped table where occlusion does not occur trains the policy on non-deployment-representative visual inputs.
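For the multi-view mitigation, one simple runtime policy is to feed the model the least-occluded view. The sketch below assumes a per-camera arm mask is available (e.g., rendered from the known joint state and a URDF model); `select_view` and the mask source are assumptions for illustration, not a standard API.

```python
import numpy as np

def select_view(images, arm_masks):
    """Return the image from the camera whose view is least blocked by the arm.
    arm_masks: per-camera boolean arrays, True where the arm covers the pixel."""
    occlusion = [mask.mean() for mask in arm_masks]  # fraction of view blocked
    return images[int(np.argmin(occlusion))]

imgs = [np.zeros((4, 4, 3)), np.ones((4, 4, 3))]
masks = [np.zeros((4, 4), bool),   # camera 0: fully clear
         np.ones((4, 4), bool)]    # camera 1: fully blocked by the arm
best = select_view(imgs, masks)    # picks camera 0
```

A fixed multi-view policy (always concatenating both views) is the more common alternative; per-frame selection is useful when the downstream model accepts only a single image.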
Failure Mode Evaluation Protocol
Before deploying a policy, run this systematic evaluation protocol:
- Nominal conditions: 30 trials in training lab conditions. Establishes baseline success rate.
- Lighting variation: 10 trials each in 3 distinct lighting conditions (overhead fluorescent, warm ambient, dim). Any >20% success rate drop flags OOD visual inputs.
- Height and position variation: 10 trials at 2 table heights (±10 cm from training height). Flags calibration-sensitive policies.
- Distractor objects: 10 trials with 3–5 irrelevant objects in the workspace. Flags foreground confusion.
- Novel object variants: 10 trials with a different color or texture variant of the target object. Flags texture overfitting.
- Deployment environment: 20 trials in the actual deployment environment before launch. The non-negotiable final gate.
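The protocol's pass/fail bookkeeping can be automated with a small helper. This is a sketch under stated assumptions: `flag_conditions` and its input format (condition name mapped to successes and trial count) are hypothetical names introduced here, and the 20% threshold matches the flag criterion above.

```python
def flag_conditions(results: dict, threshold: float = 0.20):
    """Flag any condition whose success rate drops more than `threshold`
    below the nominal baseline. `results` maps condition name to
    (successes, trials) and must include a 'nominal' entry."""
    succ, trials = results["nominal"]
    baseline = succ / trials
    flags = {}
    for name, (s, n) in results.items():
        if name == "nominal":
            continue
        rate = s / n
        if baseline - rate > threshold:
            flags[name] = rate   # condition fails the >20%-drop criterion
    return baseline, flags

results = {
    "nominal": (25, 30),       # ~83% baseline over 30 trials
    "lighting_dim": (5, 10),   # 50%: a >20% drop, should be flagged
    "distractors": (8, 10),    # 80%: within tolerance
}
baseline, flags = flag_conditions(results)
```

With 10 trials per condition, single-trial noise is ±10 percentage points, so treat a marginal flag as a prompt for more trials in that condition rather than a definitive verdict.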