Introduction
Imitation learning is deceptively approachable. The algorithm is simple, the setup is standard, and there are excellent open-source implementations of ACT and Diffusion Policy, including those in LeRobot. Yet most teams hit the same set of avoidable mistakes — mistakes that can waste weeks of engineering time and produce policies that fail at deployment. This post catalogs the seven most common ones, with concrete fixes for each.
Mistake 1: Too Few Demos for Task Difficulty
The most common mistake is applying a fixed demo count regardless of task complexity. Teams read that ACT achieved strong results with 50 demonstrations on a simple pick task and assume 200 will be sufficient for their precision assembly task. It is not.
A rough difficulty scale:
L1 (open-loop reach-and-grasp of large objects): 50-200 demos.
L2 (closed-loop pick-place with varied object poses): 300-800 demos.
L3 (contact-rich insertion, assembly with 5mm tolerance): 1,000-5,000 demos.
L4 (dexterous in-hand manipulation): 5,000-20,000 demos.
We have seen teams attempt L3 tasks with 500 demos and conclude the algorithm does not work — when the real issue is data volume. The fix: use the difficulty scale above as a budget estimate before starting collection. If the estimated cost exceeds your budget, consider whether the task can be simplified (larger tolerance, pre-positioned object) to drop one difficulty level.
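The budgeting step above can be sketched as a small helper. This is a minimal sketch: the level ranges come from the scale above, while the function name and the per-demo cost are illustrative assumptions (your actual cost depends on operator rates and task length).

```python
# Demo-count ranges from the difficulty scale above (L1-L4).
DEMO_BUDGET = {
    1: (50, 200),        # L1: open-loop reach-and-grasp of large objects
    2: (300, 800),       # L2: closed-loop pick-place with varied object poses
    3: (1_000, 5_000),   # L3: contact-rich insertion, tight-tolerance assembly
    4: (5_000, 20_000),  # L4: dexterous in-hand manipulation
}

def estimate_collection_cost(level, cost_per_demo_usd=3.0):
    """Return (low, high) estimated collection cost in USD for a task level.

    cost_per_demo_usd is a placeholder assumption, not a quoted rate.
    """
    low, high = DEMO_BUDGET[level]
    return low * cost_per_demo_usd, high * cost_per_demo_usd
```

If the high end of the estimate exceeds your budget, that is the signal to simplify the task and re-estimate one level down.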
Mistake 2: Inconsistent Lighting Between Collection and Deployment
Lighting changes that look trivial to the human eye — moving from morning to afternoon sun through a window, replacing one overhead fluorescent tube, opening a nearby door — can change pixel values by 15-30% and cause visual policies to fail catastrophically. We have seen policies with a 90% success rate in the data collection room achieve 20% success when deployed one floor up in a different building.
Fix: control lighting during data collection (blackout curtains or LED panels with fixed color temperature). Apply aggressive color jitter augmentation during training (±30% brightness, ±20% hue, ±15% saturation). Test your trained policy under 5 different lighting conditions before considering it ready for deployment.
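The brightness and saturation jitter above can be sketched in plain NumPy. This is a minimal illustration, assuming HWC uint8 RGB images; hue jitter is omitted because it requires an HSV round-trip.

```python
import numpy as np

def color_jitter(img, rng, brightness=0.30, saturation=0.15):
    """Randomly jitter brightness and saturation of an HWC uint8 RGB image.

    Ranges default to the +/-30% brightness and +/-15% saturation
    suggested above.
    """
    img = img.astype(np.float32)
    # Brightness: scale all channels by a factor in [1 - b, 1 + b].
    img *= rng.uniform(1 - brightness, 1 + brightness)
    # Saturation: blend between the grayscale image and the original.
    gray = img.mean(axis=2, keepdims=True)
    s = rng.uniform(1 - saturation, 1 + saturation)
    img = gray + s * (img - gray)
    return np.clip(img, 0, 255).astype(np.uint8)
```

In a torchvision training pipeline, `torchvision.transforms.ColorJitter(brightness=0.3, saturation=0.15, hue=0.2)` covers all three ranges in one transform (note that torchvision's `hue` argument is a fraction of the hue circle, capped at 0.5, so 0.2 corresponds roughly to the ±20% above).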
Mistake 3: Not Filtering Failed Demonstrations
Raw demonstration data from new operators typically contains 15-30% failed episodes. Operators who are learning the teleoperation system make fumbled grasps, accidentally release objects, or abort mid-task. If you include these in your training set, the policy learns to imitate both successful and unsuccessful behaviors — which averages out to a mediocre policy.
The performance impact is larger than most teams expect: even 10% failed demos in the training set causes a 20-30% drop in policy success rate on most tasks. Fix: build or use a simple success classifier. For pick-place tasks, this can be as simple as checking whether the object is in the target zone at the end of the episode. Filter failed episodes before training. At SVRC, we run automated success classification plus human review on all collected data before delivery.
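The end-of-episode check described above can be a few lines of code. A minimal sketch, assuming each episode is a dict carrying an `obj_xy` position trajectory; the field name and zone geometry are illustrative assumptions, not a fixed format.

```python
import numpy as np

def in_target_zone(obj_xy, zone_center, zone_radius=0.05):
    """Success check for pick-place: object within zone_radius of the target."""
    dist = np.linalg.norm(np.asarray(obj_xy) - np.asarray(zone_center))
    return float(dist) <= zone_radius

def filter_episodes(episodes, zone_center, zone_radius=0.05):
    """Keep only episodes whose final object position lands in the target zone."""
    return [
        ep for ep in episodes
        if in_target_zone(ep["obj_xy"][-1], zone_center, zone_radius)
    ]
```

Even a crude geometric check like this removes most fumbled or aborted episodes; the remainder can go to human review.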
Mistake 4: Using a Single Camera
A single fixed camera creates occlusion problems during approach. When the robot arm moves into the workspace, it can block the view of the object and the gripper simultaneously — leaving the policy with no visual information precisely when it needs the most detail. This is particularly damaging for insertion tasks where the last 5cm of approach are critical.
Fix: use a three-camera setup as a baseline. A standard configuration: (1) overhead fixed camera for workspace overview and object detection, (2) side-angle fixed camera for arm-object spatial relationship, (3) wrist-mounted camera for close-up contact zone. The wrist camera is the single most impactful addition — it gives the policy direct visual feedback on the grasp contact that no external camera can provide.
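The three-camera baseline above can be captured as a small config. This is an illustrative sketch; the camera names and fields are assumptions, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class CameraConfig:
    name: str
    mount: str   # "fixed" or "wrist"
    role: str

# Three-camera baseline described above.
BASELINE_RIG = [
    CameraConfig("overhead", "fixed", "workspace overview and object detection"),
    CameraConfig("side", "fixed", "arm-object spatial relationship"),
    CameraConfig("wrist", "wrist", "close-up view of the contact zone"),
]
```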
Mistake 5: Evaluating on Training Distribution Only
A policy that achieves 90% success rate on the exact object instances and positions it was trained on may achieve 30% on slightly different conditions. Evaluating only on training distribution gives you an optimistic number that does not predict deployment performance.
Fix: before finalizing a policy, run evaluation in three held-out conditions: (1) novel object instances of the same category (different color, size, texture), (2) novel starting positions outside the training distribution, (3) different background/table surface. Require the policy to achieve your success target in all three conditions — not just the original training setup.
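A harness for the three held-out conditions can be sketched as follows. The `run_episode` callback and condition dicts are assumptions: `run_episode(condition)` is expected to execute one rollout under that condition and return whether it succeeded.

```python
def evaluate_policy(run_episode, conditions, n_trials=20, target=0.8):
    """Evaluate a policy across held-out conditions against a success target.

    conditions maps condition names (e.g. "novel_objects") to whatever
    setup parameters run_episode needs. Returns per-condition success
    rates and whether ALL conditions met the target.
    """
    results = {}
    for name, cond in conditions.items():
        successes = sum(bool(run_episode(cond)) for _ in range(n_trials))
        results[name] = successes / n_trials
    passed = all(rate >= target for rate in results.values())
    return results, passed
```

The key design point is the `all(...)`: a policy that clears the target on two of three conditions still fails the gate, which mirrors the requirement above.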
Mistake 6: Ignoring Compounding Error
Behavioral cloning policies suffer from compounding error: small mistakes accumulate over time, pushing the robot into states that were never seen during training. The policy has no data to recover from these novel states, so it makes another bad decision, which leads to an even more novel state, and so on. For tasks longer than 5-10 seconds, this can cause complete failure even when the early steps look correct.
Fix: two concrete approaches. Action chunking (as used in ACT) predicts 20-100 future actions at once and executes them open-loop, which reduces the number of inference calls and limits compounding. Temporal ensembling averages the predictions from multiple recent action chunks for smoother, more robust execution. For very long tasks, consider hierarchical policies or DAgger-style online data collection to gather recovery demonstrations.
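The temporal ensemble can be sketched in a few lines. This follows the ACT-style exponential weighting (w_i = exp(-m * i), oldest prediction first); the function name and the default value of the smoothing hyperparameter m are illustrative assumptions.

```python
import numpy as np

def temporal_ensemble(chunk_predictions, m=0.1):
    """Average overlapping action predictions for the current timestep.

    chunk_predictions: one predicted action per still-active chunk,
    ordered oldest first. Weights decay exponentially with recency,
    so earlier chunks contribute more; m controls the decay rate.
    """
    preds = np.asarray(chunk_predictions, dtype=np.float64)
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()  # normalize weights to sum to 1
    # Weighted average across chunks -> one smoothed action to execute.
    return (w[:, None] * preds).sum(axis=0)
```

With m = 0 this reduces to a plain average; larger m trusts the oldest chunk's prediction more heavily.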
Mistake 7: Skipping the Data Flywheel
The teams with the best-performing policies are not the ones who collected the most data upfront — they are the ones who continuously improved their data based on observed failure modes. One-time collection plus one-time training produces a policy frozen at the quality of your initial dataset.
Fix: build a data flywheel from day one. Deploy the policy, log every failure, review failures weekly, identify the top 3 failure mode categories, collect targeted demonstrations covering those failure modes, retrain. Even 50-100 targeted failure-recovery demonstrations per iteration can dramatically improve robustness. The SVRC data platform makes this loop efficient: failure logs feed directly into collection task queues, and operators collect targeted demonstrations against specific failure categories.
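The weekly triage step above amounts to counting failure categories and taking the top three. A minimal sketch, assuming failures are logged with free-form category labels assigned during review:

```python
from collections import Counter

def top_failure_modes(failure_logs, k=3):
    """Rank failure categories from deployment logs, most frequent first.

    failure_logs: iterable of category strings, one per logged failure.
    The returned top-k categories become the targets for the next round
    of demonstration collection.
    """
    return [category for category, _ in Counter(failure_logs).most_common(k)]
```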