Introduction
Imitation learning is deceptively approachable. The algorithm is simple, the setup is standard, and there are excellent open-source implementations of ACT and Diffusion Policy, including those in LeRobot. Yet most teams hit the same set of avoidable mistakes — mistakes that can waste weeks of engineering time and produce policies that fail at deployment. This post catalogs the seven most common ones, with concrete fixes for each.
Mistake 1: Too Few Demos for Task Difficulty
The most common mistake is applying a fixed demo count regardless of task complexity. Teams read that ACT achieved strong results with 50 demonstrations on a simple pick task and assume 200 will be sufficient for their precision assembly task. It is not.
A rough difficulty scale:
L1 (open-loop reach-and-grasp of large objects): 50-200 demos.
L2 (closed-loop pick-place with varied object poses): 300-800 demos.
L3 (contact-rich insertion, assembly with 5mm tolerance): 1,000-5,000 demos.
L4 (dexterous in-hand manipulation): 5,000-20,000 demos.
We have seen teams attempt L3 tasks with 500 demos and conclude the algorithm does not work — when the real issue is data volume. The fix: use the difficulty scale above as a budget estimate before starting collection. If the estimated cost exceeds your budget, consider whether the task can be simplified (larger tolerance, pre-positioned object) to drop one difficulty level.
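The budgeting step above can be sketched as a small helper. This is a minimal sketch: the level ranges come from the scale above, while the function name and the per-demo cost are illustrative assumptions (your actual cost depends on operator rates and task length).

```python
# Demo-count ranges from the difficulty scale above (L1-L4).
DEMO_BUDGET = {
    1: (50, 200),        # L1: open-loop reach-and-grasp of large objects
    2: (300, 800),       # L2: closed-loop pick-place with varied object poses
    3: (1_000, 5_000),   # L3: contact-rich insertion, tight-tolerance assembly
    4: (5_000, 20_000),  # L4: dexterous in-hand manipulation
}

def estimate_collection_cost(level, cost_per_demo_usd=3.0):
    """Return (low, high) estimated collection cost in USD for a task level.

    cost_per_demo_usd is a placeholder assumption, not a quoted rate.
    """
    low, high = DEMO_BUDGET[level]
    return low * cost_per_demo_usd, high * cost_per_demo_usd
```

If the high end of the estimate exceeds your budget, that is the signal to simplify the task and re-estimate one level down.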
Mistake 2: Inconsistent Lighting Between Collection and Deployment
Lighting changes that look trivial to the human eye — moving from morning to afternoon sun through a window, replacing one overhead fluorescent tube, opening a nearby door — can change pixel values by 15-30% and cause visual policies to fail catastrophically. We have seen policies with a 90% success rate in the data collection room achieve 20% success when deployed one floor up in a different building.
Fix: control lighting during data collection (blackout curtains or LED panels with fixed color temperature). Apply aggressive color jitter augmentation during training (±30% brightness, ±20% hue, ±15% saturation). Test your trained policy under 5 different lighting conditions before considering it ready for deployment.
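The brightness and saturation jitter above can be sketched in plain NumPy. This is a minimal illustration, assuming HWC uint8 RGB images; hue jitter is omitted because it requires an HSV round-trip.

```python
import numpy as np

def color_jitter(img, rng, brightness=0.30, saturation=0.15):
    """Randomly jitter brightness and saturation of an HWC uint8 RGB image.

    Ranges default to the +/-30% brightness and +/-15% saturation
    suggested above.
    """
    img = img.astype(np.float32)
    # Brightness: scale all channels by a factor in [1 - b, 1 + b].
    img *= rng.uniform(1 - brightness, 1 + brightness)
    # Saturation: blend between the grayscale image and the original.
    gray = img.mean(axis=2, keepdims=True)
    s = rng.uniform(1 - saturation, 1 + saturation)
    img = gray + s * (img - gray)
    return np.clip(img, 0, 255).astype(np.uint8)
```

In a torchvision training pipeline, `torchvision.transforms.ColorJitter(brightness=0.3, saturation=0.15, hue=0.2)` covers all three ranges in one transform (note that torchvision's `hue` argument is a fraction of the hue circle, capped at 0.5, so 0.2 corresponds roughly to the ±20% above).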
Mistake 3: Not Filtering Failed Demonstrations
Raw demonstration data from new operators typically contains 15-30% failed episodes. Operators who are learning the teleoperation system make fumbled grasps, accidentally release objects, or abort mid-task. If you include these in your training set, the policy learns to imitate both successful and unsuccessful behaviors — which averages out to a mediocre policy.
The performance impact is larger than most teams expect: even 10% failed demos in the training set causes a 20-30% drop in policy success rate on most tasks. Fix: build or use a simple success classifier. For pick-place tasks, this can be as simple as checking whether the object is in the target zone at the end of the episode. Filter failed episodes before training. At SVRC, we run automated success classification plus human review on all collected data before delivery.
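The end-of-episode check described above can be a few lines of code. A minimal sketch, assuming each episode is a dict carrying an `obj_xy` position trajectory; the field name and zone geometry are illustrative assumptions, not a fixed format.

```python
import numpy as np

def in_target_zone(obj_xy, zone_center, zone_radius=0.05):
    """Success check for pick-place: object within zone_radius of the target."""
    dist = np.linalg.norm(np.asarray(obj_xy) - np.asarray(zone_center))
    return float(dist) <= zone_radius

def filter_episodes(episodes, zone_center, zone_radius=0.05):
    """Keep only episodes whose final object position lands in the target zone."""
    return [
        ep for ep in episodes
        if in_target_zone(ep["obj_xy"][-1], zone_center, zone_radius)
    ]
```

Even a crude geometric check like this removes most fumbled or aborted episodes; the remainder can go to human review.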
Mistake 4: Using a Single Camera
A single fixed camera creates occlusion problems during approach. When the robot arm moves into the workspace, it can block the view of the object and the gripper simultaneously — leaving the policy with no visual information precisely when it needs the most detail. This is particularly damaging for insertion tasks where the last 5cm of approach are critical.
Fix: use a three-camera setup as a baseline. A standard configuration: (1) overhead fixed camera for workspace overview and object detection, (2) side-angle fixed camera for arm-object spatial relationship, (3) wrist-mounted camera for close-up contact zone. The wrist camera is the single most impactful addition — it gives the policy direct visual feedback on the grasp contact that no external camera can provide.
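The three-camera baseline above can be captured as a small config. This is an illustrative sketch; the camera names and fields are assumptions, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class CameraConfig:
    name: str
    mount: str   # "fixed" or "wrist"
    role: str

# Three-camera baseline described above.
BASELINE_RIG = [
    CameraConfig("overhead", "fixed", "workspace overview and object detection"),
    CameraConfig("side", "fixed", "arm-object spatial relationship"),
    CameraConfig("wrist", "wrist", "close-up view of the contact zone"),
]
```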
Mistake 5: Evaluating on Training Distribution Only
A policy that achieves 90% success rate on the exact object instances and positions it was trained on may achieve 30% on slightly different conditions. Evaluating only on training distribution gives you an optimistic number that does not predict deployment performance.
Fix: before finalizing a policy, run evaluation in three held-out conditions: (1) novel object instances of the same category (different color, size, texture), (2) novel starting positions outside the training distribution, (3) different background/table surface. Require the policy to achieve your success target in all three conditions — not just the original training setup.
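A harness for the three held-out conditions can be sketched as follows. The `run_episode` callback and condition dicts are assumptions: `run_episode(condition)` is expected to execute one rollout under that condition and return whether it succeeded.

```python
def evaluate_policy(run_episode, conditions, n_trials=20, target=0.8):
    """Evaluate a policy across held-out conditions against a success target.

    conditions maps condition names (e.g. "novel_objects") to whatever
    setup parameters run_episode needs. Returns per-condition success
    rates and whether ALL conditions met the target.
    """
    results = {}
    for name, cond in conditions.items():
        successes = sum(bool(run_episode(cond)) for _ in range(n_trials))
        results[name] = successes / n_trials
    passed = all(rate >= target for rate in results.values())
    return results, passed
```

The key design point is the `all(...)`: a policy that clears the target on two of three conditions still fails the gate, which mirrors the requirement above.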
Mistake 6: Ignoring Compounding Error
Behavioral cloning policies suffer from compounding error: small mistakes accumulate over time, pushing the robot into states that were never seen during training. The policy has no data to recover from these novel states, so it makes another bad decision, which leads to an even more novel state, and so on. For tasks longer than 5-10 seconds, this can cause complete failure even when the early steps look correct.
Fix: two concrete approaches. Action chunking (as used in ACT) predicts 20-100 future actions at once and executes them open-loop, which reduces the number of inference calls and limits compounding. Temporal ensembling averages the predictions from multiple recent action chunks for smoother, more robust execution. For very long tasks, consider hierarchical policies or DAgger-style online data collection to gather recovery demonstrations.
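The temporal ensemble can be sketched in a few lines. This follows the ACT-style exponential weighting (w_i = exp(-m * i), oldest prediction first); the function name and the default value of the smoothing hyperparameter m are illustrative assumptions.

```python
import numpy as np

def temporal_ensemble(chunk_predictions, m=0.1):
    """Average overlapping action predictions for the current timestep.

    chunk_predictions: one predicted action per still-active chunk,
    ordered oldest first. Weights decay exponentially with recency,
    so earlier chunks contribute more; m controls the decay rate.
    """
    preds = np.asarray(chunk_predictions, dtype=np.float64)
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()  # normalize weights to sum to 1
    # Weighted average across chunks -> one smoothed action to execute.
    return (w[:, None] * preds).sum(axis=0)
```

With m = 0 this reduces to a plain average; larger m trusts the oldest chunk's prediction more heavily.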
Mistake 7: Skipping the Data Flywheel
The teams with the best-performing policies are not the ones who collected the most data upfront — they are the ones who continuously improved their data based on observed failure modes. One-time collection plus one-time training produces a policy frozen at the quality of your initial dataset.
Fix: build a data flywheel from day one. Deploy the policy, log every failure, review failures weekly, identify the top 3 failure mode categories, collect targeted demonstrations covering those failure modes, retrain. Even 50-100 targeted failure-recovery demonstrations per iteration can dramatically improve robustness. The SVRC data platform makes this loop efficient: failure logs feed directly into collection task queues, and operators collect targeted demonstrations against specific failure categories.
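The weekly triage step above amounts to counting failure categories and taking the top three. A minimal sketch, assuming failures are logged with free-form category labels assigned during review:

```python
from collections import Counter

def top_failure_modes(failure_logs, k=3):
    """Rank failure categories from deployment logs, most frequent first.

    failure_logs: iterable of category strings, one per logged failure.
    The returned top-k categories become the targets for the next round
    of demonstration collection.
    """
    return [category for category, _ in Counter(failure_logs).most_common(k)]
```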