The Garbage-In Problem
Imitation learning is fundamentally a distribution matching problem: the trained policy will approximate the distribution of behaviors in its training data. If that distribution includes failed grasps, jerky motions, and inconsistent strategies, the policy will learn to reproduce them. Unlike language model training, where individual noisy examples are smoothed by billions of others, robot demonstration datasets are small enough that every example matters.
The quality framework below identifies six dimensions that distinguish high-quality robot demonstration data from mediocre data. Each dimension has a measurable metric and a specific threshold.
Dimension 1: Task Success Rate
Only fully successful episodes should be included in training data. This sounds obvious, but in practice, teams frequently include partial successes ("the robot got close") or filter only obvious catastrophic failures while keeping borderline cases.
The performance impact: including even 10% failed demonstrations in a training set causes a 20-30% drop in policy success rate on most L2 manipulation tasks. The mechanism is clear: the policy learns that "almost grasping" is an acceptable terminal state. Fix: binary success classification on every episode. For pick-place, this means physically checking object position at episode end. Build or use an automated classifier and supplement with human review on borderline cases.
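The routing logic above can be sketched as follows. This is a minimal illustration, not SVRC's actual classifier: the `Episode` record and its fields (`success`, `classifier_confidence`) are hypothetical names, and the confidence threshold for routing borderline cases to human review is an assumed parameter.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    episode_id: str
    success: bool                 # binary label from an automated classifier
    classifier_confidence: float  # 0..1, used to route borderline cases

def filter_episodes(episodes, review_threshold=0.9):
    """Keep confident successes; route low-confidence episodes to human review."""
    keep, review = [], []
    for ep in episodes:
        if ep.classifier_confidence < review_threshold:
            review.append(ep)   # borderline: needs human review
        elif ep.success:
            keep.append(ep)     # confident success: include in training set
        # confident failures are dropped entirely
    return keep, review
```

The key design choice is that failures are never silently retained: an episode either passes as a confident success, is dropped as a confident failure, or is escalated to a human.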
Dimension 2: Trajectory Smoothness
Jerky demonstrations teach the policy to be jerky. High-jerk trajectories arise from operator error, controller latency, or poor workspace ergonomics. They create two problems: they are difficult for the policy to reproduce precisely (high jerk amplifies small timing errors), and they cause unnecessary mechanical wear on hardware.
Measure smoothness with a jerk metric: compute the third derivative of joint positions across the trajectory and average its per-timestep L2 norm. Establish a per-task baseline from your best operators and filter out demonstrations with a smoothness score below 70% of that baseline. In practice, this filters 10-20% of demonstrations from novice operators but <5% from trained operators.
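A minimal sketch of the jerk metric, assuming joint positions are uniformly sampled as a `(T, num_joints)` array; the finite-difference scheme and the score normalization (1.0 = expert baseline) are illustrative choices, not a prescribed implementation.

```python
import numpy as np

def mean_jerk(joint_positions, dt):
    """Mean L2 norm of the finite-difference jerk (third derivative).

    joint_positions: array of shape (T, num_joints), sampled at interval dt.
    """
    jerk = np.diff(joint_positions, n=3, axis=0) / dt**3
    return float(np.mean(np.linalg.norm(jerk, axis=1)))

def smoothness_score(episode_jerk, baseline_jerk):
    """1.0 matches the expert baseline; scores below 0.70 fail the filter."""
    return baseline_jerk / episode_jerk
```

For a cubic trajectory q(t) = t^3 sampled at dt = 1, the finite-difference jerk is exactly 6, so `mean_jerk` returns 6.0 and a baseline of 6.0 yields a score of 1.0.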
Dimension 3: Object and Instance Diversity
A policy trained on a single object instance (one red apple, always in the same orientation) will fail on a green apple or the same red apple in a novel pose. The minimum diversity requirement for most L2 tasks: 5 distinct object instances per category, with at least 3 starting orientations each.
This is the diversity dimension most teams underinvest in. It is easy to collect 500 demonstrations on one object instance. It takes planning and procurement to collect across 5 instances. But a 100-demo dataset with 5 instances typically generalizes better than a 400-demo dataset with 1 instance.
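The 5-instance / 3-orientation minimum can be audited from episode metadata. A minimal sketch, assuming each episode record carries hypothetical `category`, `instance_id`, and `orientation_bin` fields:

```python
from collections import defaultdict

def check_object_diversity(episodes, min_instances=5, min_orientations=3):
    """Return categories that fail the minimum-diversity requirement."""
    instances = defaultdict(set)
    orientations = defaultdict(lambda: defaultdict(set))
    for ep in episodes:
        instances[ep["category"]].add(ep["instance_id"])
        orientations[ep["category"]][ep["instance_id"]].add(ep["orientation_bin"])
    failing = {}
    for cat, inst in instances.items():
        under_oriented = [i for i in sorted(inst)
                          if len(orientations[cat][i]) < min_orientations]
        if len(inst) < min_instances or under_oriented:
            failing[cat] = {"instances": len(inst),
                            "under_oriented": under_oriented}
    return failing
```

Running this before training turns the diversity requirement into a gate rather than a post-hoc diagnosis.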
Dimension 4: Environmental Diversity
Environmental diversity means varying the conditions that will differ between training and deployment: lighting, table height, background clutter, table surface texture. Minimum requirements for a robust dataset: 3 distinct lighting conditions (warm overhead, cool daylight, mixed), 2 table heights (±10cm from nominal), 2 background conditions (clean table surface vs. moderate clutter).
These variations can be added to an existing collection session with minimal overhead — changing a light setting takes 30 seconds. Teams that skip environmental diversity consistently report larger-than-expected performance drops at deployment.
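The environmental minimums above can be checked the same way. A sketch, assuming hypothetical `lighting`, `table_height`, and `background` fields in the episode metadata; the minimum counts mirror the requirements listed in the text:

```python
MIN_DISTINCT = {"lighting": 3, "table_height": 2, "background": 2}

def environment_coverage(episodes):
    """Count distinct values per condition and flag dimensions below minimum."""
    seen = {key: set() for key in MIN_DISTINCT}
    for ep in episodes:
        for key in seen:
            seen[key].add(ep[key])
    return {key: {"distinct": len(vals), "ok": len(vals) >= MIN_DISTINCT[key]}
            for key, vals in seen.items()}
```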
Dimension 5: Operator Diversity
A single-operator dataset has a systematic artifact: the operator's specific style, pace, and strategy are baked into every demonstration. The policy learns to reproduce that individual's behavior rather than the underlying task structure. This creates brittleness — the policy works well when deployment conditions match the single operator's style and fails when they differ.
Minimum recommendation: 3 distinct operators contributing roughly equal numbers of demonstrations. Each operator will approach difficult grasps differently, use different pre-grasp adjustments, and move at different speeds. This diversity is valuable — it forces the policy to learn the invariants of the task rather than the idiosyncrasies of a single person.
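The "roughly equal" recommendation can be enforced with a simple balance check. A sketch, assuming a hypothetical `operator_id` field; the 50% maximum-share cutoff is an assumed operationalization of "roughly equal", not a threshold from the text:

```python
from collections import Counter

def operator_balance(episodes, min_operators=3, max_share=0.5):
    """Require at least min_operators contributors, none dominating the set."""
    counts = Counter(ep["operator_id"] for ep in episodes)
    total = sum(counts.values())
    shares = {op: n / total for op, n in counts.items()}
    balanced = len(counts) >= min_operators and max(shares.values()) <= max_share
    return balanced, shares
```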
Dimension 6: Edge Case Coverage
Approximately 10% of demonstrations should cover challenging scenarios: cluttered workspaces, objects at the edge of the reachable range, near-failure recoveries where the operator almost drops an object and corrects. These edge case demonstrations dramatically improve policy robustness without requiring proportionally more total data.
Near-failure recoveries are particularly valuable — they teach the policy what to do when things are going wrong, which is exactly the situation where the policy needs the most guidance. Deliberately create challenging scenarios in 10-15% of your collection sessions.
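Tracking the edge-case fraction against the 10-15% band is straightforward if episodes carry a tag. A sketch, assuming a hypothetical boolean `edge_case` field in the metadata:

```python
def edge_case_fraction(episodes, target_low=0.10, target_high=0.15):
    """Fraction of episodes tagged as edge cases, plus a pass flag for the band."""
    frac = sum(1 for ep in episodes if ep.get("edge_case", False)) / len(episodes)
    return frac, target_low <= frac <= target_high
```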
SVRC's Quality Pipeline
All demonstration data collected through SVRC data services passes through an automated quality pipeline: binary success classification, smoothness scoring with operator-specific baselines, coverage analysis in embedding space to verify object and environment diversity, and human review on all borderline cases. You receive quality-certified data with per-episode quality scores and aggregate coverage statistics.