Quality Over Quantity

The most consequential insight from large-scale imitation learning research over the past three years is that data quality dominates data quantity for policy performance up to roughly 10,000 demonstrations. A curated dataset of 1,000 high-quality demonstrations consistently outperforms 5,000 noisy ones when training behavior cloning, ACT, or diffusion policy.

Despite this, most data collection programs have no formal quality metrics — operators collect until they hit a target count, then hand off the data to the training team. This article defines the metrics that predict policy performance and the thresholds required for each major algorithm.

Task Success Rate

Task success rate is the most important single metric. It measures what fraction of demonstrations in your dataset represent a genuinely successful task completion, not just a completed motion.

  • Binary success: The gold standard — a human reviewer watches each episode and labels it pass/fail against a written rubric. Expensive at scale (budget $0.10–$0.25/episode) but the most reliable signal.
  • Partial credit rubrics: For complex multi-step tasks, a binary label loses information. A 5-point rubric (0: complete failure, 1: task started, 2: key intermediate achieved, 3: near-complete, 4: full success) allows granular dataset curation. Include only demos scoring 3–4 in the training set.
  • Automated success classifiers: For scale, train a ResNet-18 or CLIP-based classifier on a manually labeled seed set of 500–1,000 episodes. Apply to all remaining episodes. Achieves 85–92% accuracy for well-defined tasks. Always validate classifier performance against a fresh held-out human-labeled set.
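As a minimal sketch of rubric-based curation (the episode records and the "rubric_score" field name are hypothetical), the 3–4 cutoff reduces to a filter over labeled episodes:

```python
# Hypothetical episode records carrying a human-assigned 0-4 rubric score.

def curate_by_rubric(episodes, min_score=3):
    """Keep only episodes whose rubric score meets the training cutoff (3-4)."""
    return [ep for ep in episodes if ep["rubric_score"] >= min_score]

episodes = [
    {"id": "ep001", "rubric_score": 4},  # full success
    {"id": "ep002", "rubric_score": 2},  # key intermediate only -> excluded
    {"id": "ep003", "rubric_score": 3},  # near-complete
]

train_set = curate_by_rubric(episodes)
kept_ids = [ep["id"] for ep in train_set]
```

The same filter doubles as a success-rate estimator: len(train_set) / len(episodes) over a reviewed sample.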

Trajectory Smoothness

Smooth demonstrations train smoother policies. Jerky demonstrations — caused by operator inexperience, teleoperation lag, or mechanical binding — introduce high-frequency noise that policies struggle to learn from and may replicate in deployment.

  • Jerk metric: The third derivative of position with respect to time. Compute the RMS jerk over each trajectory. Good operator demos have RMS jerk < 2 m/s³ for tabletop manipulation. Flag episodes with RMS jerk > 5 m/s³ for human review.
  • Velocity variance: High variance in end-effector velocity (measured as the coefficient of variation of |ṗ| over the trajectory) indicates hesitation and corrections. Target CV < 0.6 for production-quality demos.
  • Joint velocity limits: Reject episodes where any joint exceeds 80% of its maximum rated velocity. Demonstrations near velocity limits indicate operator overreaction or hardware problems.
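The jerk and velocity-variance checks above can be computed directly from logged end-effector positions with finite differences; a sketch assuming a fixed sample period dt:

```python
import numpy as np

def rms_jerk(positions, dt):
    """RMS magnitude of the third finite difference of position (m/s^3).

    positions: (T, 3) end-effector positions sampled at a fixed period dt.
    """
    jerk = np.diff(positions, n=3, axis=0) / dt**3          # (T-3, 3)
    return float(np.sqrt(np.mean(np.sum(jerk**2, axis=1))))

def velocity_cv(positions, dt):
    """Coefficient of variation of end-effector speed over the trajectory."""
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    return float(np.std(speed) / np.mean(speed))

# Sanity check: a constant-velocity straight line has (numerically) zero
# jerk and zero speed variation.
t = np.arange(0.0, 1.0, 0.01)
line = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
```

Flagging then reduces to comparing rms_jerk(...) against the 2/5 m/s³ thresholds above.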

Workspace Coverage

A policy trained only on demonstrations that all follow the same path will fail when initial conditions vary slightly. Workspace coverage measures how well your demonstrations span the relevant state space.

  • End-effector convex hull volume: Compute the 3D convex hull of all end-effector positions across the dataset. Compare against the theoretical task workspace. Target >60% coverage for tasks with high initial condition variance.
  • Object position coverage: If object placement is randomized during collection, verify that the distribution of object start positions in your dataset is uniform across the intended range. A coverage heatmap (2D grid of start positions) makes gaps immediately visible.
  • Trajectory length distribution: Plot the histogram of per-episode trajectory lengths (in steps). The distribution should be unimodal and not too wide (CV < 0.4). A bimodal distribution often indicates two distinct solution strategies — important to handle explicitly.
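A start-position coverage heatmap like the one described is a few lines with NumPy; the function name, bin count, and synthetic data below are illustrative choices:

```python
import numpy as np

def coverage_heatmap(starts, x_range, y_range, bins=5):
    """Occupancy grid of object start positions; empty cells reveal gaps.

    starts: (N, 2) array of object (x, y) start positions.
    Returns the grid and the fraction of cells with at least one demo.
    """
    grid, _, _ = np.histogram2d(
        starts[:, 0], starts[:, 1], bins=bins, range=[x_range, y_range]
    )
    return grid, float(np.count_nonzero(grid) / grid.size)

# Synthetic stand-in for logged start positions over a 40 x 60 cm region.
rng = np.random.default_rng(0)
starts = rng.uniform([0.2, -0.3], [0.6, 0.3], size=(200, 2))
grid, frac = coverage_heatmap(starts, (0.2, 0.6), (-0.3, 0.3))
```

Plotting the grid (e.g. as an image) makes uncovered cells immediately visible, per the bullet above.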

Action Diversity

Action diversity measures whether your dataset contains a rich variety of manipulation strategies, which is important for training policies that can handle unexpected states.

  • Action entropy: Discretize the action space and compute the entropy of the action marginal distribution. Low entropy indicates that most demonstrations follow nearly identical action sequences — the policy may be over-specialized.
  • Per-operator diversity: Compute pairwise DTW distances between trajectories from the same operator and between trajectories from different operators. High within-operator consistency is fine; low between-operator diversity suggests all operators converged on the same strategy, limiting generalization.
  • Grasp pose diversity: For grasping tasks, extract the gripper orientation at grasp contact for each demo. Plot the distribution over the SO(3) sphere. A good dataset covers a range of approach angles, not just a single canonical top-down grasp.
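The action-entropy check can be sketched per action dimension with a simple histogram (the bin count is an illustrative choice; extend per dimension and sum for a rough full-action-space estimate):

```python
import numpy as np

def action_entropy(actions, bins=10):
    """Shannon entropy (nats) of one discretized action dimension."""
    counts, _ = np.histogram(actions, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

# Synthetic contrast: varied commands vs. every demo issuing the same action.
rng = np.random.default_rng(1)
diverse = rng.uniform(-1.0, 1.0, size=5000)
repetitive = np.full(5000, 0.2)

h_diverse = action_entropy(diverse)        # near log(10) ~ 2.30 nats
h_repetitive = action_entropy(repetitive)  # 0 nats: one occupied bin
```

A dataset whose entropy sits near the repetitive end of this range is a candidate for the over-specialization concern above.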

Human Consistency Score

This metric measures how reproducible the demonstrations are across operators — a proxy for whether the task specification is clear and achievable.

  • Inter-operator agreement: Compute the success rate separately for each operator. The spread across operators (standard deviation σ of per-operator success rates) should be low. High spread (σ > 0.15) means the task instructions are ambiguous or some operators need retraining.
  • Gold standard similarity: Compare each operator's demos against the gold standard set using DTW on joint trajectories. Operators with mean DTW distance > 2× the gold-standard median are candidates for retraining.
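A sketch of the inter-operator agreement check, using only the standard library (operator names and pass/fail labels are made up):

```python
from statistics import pstdev

def operator_success_spread(labels_by_operator):
    """Per-operator success rates and their population std dev (sigma)."""
    rates = {op: sum(ls) / len(ls) for op, ls in labels_by_operator.items()}
    return rates, pstdev(rates.values())

demos = {
    "op_a": [1, 1, 1, 0, 1],   # 0.8
    "op_b": [1, 0, 1, 1, 1],   # 0.8
    "op_c": [0, 1, 0, 0, 1],   # 0.4 -- outlier, retraining candidate
}
rates, sigma = operator_success_spread(demos)
needs_review = sigma > 0.15    # the threshold from the bullet above
```

Here op_c drags sigma to roughly 0.19, tripping the 0.15 threshold.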

Automated Quality Pipeline Architecture

At scale, quality assessment must be automated. The recommended pipeline runs each ingested episode through four sequential gates:

  • Gate 1 — Schema and sensor validation: Checks HDF5 schema, verifies camera frames are non-blank, validates proprioception data range. Rejects ~2–5% of episodes for technical defects.
  • Gate 2 — Kinematic filters: Joint limit checks, velocity limit checks, duration bounds. Rejects ~5–10%.
  • Gate 3 — Smoothness filter: RMS jerk and velocity variance thresholds. Rejects ~5–15% depending on operator experience.
  • Gate 4 — Success classifier: ML-based success prediction. Rejects ~15–25% for typical tasks. All episodes flagged by the classifier are sent to a human review queue rather than deleted automatically.
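The four gates compose naturally as a short-circuiting chain of predicates. The episode field names below are hypothetical stand-ins for real pipeline outputs; thresholds mirror the text (80% of rated joint velocity, 5 m/s³ jerk):

```python
def run_gates(episode, gates):
    """Return (passed, rejecting_gate); stops at the first failing gate."""
    for name, check in gates:
        if not check(episode):
            return False, name
    return True, None

GATES = [
    ("schema",     lambda ep: ep.get("frames_ok", False)),
    ("kinematics", lambda ep: ep.get("max_joint_vel_frac", 1.0) <= 0.8),
    ("smoothness", lambda ep: ep.get("rms_jerk", float("inf")) <= 5.0),
    ("success",    lambda ep: ep.get("classifier_score", 0.0) >= 0.5),
]

clean = {"frames_ok": True, "max_joint_vel_frac": 0.5,
         "rms_jerk": 1.8, "classifier_score": 0.9}
jerky = dict(clean, rms_jerk=7.2)

passed, _ = run_gates(clean, GATES)
rejected, gate = run_gates(jerky, GATES)   # rejected at "smoothness"
```

In a real deployment the final gate's rejections would be routed to the human review queue rather than dropped, per the pipeline description.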

Quality Thresholds by Algorithm

Algorithm               Min. Success Rate   Max. Jerk (RMS)   Min. Demos (typical task)   Sensitivity to Noise
Behavior Cloning (BC)   >85%                <3 m/s³           500–2,000                   High — averages modes
ACT (Action Chunking)   >80%                <4 m/s³           200–1,000                   Medium — CVAE handles multimodality
Diffusion Policy        >70%                <5 m/s³           300–1,500                   Low — diffusion models multimodality
IBC (Implicit BC)       >80%                <4 m/s³           1,000–5,000                 Medium
GAIL / adversarial IL   >75%                <5 m/s³           500–3,000                   Low
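One way to operationalize the table is a lookup consulted at curation time. The success-rate and jerk values are transcribed from the table above; the abbreviated keys and function name are illustrative, and the numbers should be treated as starting points, not hard requirements:

```python
THRESHOLDS = {
    "bc":        {"min_success": 0.85, "max_rms_jerk": 3.0},
    "act":       {"min_success": 0.80, "max_rms_jerk": 4.0},
    "diffusion": {"min_success": 0.70, "max_rms_jerk": 5.0},
    "ibc":       {"min_success": 0.80, "max_rms_jerk": 4.0},
    "gail":      {"min_success": 0.75, "max_rms_jerk": 5.0},
}

def dataset_meets(algo, success_rate, rms_jerk):
    """Check a dataset's aggregate metrics against one algorithm's thresholds."""
    t = THRESHOLDS[algo]
    return success_rate > t["min_success"] and rms_jerk < t["max_rms_jerk"]

# A noisier dataset can still clear the diffusion-policy bar while
# failing the stricter behavior-cloning bar.
ok_for_diffusion = dataset_meets("diffusion", success_rate=0.74, rms_jerk=4.2)
ok_for_bc = dataset_meets("bc", success_rate=0.74, rms_jerk=4.2)
```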

These thresholds are conservative starting points. Your specific task, robot platform, and environment will require empirical calibration. Log your quality metrics and policy performance together so you can identify which metrics are actually predictive for your use case.

The SVRC data platform runs this quality pipeline automatically on all ingested episodes and provides per-metric dashboards for dataset review.