Why Scaling Is Hard
Collecting 200 demonstrations is an engineering project. Collecting 10,000 is an operations problem. The bottlenecks shift completely: at small scale you are debugging hardware and task specification; at large scale you are managing operator throughput, data pipeline reliability, and cross-lab consistency.
The robotics community learned this the hard way building datasets like DROID (76,000 trajectories, 22 robots, 13 labs) and Open X-Embodiment (1M+ episodes, 22 robot types). Each of these projects required purpose-built infrastructure that took months to engineer. This article documents the patterns that work.
Operator Throughput Bottlenecks
The single largest constraint on data collection velocity is operator throughput. A skilled operator running a moderate-complexity task (L2–L3 on our difficulty scale) completes 30–80 demonstrations per day. To reach 10,000 demos in 8 weeks (40 working days), you need:
- At 50 demos/operator/day: 5 operators running concurrently for 40 days
- At 30 demos/operator/day (harder task): 9 operators for 40 days
- Plus 20–30% buffer for QA rejection: add 1–2 more operators
Sourcing, training, and maintaining 5–12 concurrent operators while keeping quality high is the core challenge. Operators are not fungible — quality variance between operators is large (success rates of 60–95% on the same task are common within a single team). Managing this variance requires systematic QA, not just ad hoc review.
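The staffing arithmetic above can be sketched as a small helper. The function name and the idea of folding the QA rejection buffer into an inflated collection target are ours, not from any particular team's tooling:

```python
import math

def operators_needed(target_demos: int, working_days: int,
                     demos_per_day: int, rejection_rate: float) -> int:
    """Concurrent operators required to hit target_demos *accepted*
    demos, inflating the raw collection target by the expected QA
    rejection rate."""
    raw_target = target_demos / (1.0 - rejection_rate)  # demos to collect
    per_operator = demos_per_day * working_days          # output per operator
    return math.ceil(raw_target / per_operator)

# The scenarios from the text: 10,000 accepted demos in 40 working days,
# assuming a 25% QA rejection rate.
print(operators_needed(10_000, 40, 50, 0.25))  # moderate task -> 7
print(operators_needed(10_000, 40, 30, 0.25))  # harder task   -> 12
```

With the rejection buffer folded in, the 50-demos/day scenario lands at 7 operators and the 30-demos/day scenario at 12, matching the 5–12 range quoted above.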
Distributed Collection Architecture
Beyond single-lab scale (~200 demos/week), you need distributed collection across multiple stations or labs. The DROID dataset ran across 13 academic labs simultaneously. The key architectural decisions:
- Shared data lake: All labs write episode files to a common S3 bucket using a standardized HDF5 schema. Episode metadata (operator ID, robot serial, timestamp, task ID, success label) is written atomically at episode completion. Avoid per-lab schemas that require post-hoc normalization.
- Hardware standardization: Use identical camera models, mounting positions, and robot configurations across all sites. Even minor camera position differences (5–10 cm) degrade cross-site policy training. Publish a calibration protocol document and run remote verification sessions.
- Task instruction protocol: Standardized task cards with photos, object placement templates (laser-cut acrylic trays work well), and video examples of correct vs. incorrect execution. Without this, operators at different sites develop subtly different interpretations of the task.
- Real-time monitoring dashboard: Track demos/hour, success rate, and active operators per site on a shared dashboard. This surfaces site-level problems (hardware failure, operator confusion) before they consume significant budget.
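The atomic metadata write in the shared-data-lake bullet can be sketched with a write-then-rename pattern, which is atomic on POSIX filesystems: a crash mid-write never leaves a partial `metadata.json` for the ingest pipeline to trip over. The function name and field names are illustrative, not the DROID schema:

```python
import json
import os
import tempfile

# Required fields from the metadata list above; names are our assumption.
REQUIRED_FIELDS = ("operator_id", "robot_serial", "timestamp",
                   "task_id", "success")

def commit_episode_metadata(episode_dir: str, meta: dict) -> str:
    """Atomically commit episode metadata at episode completion:
    write to a temp file in the same directory, fsync, then rename
    over the final path."""
    missing = [f for f in REQUIRED_FIELDS if f not in meta]
    if missing:
        raise ValueError(f"metadata missing required fields: {missing}")
    final_path = os.path.join(episode_dir, "metadata.json")
    fd, tmp_path = tempfile.mkstemp(dir=episode_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(meta, f, sort_keys=True)
        f.flush()
        os.fsync(f.fileno())          # flush to disk before the rename
    os.replace(tmp_path, final_path)  # atomic on POSIX
    return final_path
```

For an S3-backed lake the same property comes for free, since S3 object puts are all-or-nothing; the pattern above matters for the local staging directory before upload.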
Operator Quality Management at Scale
Quality management is the difference between a dataset that trains a working policy and one that does not. At scale, you cannot review every episode manually.
- Gold standard demonstrations: Collect 20–50 "gold" demos per task from your best operator. Use these as the quality reference. Automatically compute similarity scores (DTW on joint trajectories, success rate correlation) between new operator demos and gold standards.
- Inter-rater reliability: For subjective success labels, have two reviewers independently label 10% of episodes and measure Cohen's kappa. Target κ > 0.8; if κ falls below 0.7, your success criteria are ambiguous and need refinement.
- Operator scorecards: Track per-operator success rate, rejection rate, and throughput on a weekly basis. Operators with persistent rejection rates >35% need retraining or reassignment.
- Incentive structures: Per-quality bonuses (not per-volume) prevent operators from rushing. A $0.10–$0.25 bonus per accepted demo (above a quality threshold) outperforms hourly-only compensation for data quality.
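The inter-rater check above is simple enough to compute without a stats library. A minimal Cohen's kappa for two reviewers over the same episodes (observed agreement corrected for chance agreement):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa between two raters' labels for the same episodes."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of episodes where raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both raters pick the same label
    # if each labels independently at their own marginal rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # degenerate case: both raters always emit one label
    return (p_observed - p_chance) / (1.0 - p_chance)
```

Running this weekly over the double-labeled 10% sample gives the κ number to hold against the 0.8 target.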
Data Pipeline Infrastructure
At 10,000+ episodes, manual data handling is not feasible. You need an automated pipeline from episode capture to training-ready tensors.
- Episode format: HDF5 with a standardized schema (observations group: camera images, proprioception; actions group: joint velocities or end-effector deltas; metadata group: task, operator, timestamp). LeRobot's open-source dataset format is a good reference implementation.
- Automated validation: On ingest, run: (1) schema validation, (2) duration check (reject episodes <2s or >120s for most tasks), (3) joint limit violation check, (4) camera dropout detection (black frames). Reject and flag automatically.
- Versioning: Use DVC (Data Version Control) or a custom manifest system to tag dataset versions. Training runs must reference an immutable dataset version — this is critical for reproducibility when debugging policy failures weeks after data collection.
- Preprocessing: Image resizing, normalization, and augmentation should be applied at training time from the raw HDF5, not baked into the stored data. This allows reprocessing with different augmentation parameters without re-collection.
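The automated ingest checks can be sketched as a pure function that returns rejection reasons. The thresholds below are illustrative defaults, not values from any published pipeline, and schema validation is assumed to happen upstream:

```python
import numpy as np

def validate_episode(timestamps, joints, frames,
                     min_dur=2.0, max_dur=120.0,
                     joint_limits=(-2.9, 2.9),
                     black_thresh=5.0):
    """Run ingest-time checks on one episode. Returns a list of
    rejection reasons; an empty list means the episode is accepted."""
    reasons = []
    # (2) Duration check.
    duration = timestamps[-1] - timestamps[0]
    if not (min_dur <= duration <= max_dur):
        reasons.append(f"duration {duration:.1f}s outside "
                       f"[{min_dur}, {max_dur}]s")
    # (3) Joint limit violation check.
    lo, hi = joint_limits
    if np.any(joints < lo) or np.any(joints > hi):
        reasons.append("joint limit violation")
    # (4) Camera dropout: a frame with near-zero mean intensity is black.
    frame_means = frames.reshape(len(frames), -1).mean(axis=1)
    n_black = int((frame_means < black_thresh).sum())
    if n_black:
        reasons.append(f"{n_black} black frame(s) detected")
    return reasons
```

Rejected episodes should be flagged with their reasons rather than silently dropped, so operator scorecards and site dashboards can attribute the failures.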
Quality Filtering at Scale
Even with good operator management, expect 20–40% of episodes to be unfit for the training set. Manual review does not scale past a few thousand episodes.
- Success classifier: Train a lightweight binary classifier (ResNet-18 or similar) on a manually labeled subset of 500–1,000 episodes to predict success/failure from the final frame. Apply to all remaining episodes. Target 90%+ precision on the positive class.
- Trajectory outlier detection: Compute per-episode statistics (max joint velocity, trajectory length, action entropy). Use isolation forest or simple threshold rules to flag outliers. Outliers are disproportionately error episodes.
- Diversity sampling: After filtering, check that your training set has coverage across all task variants, object positions, and operator styles. Use k-means clustering on episode embeddings to identify underrepresented regions and collect more data there.
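The "simple threshold rules" variant of trajectory outlier detection can be sketched with per-episode statistics plus a robust z-score over the population. Using the median absolute deviation instead of the standard deviation is our choice here, so the outliers themselves don't inflate the threshold; function names and the 3.5 cutoff are illustrative:

```python
import numpy as np

def episode_stats(actions, dt=0.1):
    """Per-episode summary statistics for outlier screening.
    `actions` is a (T, dof) array of commanded positions."""
    deltas = np.diff(actions, axis=0)
    return {
        "max_joint_velocity": float(np.abs(deltas / dt).max()),
        "trajectory_length": float(np.linalg.norm(deltas, axis=1).sum()),
        "n_steps": len(actions),
    }

def flag_outliers(stats_list, key, z_thresh=3.5):
    """Indices of episodes whose statistic deviates from the population
    median by more than z_thresh robust (MAD-based) z-scores."""
    vals = np.array([s[key] for s in stats_list])
    med = np.median(vals)
    mad = np.median(np.abs(vals - med)) or 1e-9  # guard zero spread
    robust_z = 0.6745 * (vals - med) / mad
    return [i for i, z in enumerate(robust_z) if abs(z) > z_thresh]
```

Flagged episodes go to manual review rather than straight to the reject pile; some genuine but unusual demonstrations are exactly the diversity you want to keep.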
Real Numbers from Industry
| Dataset | Trajectories | Labs / Sites | Robot Types | Collection Period |
|---|---|---|---|---|
| DROID (Stanford) | 76,000 | 13 labs | 1 (Franka) | ~8 months |
| Open X-Embodiment | 1,000,000+ | 22 institutions | 22 types | Aggregated 2016–2023 |
| ALOHA-2 (Stanford) | 1,500 | 1 lab | 1 (ALOHA-2) | 3 months |
| BridgeData V2 | 60,000 | 4 lab configs | 1 (WidowX) | ~12 months |
| RoboSet (UT Austin) | 20,000 | 1 lab | 1 (Franka) | 4 months |
The DROID dataset is the gold standard reference for distributed data collection. Their technical report documents the calibration protocol, data format, and quality filtering pipeline in detail. For teams building a multi-lab program, it is required reading.
SVRC's data collection platform implements the architecture described in this article — episode streaming, automated QA, operator dashboards, and S3-backed versioned storage — available as a managed service.