The Quality Drop: Three Warning Signals

Most teams scaling robot data collection hit the same wall around the 1,000-demonstration mark. Quality doesn't degrade gradually — it falls off a cliff. Three concrete signals predict a quality crisis before it fully arrives:

  • Inter-operator variance spikes: the same task produces wildly different trajectory styles across operators.
  • Rejection rate climbs above 40%: your automated pipeline is discarding nearly half of collected data.
  • Pipeline bottlenecks appear: storage, preprocessing, or review stages start queuing up days of backlog.
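
These signals can be wired into a simple weekly health check. A minimal sketch — the variance-spike multiplier and backlog cutoff are illustrative assumptions; only the 40% rejection threshold comes from the text above:

```python
def collection_health(rejection_rate, weekly_dtw_spreads, review_backlog_days):
    """Flag the three warning signals from weekly pipeline stats.

    rejection_rate: fraction of demos discarded by the automated pipeline.
    weekly_dtw_spreads: per-week spread (std dev) of operator calibration scores.
    review_backlog_days: days of data queued in the slowest pipeline stage.
    """
    warnings = []
    # Signal 1: inter-operator variance spiking relative to the first recorded week.
    if len(weekly_dtw_spreads) >= 2 and weekly_dtw_spreads[-1] > 1.5 * weekly_dtw_spreads[0]:
        warnings.append("operator variance spike")
    # Signal 2: rejection rate above the 40% threshold.
    if rejection_rate > 0.40:
        warnings.append("rejection rate > 40%")
    # Signal 3: backlog measured in days rather than hours.
    if review_backlog_days >= 2:
        warnings.append("pipeline backlog")
    return warnings
```

Running this on every week's pipeline stats makes the cliff visible while it is still a slope.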

The root cause is consistent: teams scale headcount without first scaling process. Adding a fifth or tenth operator to an undefined protocol amplifies inconsistency rather than throughput. The fix is structural, and it starts with tiering.

Operator Tier System

A three-tier operator structure maps skill level to task complexity and controls cost without sacrificing quality where it matters.

  • Junior Operators ($22/hr): Assigned to simple pick-and-place tasks with low dexterity requirements — single-arm reaches, flat surface transfers, repetitive bin picking. Target throughput: 12–18 demos per hour. Onboarding time: 4 hours including calibration on gold standard replays.
  • Senior Operators ($30/hr): Handle complex assembly, bimanual coordination, constrained insertion tasks. Expected to hit >85% first-pass acceptance rate. Participate in weekly calibration sessions and can flag ambiguous protocol cases.
  • QA Leads ($45/hr): Design task protocols, define the gold standard demo set, run calibration sessions, manage automated classifier thresholds, and perform final review on flagged batches. One QA lead per 8–10 operators is the sustainable ratio.
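
To see how the tiers translate into hourly cost, here is a quick blended-rate sketch using the rates and the one-lead-per-8-operators ratio above (the example team composition is made up):

```python
RATES = {"junior": 22, "senior": 30, "qa_lead": 45}  # $/hr from the tier system

def blended_hourly_cost(juniors, seniors):
    """Hourly labor cost for a team, adding one QA lead per 8 operators (rounded up)."""
    operators = juniors + seniors
    qa_leads = -(-operators // 8)  # ceiling division
    return (juniors * RATES["junior"]
            + seniors * RATES["senior"]
            + qa_leads * RATES["qa_lead"])
```

For a 10-operator team (6 junior, 4 senior, 2 QA leads), this comes to $342/hr — dividing by aggregate demos per hour gives the raw labor component of per-demo cost.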

Quality Control at Scale

The gold standard demo set is the foundation of scalable quality. For each task, maintain exactly 50 gold standard demonstrations recorded by QA leads under ideal conditions. These serve three purposes: (1) onboarding baseline — new operators watch the top 10 before their first session; (2) calibration anchors — weekly sessions compare operator output against gold standard on identical setups; (3) classifier training labels — your automated success classifier is fine-tuned on this labeled set.

Weekly calibration sessions are non-negotiable above 10 operators. Each session runs 30–45 minutes: operators complete 5 standardized task instances, results are compared against gold standard via pose trajectory DTW distance, outliers are coached individually. Variance across operators on calibration tasks is your leading indicator — if DTW spread increases week-over-week, your protocol has drifted.
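
The DTW comparison behind calibration scoring can be sketched in a few lines. Assuming each trajectory is an array of end-effector poses, a plain dynamic-programming DTW over Euclidean pose distance (in practice you might reach for a library such as `fastdtw` or `tslearn` instead):

```python
import numpy as np

def dtw_distance(traj_a, traj_b):
    """Dynamic time warping distance between two pose trajectories.

    traj_a, traj_b: (T, D) arrays of poses (e.g. xyz position + orientation).
    """
    n, m = len(traj_a), len(traj_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            # Best alignment: match, skip a frame in a, or skip a frame in b.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```

The weekly variance metric is then just the spread of each operator's `dtw_distance` to the gold standard on the same calibration task.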

The automated success classifier catches roughly 60% of failures in practice. Typical architecture: a lightweight CNN on the final 10 frames of the wrist camera stream, binary success/failure output, trained on 200+ labeled examples per task. False positive rate matters more than false negative rate here — you'd rather manually review a borderline case than silently pass bad data to training.
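
That asymmetry cashes out as a threshold choice: pick the lowest decision threshold that keeps the false positive rate (failures labeled success) under a budget, and route everything below it to manual review rather than silently accepting it. A sketch on held-out classifier scores — the 5% default budget is an illustrative assumption:

```python
def pick_threshold(scores, labels, max_fpr=0.05):
    """Smallest score threshold t such that the fraction of failures
    scoring >= t (the false positive rate) stays within max_fpr.

    scores: classifier success probabilities; labels: 1 = success, 0 = failure.
    """
    failures = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not failures:
        return min(scores)  # no failure examples to constrain against
    for t in sorted(set(scores)):
        if sum(1 for s in failures if s >= t) / len(failures) <= max_fpr:
            return t  # lowest threshold meeting the FPR budget
    return max(scores) + 1e-9  # budget unreachable: review everything
```

Demos scoring below the returned threshold go to the manual review queue, matching the "review a borderline case" policy above.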

Infrastructure: Centralized NAS vs. S3

Under 50TB of collected data, a centralized NAS with RAID-6 is operationally simpler and significantly cheaper at roughly $0.01/GB/month vs. S3's $0.023/GB/month. You get low-latency random access for episode review and replay.

Above 50TB — which a team of 20 operators reaches in roughly 6 months — S3 wins on every dimension except latency. Durability is 11 nines vs. your best hardware redundancy. You can attach Athena for SQL queries over episode metadata without spinning up a dedicated database. And you get tiered storage: hot episodes in S3 Standard, completed task archives in S3 Glacier at $0.004/GB/month.
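
At the quoted rates, the crossover economics are easy to tabulate. A sketch of monthly cost per option — the rates come from above, while the hot/archive split is an illustrative assumption:

```python
RATES = {"nas": 0.01, "s3_standard": 0.023, "s3_glacier": 0.004}  # $/GB/month

def monthly_cost_usd(total_tb, hot_fraction=0.2):
    """Monthly storage cost at the quoted $/GB/month rates.

    hot_fraction: share of data kept hot in S3 Standard; the rest is
    archived to S3 Glacier under a lifecycle policy.
    """
    gb = total_tb * 1024
    return {
        "nas": gb * RATES["nas"],
        "s3_flat": gb * RATES["s3_standard"],
        "s3_tiered": gb * (hot_fraction * RATES["s3_standard"]
                           + (1 - hot_fraction) * RATES["s3_glacier"]),
    }
```

At the 50TB crossover with 20% hot data, flat S3 Standard costs roughly 2.3x the NAS, but the tiered layout already undercuts it — which is why the lifecycle policy, not S3 alone, is the real win.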

Three pipeline automation steps deliver the biggest throughput gains: (1) automatic HDF5-to-Zarr conversion for training-ready format, (2) episode deduplication via perceptual hash on gripper camera thumbnails (catches controller glitches that re-record the same motion), and (3) metadata extraction (object pose, success label, operator ID, duration) into a PostgreSQL index for querying.
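
Step (2) can be sketched with a plain average-hash on grayscale thumbnails — a stand-in for perceptual-hash libraries such as `imagehash`; the 8×8 thumbnail size and Hamming threshold are illustrative assumptions:

```python
import numpy as np

def average_hash(thumb):
    """64-bit average hash of an 8x8 grayscale thumbnail (pixel values 0-255)."""
    thumb = np.asarray(thumb, dtype=float)
    assert thumb.shape == (8, 8)
    bits = (thumb > thumb.mean()).flatten()  # 1 where pixel is above the mean
    return int("".join("1" if b else "0" for b in bits), 2)

def is_duplicate(hash_a, hash_b, max_hamming=4):
    """Treat two episodes as duplicates when their hashes differ in <= max_hamming bits."""
    return bin(hash_a ^ hash_b).count("1") <= max_hamming
```

Near-identical gripper-camera thumbnails (a controller glitch re-recording the same motion) land within a few bits of each other, while genuinely different episodes do not.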

Per-Demo Cost Curve

The economics of scale are real but require deliberate infrastructure investment to realize. Raw per-demo cost at different volumes, assuming the tier system and automation above:

| Scale | Per-Demo Cost | Key Cost Driver |
| --- | --- | --- |
| 100 demos | $80/demo | Setup amortization, no automation benefit |
| 1,000 demos | $45/demo | Pipeline automation kicks in, NAS amortized |
| 10,000 demos | $25/demo | Full tier utilization, S3 lifecycle savings |
| 100,000 demos | $12–18/demo | Projected — requires dedicated QA software tooling |

The steepest drop happens between 100 and 1,000 demonstrations as you amortize setup costs and pipeline development. The next meaningful inflection happens around 5,000 demonstrations when automated quality classification reaches sufficient accuracy to replace most manual review. Above 10,000, gains come primarily from operator utilization improvements — longer sessions, task batching, and cross-task operator scheduling.
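
The shape of that curve falls out of simple amortization: fixed setup and pipeline costs spread over volume, plus a variable cost that steps down once automated QA replaces most manual review. A sketch — every dollar figure here is an illustrative assumption, not the table's actuals:

```python
def per_demo_cost(n_demos, fixed_setup=5000.0, variable_early=45.0,
                  variable_late=20.0, automation_at=5000):
    """Amortized per-demo cost: fixed costs spread over volume, with the
    variable cost stepping down once automated QA kicks in at `automation_at`."""
    variable = variable_early if n_demos < automation_at else variable_late
    return fixed_setup / n_demos + variable
```

The fixed term dominates below ~1,000 demos (hence the steep early drop), and the variable step at the automation threshold produces the second inflection.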

Getting Started

If you're planning to scale beyond 500 demonstrations, invest in protocol documentation and calibration infrastructure before headcount. The marginal cost of a second operator on an undefined protocol is higher than the value they add.

SVRC's data collection service provides managed operator teams, protocol design, and quality control infrastructure — purpose-built for teams that need scale without building the operations org from scratch.