Why Pipeline Design Matters
Data collection for robot learning is expensive. At $30–$50/hour for operator time and lab equipment, a 1,000-episode dataset costs $5,000–$15,000 in direct costs alone. A poorly designed pipeline compounds this: inconsistent recording formats force manual conversion; missing metadata makes episodes unusable; no quality filtering means 30–40% of collected data trains the policy to fail.
A well-engineered pipeline pays for itself in the first week of use. This article describes each stage from raw teleoperation recording to a training-ready dataset, with the filters and checks that separate research-grade data from noise.
Stage 1 — Synchronized Recording
The recording stage captures all sensor streams during teleoperation with guaranteed synchronization.
- What to capture: Camera images from all views (30 Hz minimum); joint states — positions, velocities, torques (100 Hz); commanded actions (same rate as joint states); force/torque if equipped (100–1000 Hz); gripper state (open/closed + current).
- Synchronization: Hardware camera trigger for <1 ms image sync. All streams timestamped to a single PTP-synchronized clock. Log the trigger pulse as a separate channel to verify sync quality post-collection.
- Metadata per episode: Operator ID, robot ID (for multi-robot labs), task name and variant, start/end timestamp, environment conditions (lighting preset, table configuration), unique episode UUID for traceability.
- Storage format: HDF5 with chunked datasets — one HDF5 file per episode, one HDF5 group per sensor stream. JPEG compression for images (quality 90); lossless float32 for numerical streams. Never overwrite: write to a new file and validate before deleting any raw data.
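The post-collection sync verification mentioned above can be automated by comparing per-frame timestamps across streams on the shared PTP clock. A minimal sketch (numpy assumed; the 1 ms threshold and stream names are illustrative):

```python
import numpy as np

def check_sync(cam_ts, joint_ts, max_offset_s=1e-3):
    """For each camera frame, find the nearest joint-state timestamp
    and report the worst-case offset. Timestamps are seconds on the
    shared PTP-synchronized clock."""
    cam_ts = np.asarray(cam_ts)
    joint_ts = np.asarray(joint_ts)
    # Index of the nearest joint sample for every camera frame.
    idx = np.searchsorted(joint_ts, cam_ts)
    idx = np.clip(idx, 1, len(joint_ts) - 1)
    left = np.abs(cam_ts - joint_ts[idx - 1])
    right = np.abs(cam_ts - joint_ts[idx])
    worst = np.minimum(left, right).max()
    return worst, worst <= max_offset_s

# With a hardware trigger, camera frames coincide with joint samples:
joints_t = np.arange(0, 1, 0.01)        # 100 Hz joint stream
cam_t = joints_t[::3]                   # frames triggered on joint ticks
worst_offset, ok = check_sync(cam_t, joints_t)
```

Running this on every episode right after recording turns the logged trigger channel into an automatic pass/fail check rather than a manual inspection step.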
Stage 2 — Quality Filtering
Not every episode is worth training on. Aggressive quality filtering is one of the highest-leverage interventions in a data pipeline — removing the bottom 20% of episodes often improves policy success rate by more than adding 50% more data.
- Automated success classification: Train a lightweight CNN on the final camera frame to predict success/failure. Label 500 episodes by hand, train in 30 minutes, achieve ~92% accuracy. Run on all new episodes before adding to the dataset.
- Trajectory smoothness filter: Compute mean absolute jerk (third derivative of position) for each joint trajectory. Episodes above the 90th percentile jerk threshold likely contain operator hesitations or corrections — flag for manual review or discard.
- Duration outlier filter: For each task, compute the mean and standard deviation of episode duration. Flag episodes more than 2σ from the mean — they are likely recovery attempts from mistakes (too long) or incomplete episodes (too short).
- Collision detection flag: Episodes where the robot safety system triggered an emergency stop should be automatically excluded.
- Typical rejection rate: 20–40% of raw episodes are rejected by automated filters. This is normal and expected — do not try to use everything.
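The smoothness and duration filters above reduce to a few lines of numpy. A sketch (the dt value and 2σ cutoff are the defaults described above, not fixed constants):

```python
import numpy as np

def mean_abs_jerk(positions, dt):
    """Mean absolute jerk (third finite difference of position) for a
    joint trajectory of shape (T, n_joints) sampled every dt seconds."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return np.abs(jerk).mean()

def duration_outliers(durations, n_sigma=2.0):
    """Boolean mask flagging episodes more than n_sigma from the
    mean duration for a task."""
    d = np.asarray(durations, dtype=float)
    return np.abs(d - d.mean()) > n_sigma * d.std()

# A smooth reach has far lower jerk than one with operator corrections:
t = np.linspace(0, 2, 200)[:, None]
smooth = np.sin(t)
hesitant = smooth + 0.05 * np.sign(np.sin(8 * t))  # abrupt reversals
```

The 90th-percentile jerk threshold comes from the dataset itself: compute mean_abs_jerk for every episode, take np.percentile(jerks, 90), and flag everything above it for review.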
Stage 3 — Annotation
Annotations enrich episodes with structured labels that enable better training and analysis.
- Language instruction: A natural-language string describing the task. Standardize vocabulary across operators (e.g., always "pick up the <object>" rather than each operator's free-form phrasing) so identical tasks map to identical instructions.
- Success score (0–100): A continuous quality score from the automated classifier, complementing the binary success label. Useful for curriculum learning — start training on high-scoring episodes.
- Phase segmentation: Label the semantic phase of each timestep: reach / grasp / transport / place / release. Enables phase-conditioned policies and per-phase analysis of failure modes.
- Failure mode label (optional): For failed episodes: grasp_failure / drop / wrong_object / timeout. Aggregate statistics guide hardware and task design improvements.
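These annotations fit naturally into one structured record per episode, stored alongside the HDF5 file. A sketch of such a record; the field names and phase boundaries are illustrative, not a fixed schema:

```python
import json

# Illustrative annotation record for one episode.
annotation = {
    "episode_id": "<sha256-of-raw-hdf5>",
    "language_instruction": "pick up the red block",
    "success": True,
    "success_score": 87,          # 0-100, from the automated classifier
    "phases": [                   # phase segmentation by timestep index
        {"name": "reach",     "start": 0,   "end": 119},
        {"name": "grasp",     "start": 120, "end": 169},
        {"name": "transport", "start": 170, "end": 289},
        {"name": "place",     "start": 290, "end": 339},
        {"name": "release",   "start": 340, "end": 359},
    ],
    "failure_mode": None,         # e.g. "grasp_failure" for failed episodes
}

def phases_contiguous(phases):
    """Sanity check: phases must tile the episode, no gaps or overlaps."""
    return all(a["end"] + 1 == b["start"] for a, b in zip(phases, phases[1:]))

record = json.dumps(annotation)
```

Validating contiguity at annotation time catches labeling mistakes before they silently corrupt per-phase statistics.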
Stage 4 — Format Conversion
Different training frameworks expect different data formats. The conversion stage produces standardized exports.
- Source of truth: HDF5. Always maintain the original HDF5 files. Never delete them after conversion — the source of truth enables re-conversion when new formats emerge.
- RLDS (TensorFlow Datasets): Required format for Octo, RT-X, and Open X-Embodiment. Convert using the rlds_dataset_builder tools. Each episode becomes a TFDS episode with a standardized step structure.
- LeRobot (HuggingFace Parquet): Parquet-based format compatible with the HuggingFace datasets ecosystem. Used by LeRobot, ACT-fork training scripts, and many community implementations.
- Zarr (Diffusion Policy native): Memory-mappable array store used by the original Diffusion Policy codebase for fast random access during training.
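One way to keep the fan-out from a single source of truth manageable is a converter registry: each target format registers a function that takes an episode loaded from HDF5 and emits that format's structure. The converters below are hypothetical stubs; real exports would call the rlds_dataset_builder, LeRobot, or zarr tooling:

```python
# Registry mapping format names to converter functions.
CONVERTERS = {}

def converter(fmt):
    def register(fn):
        CONVERTERS[fmt] = fn
        return fn
    return register

@converter("rlds")
def to_rlds(episode):
    # Stub: reshape into an RLDS-style {steps: [...]} structure.
    return {"steps": episode["frames"]}

@converter("zarr")
def to_zarr(episode):
    # Stub: flat arrays suited to memory-mapped random access.
    return {"data": episode["frames"], "meta": episode["meta"]}

def export(episode, fmt):
    """Convert an episode dict loaded from the HDF5 source of truth.
    The HDF5 file itself is never modified or deleted."""
    return CONVERTERS[fmt](episode)
```

When a new training framework appears, adding one registered function re-opens the entire archive to it, which is exactly why the HDF5 originals must survive every conversion.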
Stage 5 — Versioning and Provenance
Dataset reproducibility requires strict versioning. Models should be trainable from a dataset version tag alone — not from "the files in this folder right now."
- Episode immutability: Once an episode passes quality filtering and is added to the dataset, it is never modified. Corrections create new episodes; old ones are archived, not deleted.
- Hash-based episode IDs: Compute SHA-256 of the raw HDF5 file. Use as the episode ID — guarantees that the ID uniquely identifies the content, and any corruption is detectable.
- Dataset manifests: A JSON file per dataset version listing all episode IDs, their metadata, split assignments (train/val/test), and filter criteria. Check this file into git for full provenance.
- DVC (Data Version Control): For large datasets, DVC tracks which data files correspond to which git commit. Alternative: store manifests in git and data files in S3/GCS with content-addressed keys.
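The hash-based IDs and manifests above need only the standard library. A sketch (manifest field names follow the description above and are illustrative):

```python
import hashlib
import json

def episode_id(raw_bytes):
    """Content-addressed episode ID: SHA-256 of the raw HDF5 bytes.
    Identical content always yields the identical ID, and any bit-level
    corruption changes it."""
    return hashlib.sha256(raw_bytes).hexdigest()

def episode_id_from_file(path, chunk=1 << 20):
    """Same hash, computed in chunks so large episode files
    never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(version, episodes, splits, filters):
    """Manifest JSON checked into git; data files live elsewhere."""
    return json.dumps({
        "version": version,
        "episodes": episodes,   # list of {id, metadata} entries
        "splits": splits,       # e.g. {"train": [...], "val": [...]}
        "filters": filters,     # Stage 2 criteria used for this version
    }, indent=2, sort_keys=True)
```

Because the manifest is plain JSON with sorted keys, diffs between dataset versions are human-readable in git, and "which episodes trained this model" reduces to reading one file at one commit.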
Stage 6 — Training Integration
The final stage feeds the dataset into the model training loop efficiently.
- Stratified sampling: Sample mini-batches with equal representation of task variants, object types, and operators. Prevents the model from overfitting to the majority class.
- Online augmentation: Apply data augmentation on the fly in the data loader (not pre-computed) to keep storage requirements manageable: color jitter (±15% brightness/contrast/saturation), random crop (±10% of image size), Gaussian noise on joint states (σ=0.001 rad).
- Episode weighting: Weight episodes by success score — high-quality successes receive higher sampling weight during training. Typical weighting: success_score² / Σ(success_score²).
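The score-squared weighting above is a few lines of numpy. A sketch (numpy assumed; batch size and scores are illustrative):

```python
import numpy as np

def episode_weights(success_scores):
    """Sampling weights proportional to success_score squared,
    normalized to sum to 1: score^2 / sum(score^2)."""
    s = np.asarray(success_scores, dtype=float) ** 2
    return s / s.sum()

def sample_batch(episode_ids, scores, batch_size, rng):
    """Draw a mini-batch of episode IDs, with replacement,
    weighted by success score."""
    w = episode_weights(scores)
    return rng.choice(episode_ids, size=batch_size, p=w)

rng = np.random.default_rng(0)
ids = np.arange(100)
scores = np.linspace(50, 100, 100)   # classifier scores, 0-100 scale
batch = sample_batch(ids, scores, batch_size=32, rng=rng)
```

Squaring the score makes the weighting quietly aggressive: an episode scoring 100 is drawn four times as often as one scoring 50, not twice as often, which concentrates early training on the cleanest demonstrations.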
The SVRC data platform automates Stages 2 through 5 of this pipeline, providing a web interface for quality review, annotation, and format export.