Why Pipeline Design Matters

Data collection for robot learning is expensive. At $30–$50/hour for operator time and lab equipment, a 1,000-episode dataset costs $5,000–$15,000 in direct costs alone. A poorly designed pipeline compounds this: inconsistent recording formats force manual conversion; missing metadata makes episodes unusable; no quality filtering means 30–40% of collected data trains the policy to fail.

A well-engineered pipeline pays for itself in the first week of use. This article describes each stage from raw teleoperation recording to a training-ready dataset, with the filters and checks that separate research-grade data from noise.

Stage 1 — Synchronized Recording

The recording stage captures all sensor streams during teleoperation with guaranteed synchronization.

  • What to capture: Camera images from all views (30 Hz minimum); joint states — positions, velocities, torques (100 Hz); commanded actions (same rate as joint states); force/torque if equipped (100–1000 Hz); gripper state (open/closed + current).
  • Synchronization: Hardware camera trigger for <1 ms image sync. All streams timestamped to a single PTP-synchronized clock. Log the trigger pulse as a separate channel to verify sync quality post-collection.
  • Metadata per episode: Operator ID, robot ID (for multi-robot labs), task name and variant, start/end timestamp, environment conditions (lighting preset, table configuration), unique episode UUID for traceability.
  • Storage format: HDF5 with chunked datasets — one HDF5 file per episode, one HDF5 group per sensor stream. JPEG compression for images (quality 90); lossless float32 for numerical streams. Never overwrite: write to a new file and validate before deleting any raw data.
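The per-episode layout above can be sketched with h5py. This is a minimal illustration, not a production recorder: function and stream names are assumptions, and image streams (which would store JPEG-encoded bytes at quality 90) are omitted; only numerical streams are shown.

```python
import uuid
import numpy as np
import h5py

def write_episode(path, streams, metadata):
    """Write one teleoperation episode to its own HDF5 file.

    `streams` maps stream name -> (timestamps, data) arrays; `metadata`
    is a flat dict stored as root attributes. Names are illustrative.
    """
    with h5py.File(path, "w") as f:
        f.attrs["episode_uuid"] = str(uuid.uuid4())   # traceability
        for key, value in metadata.items():
            f.attrs[key] = value
        for name, (ts, data) in streams.items():
            grp = f.create_group(name)                 # one group per sensor stream
            grp.create_dataset("timestamps", data=ts.astype(np.float64))
            grp.create_dataset(
                "data",
                data=data.astype(np.float32),          # lossless float32 numerics
                chunks=(min(len(data), 256),) + data.shape[1:],  # chunked for I/O
            )
```

Writing each episode to its own file keeps a single corrupted write from affecting neighbors and makes the later hash-based episode IDs trivial to compute.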

Stage 2 — Quality Filtering

Not every episode is worth training on. Aggressive quality filtering is one of the highest-leverage interventions in a data pipeline — removing the bottom 20% of episodes often improves policy success rate by more than adding 50% more data.

  • Automated success classification: Train a lightweight CNN on the final camera frame to predict success/failure. Label 500 episodes by hand, train in 30 minutes, achieve ~92% accuracy. Run on all new episodes before adding to the dataset.
  • Trajectory smoothness filter: Compute mean absolute jerk (third derivative of position) for each joint trajectory. Episodes above the 90th percentile jerk threshold likely contain operator hesitations or corrections — flag for manual review or discard.
  • Duration outlier filter: For each task, compute the mean and standard deviation of episode duration. Flag episodes more than 2σ from the mean — they are likely recovery attempts from mistakes (too long) or incomplete episodes (too short).
  • Collision detection flag: Episodes where the robot safety system triggered an emergency stop should be automatically excluded.
  • Typical rejection rate: 20–40% of raw episodes are rejected by automated filters. This is normal and expected — do not try to use everything.
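The smoothness and duration filters above reduce to a few lines of NumPy. A minimal sketch, with illustrative function names and the same 90th-percentile / 2σ thresholds applied by the caller:

```python
import numpy as np

def mean_abs_jerk(positions, dt):
    """Mean absolute jerk (third finite difference of position) over a
    joint trajectory of shape (T, n_joints), sampled every `dt` seconds."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.mean(np.abs(jerk)))

def duration_outliers(durations, n_sigma=2.0):
    """Indices of episodes whose duration lies more than n_sigma standard
    deviations from the per-task mean (likely recoveries or incompletes)."""
    d = np.asarray(durations, dtype=float)
    mu, sigma = d.mean(), d.std()
    return np.flatnonzero(np.abs(d - mu) > n_sigma * sigma)
```

Episodes whose `mean_abs_jerk` exceeds the dataset's 90th percentile go to manual review; `duration_outliers` flags the rest for the same queue.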

Stage 3 — Annotation

Annotations enrich episodes with structured labels that enable better training and analysis.

  • Language instruction: A natural-language string describing the task. Standardize vocabulary across operators: "pick up the &lt;object&gt; and place it in the &lt;container&gt;." Used by VLA (Vision-Language-Action) models and for task conditioning in multi-task policies.
  • Success score (0–100): A continuous quality score from the automated classifier, complementing the binary success label. Useful for curriculum learning — start training on high-scoring episodes.
  • Phase segmentation: Label the semantic phase of each timestep: reach / grasp / transport / place / release. Enables phase-conditioned policies and per-phase analysis of failure modes.
  • Failure mode label (optional): For failed episodes: grasp_failure / drop / wrong_object / timeout. Aggregate statistics guide hardware and task design improvements.
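A single structured record can hold all four annotation types. The sketch below uses a dataclass with illustrative field names; the phase vocabulary and score range come from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional

PHASES = ("reach", "grasp", "transport", "place", "release")

@dataclass
class EpisodeAnnotation:
    """Structured annotations for one episode (field names illustrative)."""
    episode_id: str
    instruction: str                       # standardized language instruction
    success: bool                          # binary label
    success_score: int                     # 0-100, from the automated classifier
    phase_bounds: dict = field(default_factory=dict)   # phase -> (start, end) timestep
    failure_mode: Optional[str] = None     # e.g. "grasp_failure"; failed episodes only

    def validate(self):
        assert 0 <= self.success_score <= 100
        assert all(p in PHASES for p in self.phase_bounds)
        assert self.failure_mode is None or not self.success
        return self
```

Keeping annotations in one validated record per episode makes the later manifest and format-conversion stages straightforward to automate.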
Stage 4 — Format Conversion

Different training frameworks expect different data formats. The conversion stage produces standardized exports.

  • Source of truth: HDF5. Always maintain the original HDF5 files. Never delete them after conversion — the source of truth enables re-conversion when new formats emerge.
  • RLDS (TensorFlow Datasets): Required format for Octo, RT-X, and Open X-Embodiment. Convert using the rlds_dataset_builder tools. Each episode becomes a TFDS episode with standardized step structure.
  • LeRobot (HuggingFace Parquet): Parquet-based format compatible with the HuggingFace datasets ecosystem. Used by LeRobot, ACT-fork training scripts, and many community implementations.
  • Zarr (Diffusion Policy native): Memory-mappable array store used by the original Diffusion Policy codebase for fast random access during training.
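At the core of every converter is the same mapping: per-episode arrays flattened into a sequence of step records. The sketch below follows the RLDS step-field convention (`is_first`/`is_last`/`is_terminal`); the function itself is illustrative, and the real TFDS serialization is handled by the rlds_dataset_builder tools.

```python
import numpy as np

def episode_to_steps(images, joint_states, actions, instruction):
    """Yield RLDS-style step dicts from per-episode arrays.

    All arrays share the same length T (streams are assumed to have been
    resampled to a common rate upstream)."""
    T = len(actions)
    for t in range(T):
        yield {
            "observation": {
                "image": images[t],
                "state": joint_states[t],
            },
            "action": actions[t],
            "language_instruction": instruction,
            "is_first": t == 0,            # episode boundary markers
            "is_last": t == T - 1,
            "is_terminal": t == T - 1,
        }
```

Because the step structure is independent of the storage backend, the same generator can feed the RLDS, LeRobot, and Zarr exporters.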

Stage 5 — Versioning and Provenance

Dataset reproducibility requires strict versioning. Models should be trainable from a dataset version tag alone — not from "the files in this folder right now."

  • Episode immutability: Once an episode passes quality filtering and is added to the dataset, it is never modified. Corrections create new episodes; old ones are archived, not deleted.
  • Hash-based episode IDs: Compute SHA-256 of the raw HDF5 file. Use as the episode ID — guarantees that the ID uniquely identifies the content, and any corruption is detectable.
  • Dataset manifests: A JSON file per dataset version listing all episode IDs, their metadata, split assignments (train/val/test), and filter criteria. Check this file into git for full provenance.
  • DVC (Data Version Control): For large datasets, DVC tracks which data files correspond to which git commit. Alternative: store manifests in git and data files in S3/GCS with content-addressed keys.
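Hash-based IDs and manifests need only the standard library. A minimal sketch (function names and the manifest schema are illustrative):

```python
import hashlib
import json
from pathlib import Path

def episode_id(h5_path):
    """Content-addressed episode ID: SHA-256 of the raw HDF5 file bytes."""
    h = hashlib.sha256()
    with open(h5_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def write_manifest(path, version, episodes):
    """Write a dataset-version manifest; `episodes` maps episode_id to
    metadata such as {"split": "train", "task": "pick_place"}."""
    manifest = {"version": version, "episodes": episodes}
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Because the ID is a pure function of the file contents, re-hashing on read doubles as a corruption check, and the sorted JSON manifest diffs cleanly in git.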

Stage 6 — Training Integration

The final stage feeds the dataset into the model training loop efficiently.

  • Stratified sampling: Sample mini-batches with equal representation of task variants, object types, and operators. Prevents the model from over-fitting to the majority class.
  • Online augmentation: Apply data augmentation during the forward pass (not pre-computed) to keep storage requirements manageable: color jitter (±15% brightness/contrast/saturation), random crop (±10% of image size), Gaussian noise on joint states (σ=0.001 rad).
  • Episode weighting: Weight episodes by success score — high-quality successes receive higher sampling weight during training. Typical weighting: success_score² / Σ(success_score²).
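The weighting formula and stratified sampling above can be sketched with NumPy. Function names and the grouping scheme are illustrative, not a specific framework's API:

```python
import numpy as np

def episode_weights(success_scores):
    """Sampling weights proportional to success_score², normalized to sum to 1
    (the success_score² / Σ(success_score²) weighting above)."""
    s = np.asarray(success_scores, dtype=float) ** 2
    return s / s.sum()

def stratified_batch(rng, episode_ids, strata, batch_size):
    """Sample a mini-batch with equal representation per stratum
    (stratum = task variant, object type, or operator)."""
    groups = {}
    for eid, stratum in zip(episode_ids, strata):
        groups.setdefault(stratum, []).append(eid)
    per_group = max(1, batch_size // len(groups))
    batch = []
    for members in groups.values():
        batch.extend(rng.choice(members, size=per_group, replace=True))
    return batch[:batch_size]
```

In practice the two combine: stratify first, then weight within each stratum, so rare task variants are seen often but their low-quality episodes are not.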

The SVRC data platform automates Stages 2 through 5 of this pipeline, providing a web interface for quality review, annotation, and format export.
