Why Real-World Robot Data Matters

The single biggest bottleneck in deploying learned robot manipulation policies is not model architecture — it is data. Simulation has improved dramatically, but the sim-to-real gap remains stubbornly wide for contact-rich tasks. Objects deform, slip, and interact in ways that even the best physics engines approximate poorly. Lighting, surface textures, and camera noise in real environments create visual distributions that synthetic rendering cannot fully replicate.

The numbers tell the story. In 2023, Google DeepMind's RT-2 achieved 62% success on novel object manipulation using 130,000 real-world demonstrations. Toyota Research Institute's Diffusion Policy reached 94% success on trained tasks with just 200 demonstrations — but required those demonstrations to be high-quality, consistent, and collected on the exact hardware configuration used for deployment. The lesson: real-world data is not optional, and data quality matters more than data quantity for most practical tasks.

This guide covers everything ML teams need to know about collecting robot demonstration data: teleoperation methods, data formats, quality control, cost estimation, and scaling strategies.

Teleoperation Methods Compared

Teleoperation — a human operator remotely controlling a robot to demonstrate desired behaviors — is the dominant method for collecting robot manipulation data. The choice of teleoperation interface directly affects data quality, collection speed, operator fatigue, and cost. See our dedicated teleoperation guide for deeper technical detail.

VR Headset Teleoperation (Meta Quest 3)

How it works: The operator wears a VR headset and sees the robot's camera feed (or a mixed-reality view). Hand controller positions are mapped to end-effector Cartesian positions via inverse kinematics.

  • Latency: 15-40 ms end-to-end (WiFi + IK computation + arm command)
  • Cost: $500 for the headset; no additional hardware beyond the robot arm
  • Throughput: 15-25 demonstrations per hour for tabletop manipulation
  • Data quality: Good for gross manipulation (picking, placing, sorting). Lower precision for contact-rich tasks (insertion, threading)
  • Operator fatigue: VR headset causes nausea in some operators after 60-90 minutes. Plan for 15-minute breaks every hour
  • Best for: Teams on a budget, tasks that do not require sub-millimeter precision, rapid prototyping of data collection pipelines
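To make the controller-to-arm mapping concrete, here is a minimal sketch of the common "clutched" delta-pose scheme: while the operator holds the trigger, the controller's positional displacement is scaled and added to the current end-effector target. Orientation handling and the actual inverse-kinematics solve are omitted; the function name and scale factor are illustrative, not from any specific SDK.

```python
import numpy as np

def controller_to_target(ee_pos: np.ndarray,
                         ctrl_prev: np.ndarray,
                         ctrl_now: np.ndarray,
                         scale: float = 1.0) -> np.ndarray:
    """Map a controller displacement to a new end-effector target position.

    ee_pos:    current end-effector Cartesian position (meters)
    ctrl_prev: controller position at the previous frame
    ctrl_now:  controller position at the current frame
    scale:     motion scaling (< 1.0 for fine work, > 1.0 for large workspaces)
    """
    delta = (ctrl_now - ctrl_prev) * scale
    return ee_pos + delta
```

The returned target would then be handed to an IK solver to produce joint commands; running this per frame is what contributes the "IK computation" share of the 15-40 ms latency budget.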

Leader-Follower Arms

How it works: The operator physically moves a lightweight "leader" arm; a heavier "follower" arm replicates the leader's joint positions in real time. This is the approach used by ALOHA, Mobile ALOHA, and many university research setups.

  • Latency: 3-8 ms (direct USB bus, no network)
  • Cost: $3,000-$8,000 for the leader arm (in addition to the follower/production arm)
  • Throughput: 20-35 demonstrations per hour (fastest method)
  • Data quality: Highest for joint-space tasks. The operator has proprioceptive feedback through the physical leader arm. Gravity compensation makes extended sessions comfortable
  • Operator fatigue: With gravity compensation, operators can work 2-3 hour sessions. Without it, fatigue sets in after 45 minutes
  • Best for: Serious data collection campaigns, contact-rich manipulation, bimanual tasks
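The core of a leader-follower setup is a fixed-rate loop that copies leader joint positions to the follower. The sketch below assumes hypothetical driver methods (`read_joint_positions`, `command_joint_positions`); substitute your arm SDK's equivalents, and expect real implementations to add joint-limit clamping and velocity smoothing.

```python
import time

def mirror_loop(leader, follower, rate_hz: float = 200.0, max_steps=None):
    """Copy leader joint positions to the follower at a fixed rate.

    `leader` and `follower` are stand-ins for your arm driver objects.
    Running at 200 Hz over a direct USB bus is what keeps end-to-end
    latency in the 3-8 ms range quoted above.
    """
    period = 1.0 / rate_hz
    steps = 0
    while max_steps is None or steps < max_steps:
        q = leader.read_joint_positions()    # joint angles, radians
        follower.command_joint_positions(q)  # follower tracks leader 1:1
        steps += 1
        time.sleep(period)
```

Because the commanded positions are exactly what gets logged as actions, this loop doubles as the data recorder in setups like ALOHA.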

Exoskeleton / Haptic Gloves

How it works: The operator wears a hand exoskeleton or haptic glove that tracks finger joint angles and maps them to a dexterous robot hand.

  • Latency: 5-15 ms
  • Cost: $8,000-$20,000 per pair of gloves
  • Throughput: 8-15 demonstrations per hour (dexterous tasks are slower)
  • Data quality: Essential for dexterous manipulation (multi-finger grasping, in-hand rotation, tool use). Not useful for whole-arm tasks where finger control is irrelevant
  • Operator fatigue: High. Gloves are physically demanding. 45-minute session limit with 10-minute breaks
  • Best for: Dexterous manipulation research, hand-object interaction datasets

Joystick / Gamepad

How it works: The operator uses joystick axes to control end-effector velocity or joint velocities. Typically a 6-axis SpaceMouse ($200) or a dual-analog gamepad.

  • Latency: 1-5 ms (direct USB HID)
  • Cost: $30-$300
  • Throughput: 5-12 demonstrations per hour (slowest method due to limited bandwidth)
  • Data quality: Lower — the operator cannot directly specify 6-DOF poses, leading to jerky trajectories and suboptimal paths. Demonstrations are less consistent between operators
  • Best for: Quick prototyping, mobile robot navigation, tasks where arm trajectory smoothness is not critical

Method           Latency   Hardware Cost  Demos/Hour  Data Quality
VR Headset       15-40 ms  $500           15-25       Good
Leader-Follower  3-8 ms    $3K-$8K        20-35       Highest
Exo Gloves       5-15 ms   $8K-$20K       8-15        High (dexterous)
Joystick         1-5 ms    $30-$300       5-12        Lower

Data Formats

Choosing the right data format early prevents painful migration later. The robotics community has converged on three primary formats, each with different strengths.

HDF5 (Hierarchical Data Format 5)

HDF5 is the most widely used format for robot demonstration data. It stores heterogeneous data (joint positions, images, metadata) in a single file with efficient random access. The ACT and Diffusion Policy codebases use HDF5 natively.

Structure: Each episode is a group containing datasets for /observations/qpos (joint positions), /observations/images/* (camera frames), /action (commanded positions), and metadata attributes (task name, operator ID, timestamp). Images are stored as compressed byte arrays with chunked access for efficient frame-level retrieval.
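A sketch of writing one episode in this layout with h5py follows. Array shapes and compression settings are illustrative; this version stores raw frames with per-frame gzip chunking, whereas production pipelines often store JPEG-encoded byte arrays to save space.

```python
import h5py
import numpy as np

def write_episode(path, qpos, actions, images, metadata):
    """Write one demonstration episode in the HDF5 layout described above.

    qpos:     (T, n_joints) float array of observed joint positions
    actions:  (T, n_joints) float array of commanded positions
    images:   dict of camera name -> (T, H, W, 3) uint8 frames
    metadata: dict of scalar/string attributes (task name, operator ID, ...)
    """
    with h5py.File(path, "w") as f:
        f.create_dataset("observations/qpos", data=qpos, compression="gzip")
        f.create_dataset("action", data=actions, compression="gzip")
        for cam, frames in images.items():
            # chunk by single frame so training code can fetch frame t
            # without decompressing the whole stream
            f.create_dataset(
                f"observations/images/{cam}",
                data=frames,
                compression="gzip",
                chunks=(1, *frames.shape[1:]),
            )
        for key, value in metadata.items():
            f.attrs[key] = value
```

Reading is symmetric: open the file with h5py, index `f["observations/images/wrist"][t]` for random access to frame t.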

Pros: Mature library support (h5py, HDFView), efficient compression (LZF, GZIP), random access to any frame. Cons: Not natively streamable, large files can be slow to open on networked storage, no built-in version control.

RLDS (Reinforcement Learning Datasets)

Google's RLDS format uses TFRecord files with a standardized schema. It is the native format for the Open X-Embodiment dataset, the largest collection of robot manipulation data (2.2M+ episodes across 22 robot types).

Pros: Standardized schema enables cross-dataset training, native TensorFlow integration, streamable from cloud storage. Cons: TensorFlow dependency, less intuitive for PyTorch users, sequential access patterns (no efficient random frame access).

LeRobot Format

Hugging Face's LeRobot format uses Parquet files for tabular data (joint positions, actions) and MP4 video files for camera observations. It is designed for sharing on the Hugging Face Hub and integrates with the LeRobot training framework.

Pros: Hugging Face Hub integration (upload/download/version with git-lfs), compact video storage, human-readable metadata, growing community adoption. Cons: MP4 compression introduces artifacts (lossy), video decoding adds latency during training, newer format with less mature tooling.

Recommendation: Collect in HDF5 for maximum compatibility. Use lerobot.scripts.convert_dataset to export to LeRobot format for sharing, and to RLDS for cross-embodiment training.

Data Quality Checklist

Collecting 1,000 bad demonstrations is worse than collecting 50 good ones. Quality directly determines policy performance. Use this checklist to audit your data before training.

1. Task Success Rate

Every episode should complete the full task successfully. Failed demonstrations (dropped objects, missed grasps, incomplete trajectories) actively harm policy training. Implement real-time quality review: an observer (human or automated) marks each episode as success/failure immediately after collection. Target: 95%+ success rate in your dataset. If operators are failing more than 5% of attempts, the task setup or teleoperation interface needs improvement, not more data.

2. Trajectory Consistency

For a given task, demonstrations should follow similar (but not identical) strategies. If operator A picks up objects from the left side and operator B picks from the right, the policy will learn an ambiguous bimodal distribution that fails in both cases. Standardize approach strategies across operators: define the grasp point, approach angle, and placement position before collection begins.

3. Scene Diversity

Systematically vary: object positions (random within a defined region), object instances (3-5 instances of the same category with different shapes/colors), lighting conditions (overhead, angled, dim), background clutter (add/remove distractor objects). Without diversity, the policy overfits to the exact scene layout of your lab.
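One way to enforce this variation is to sample a scene configuration before each episode and log it in the metadata. The region bounds, object instances, and lighting presets below are placeholders for your own task setup.

```python
import random

def sample_scene(rng: random.Random,
                 region=((0.30, 0.60), (-0.20, 0.20)),
                 instances=("mug_red", "mug_blue", "mug_tall"),
                 lighting=("overhead", "angled", "dim")):
    """Sample a randomized scene configuration for the next episode.

    region:    ((x_lo, x_hi), (y_lo, y_hi)) placement bounds in meters,
               robot base frame (illustrative values)
    instances: object instances of the same category
    lighting:  lighting presets to cycle through
    """
    (x_lo, x_hi), (y_lo, y_hi) = region
    return {
        "object_xy": (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi)),
        "instance": rng.choice(instances),
        "lighting": rng.choice(lighting),
        "distractors": rng.randint(0, 3),  # number of clutter objects
    }
```

Seeding the generator per episode makes the sampled configurations reproducible, so a failing evaluation scene can be reconstructed exactly.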

4. Temporal Consistency

All sensor streams (cameras, joint states, actions) must be timestamped and synchronized. Verify synchronization by recording a known event (arm contacts table) and checking that the contact appears in all streams within 10 ms. Dropped frames in camera streams create temporal gaps that degrade policy training — monitor frame drop rates and reject episodes with >2% frame loss.
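The frame-drop check can be automated directly from camera timestamps. This sketch counts inter-frame gaps longer than the nominal period as drops and applies the 2% rejection threshold from the checklist; the 1.5x heuristic implied by rounding is a common convention, not a standard.

```python
import numpy as np

def check_episode_timing(cam_timestamps: np.ndarray,
                         expected_fps: float = 30.0,
                         max_drop_rate: float = 0.02) -> bool:
    """Return True if the camera stream's frame-drop rate is acceptable.

    A gap of roughly 2 nominal periods means one frame was dropped,
    3 periods means two, and so on. The drop rate is compared against
    the 2% threshold recommended in the checklist above.
    """
    period = 1.0 / expected_fps
    gaps = np.diff(cam_timestamps)
    dropped = np.sum(np.maximum(np.round(gaps / period) - 1, 0))
    expected_frames = len(cam_timestamps) + dropped
    return bool(dropped / expected_frames <= max_drop_rate)
```

The same diff-based check, run on joint-state and action timestamps, also catches recorder stalls that a per-camera counter would miss.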

5. Metadata Completeness

Every episode needs: task name, operator ID, timestamp, robot serial number, camera configuration, success/failure label, and free-text notes for anomalies. This metadata is essential for dataset management, quality analysis, and reproducibility. It costs nothing to record and is painful to reconstruct later.

Cost Breakdown

Understanding the true cost of data collection helps ML teams budget appropriately and identify where to invest for maximum impact.

Hardware Costs (One-Time)

  • Robot arm + gripper: $2,800 - $45,000 depending on platform (see our robot arm guide)
  • Teleoperation interface: $500 (VR) to $8,000 (leader arm)
  • Cameras: $600 - $2,000 (2-4 Intel RealSense D405/D455 cameras)
  • Compute: $2,000 - $5,000 (workstation for recording and real-time visualization)
  • Workspace fixtures: $500 - $2,000 (mounting hardware, task fixtures, lighting)

Typical total hardware cost for one collection station: $8,000 - $60,000.

Operating Costs (Per-Episode)

  • Operator time: At $25-50/hour and 20 demos/hour, each episode costs $1.25 - $2.50 in operator labor
  • Quality review: 30 seconds per episode at $20/hour = $0.17 per episode
  • Scene reset: 15-60 seconds between episodes, included in operator throughput
  • Storage: ~50 MB per episode (3 cameras, 30 fps, 15-second episodes) = $0.001 per episode on cloud storage

Effective cost per high-quality episode: $1.50 - $3.00 (excluding hardware amortization).
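The arithmetic behind that range can be packaged as a small cost model, useful for budgeting your own rates and throughput. Default values mirror the figures above; hardware amortization is deliberately excluded.

```python
def episode_cost(operator_rate_hr: float,
                 demos_per_hour: float,
                 review_seconds: float = 30.0,
                 review_rate_hr: float = 20.0,
                 storage_cost: float = 0.001) -> float:
    """Per-episode operating cost in dollars (hardware amortization excluded).

    operator_rate_hr: operator wage, $/hour
    demos_per_hour:   collection throughput (scene reset included)
    review_seconds:   quality-review time per episode
    review_rate_hr:   reviewer wage, $/hour
    storage_cost:     cloud storage, $/episode
    """
    operator = operator_rate_hr / demos_per_hour
    review = review_rate_hr * (review_seconds / 3600.0)
    return operator + review + storage_cost

# Reproducing the range in the text:
low = episode_cost(25, 20)   # ~ $1.42
high = episode_cost(50, 20)  # ~ $2.67
```

Rounding those endpoints up for overhead gives the $1.50 - $3.00 figure quoted above.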

Campaign Cost Examples

  • Small research dataset (200 episodes): $300 - $600 in operator time. 1-2 days of collection.
  • Production-ready single-task dataset (2,000 episodes): $3,000 - $6,000 in operator time. 2-3 weeks with one station.
  • Multi-task dataset (10,000 episodes, 5 tasks): $15,000 - $30,000 in operator time. 2-3 months with one station, 2-4 weeks with four parallel stations.

Scaling from 100 to 10,000 Demonstrations

The path from a proof-of-concept dataset to a production-scale dataset requires deliberate infrastructure investment. Here is the scaling playbook.

Stage 1: Proof of Concept (50-200 episodes)

One operator, one robot, one task. The goal is to validate your teleoperation setup, data pipeline, and training code. Train an ACT or Diffusion Policy model on this data and evaluate on the real robot. If you cannot get 70%+ success with 100 demonstrations of a simple task, the problem is data quality or pipeline bugs, not data quantity. Fix those first.

Stage 2: Single-Task Production (500-2,000 episodes)

Add a second operator to increase throughput and validate that your collection protocol is operator-independent. Implement automated quality checks: episode length bounds, joint limit violations, success/failure classification. Set up a dataset management system to track collection progress, quality metrics, and operator performance.
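Two of those automated checks, episode length bounds and joint-limit violations, can be sketched as a single audit function. The bounds and limit format are illustrative; a third check for success/failure classification (e.g. verifying the object's final pose) would slot in alongside these.

```python
import numpy as np

def audit_episode(qpos: np.ndarray,
                  joint_limits: np.ndarray,
                  min_steps: int = 50,
                  max_steps: int = 1500) -> list:
    """Return a list of reasons to reject an episode (empty list = pass).

    qpos:         (T, n_joints) recorded joint positions
    joint_limits: (n_joints, 2) array of [lower, upper] bounds per joint
    min_steps/max_steps: illustrative episode-length bounds
    """
    problems = []
    if not (min_steps <= len(qpos) <= max_steps):
        problems.append(
            f"episode length {len(qpos)} outside [{min_steps}, {max_steps}]")
    below = qpos < joint_limits[:, 0]  # broadcasts per-joint lower bounds
    above = qpos > joint_limits[:, 1]  # broadcasts per-joint upper bounds
    if below.any() or above.any():
        problems.append("joint limit violation")
    return problems
```

Running this at ingest time, and logging the rejection reasons into the dataset management system, gives supervisors the per-operator quality metrics mentioned above.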

Stage 3: Multi-Station Scaling (2,000-10,000 episodes)

Deploy 2-4 parallel collection stations with identical hardware configurations. Use a central data pipeline that aggregates episodes from all stations, runs quality checks, and produces training-ready datasets nightly. At this scale, operator management becomes the bottleneck: recruit, train, and retain 4-8 operators who can maintain consistent quality over weeks of collection.

Key infrastructure investments at this stage:

  • Automated reset mechanisms: Reduce the 15-60 seconds of manual scene reset between episodes. Even saving 15 seconds per episode across 10,000 episodes saves 42 hours of operator time.
  • Real-time quality dashboards: Operators and supervisors need visibility into per-station throughput, success rates, and data quality metrics. The SVRC data platform provides this out of the box.
  • Standardized calibration procedures: With multiple stations, camera extrinsics and robot base calibration must be standardized. A 2 mm calibration error across stations introduces systematic noise that degrades policy training.

Stage 4: Cross-Task and Cross-Embodiment (10,000+ episodes)

At this scale, you are building a foundation dataset. Consider contributing to open datasets (Open X-Embodiment, DROID) for community benefit and citation impact. Use RLDS or LeRobot format for interoperability. Invest in task ontology and metadata standards so your dataset is navigable and reusable.

SVRC Data Collection Services

Silicon Valley Robotics Center operates dedicated data collection infrastructure for ML teams who need high-quality robot demonstration data without building their own collection pipeline.

  • Managed data collection: We collect demonstration data for your task on our hardware. You specify the task, objects, and success criteria; we deliver HDF5 or LeRobot-format datasets. Typical delivery: 500 episodes in 1 week.
  • Hardware packages: Pre-configured data collection stations (OpenArm + cameras + compute + software) ready to deploy in your lab.
  • Training and documentation: Comprehensive guides for every step from teleoperation setup to policy training.
  • Open datasets: Browse our library of publicly available robot manipulation datasets.