How to Collect Robot Training Data
Complete guide to collecting high-quality robot training data via teleoperation — hardware setup, LeRobot recording, RLDS format, quality scoring, and dataset preparation for VLA fine-tuning.
Prerequisites
- A robot arm (OpenArm, SO-100, Aloha, Koch, UR5, or similar)
- At least one RGB camera (wrist-mounted preferred) + one external camera
- A teleoperation input device (leader arm, SpaceMouse, or keyboard)
- ROS2 Humble installed (see our ROS2 setup guide)
- Python 3.10+ with pip
- Ubuntu 22.04 with NVIDIA GPU (for visualization and future training)
What you will build
By the end of this tutorial, you will have a complete data collection pipeline: teleoperation hardware configured, LeRobot recording episodes of your robot performing tasks, quality-scored demonstrations, and a dataset in RLDS format uploaded to HuggingFace Hub — ready for VLA fine-tuning.
Data Collection Pipeline
Teleop input (leader arm / SpaceMouse) → joint states + actions → wrist + external RGB(D) camera streams → record & sync → HuggingFace Hub
Hardware Requirements
A complete data collection setup has four components. Here is what you need and our recommended options:
| Component | Recommended | Budget |
|---|---|---|
| Robot arm | OpenArm ($2,400) or Aloha ($20K+) | SO-100 ($200 DIY kit) |
| Wrist camera | Intel RealSense D405 ($300) | USB webcam ($30) |
| External camera | Intel RealSense D435 ($350) | Logitech C920 ($70) |
| Teleop device | Leader arm (matched pair) | 3Dconnexion SpaceMouse ($130) |
Install LeRobot
LeRobot is Hugging Face's open-source toolkit for robot learning. Install it in a virtual environment to avoid dependency conflicts.
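As a minimal sketch, you can create the isolated environment from Python; the package name `lerobot` matches the PyPI release, but pin versions as your project requires. The install command itself is shown as a comment rather than executed:

```python
# Create an isolated virtual environment for LeRobot so its pinned
# dependencies don't conflict with system packages.
import venv

venv.create("lerobot-env", with_pip=True)
# Then, inside the environment:
#   lerobot-env/bin/pip install lerobot
#   lerobot-env/bin/python -c "import lerobot"   # sanity check
```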
Configure Your Robot in LeRobot
LeRobot uses YAML configuration files to define your robot's properties. Create a config for your specific hardware.
Camera device indices (e.g. `index: 0`, `index: 2`) vary by system. Run `v4l2-ctl --list-devices` to find the correct indices for your cameras. Get this right before recording — wrong camera indices are the most common setup error.
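For illustration, a hypothetical config along these lines (field names vary between LeRobot versions; check the sample configs shipped with the library before copying):

```yaml
# Illustrative sketch only; not a verbatim LeRobot schema.
robot:
  type: so100
  port: /dev/ttyUSB0
cameras:
  wrist:
    index: 0        # find with: v4l2-ctl --list-devices
    fps: 30
    width: 640
    height: 480
  external:
    index: 2
    fps: 30
    width: 640
    height: 480
```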
Set Up Teleoperation Interface
You have two main options for controlling the robot during demonstrations: a leader-follower arm pair (best quality) or a SpaceMouse (more accessible).
Option A: Leader-Follower Arms (recommended)
Option B: SpaceMouse
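A SpaceMouse reports six axes (three translation, three rotation). A minimal sketch of mapping those raw axes to a clipped end-effector velocity command; the scaling and deadband values here are illustrative, not LeRobot defaults:

```python
import numpy as np

def spacemouse_to_twist(axes, max_lin=0.10, max_ang=0.5, deadband=0.05):
    """Map raw 6-axis input in [-1, 1] to a clipped end-effector twist.

    axes = (x, y, z, roll, pitch, yaw); the deadband suppresses sensor drift
    so the arm holds still when the puck is released.
    """
    a = np.asarray(axes, dtype=float)
    a = np.where(np.abs(a) < deadband, 0.0, a)   # ignore tiny drift
    lin = np.clip(a[:3], -1, 1) * max_lin        # m/s
    ang = np.clip(a[3:], -1, 1) * max_ang        # rad/s
    return np.concatenate([lin, ang])
```

Sending velocity commands (rather than absolute poses) keeps the mapping robust when the operator pauses mid-demonstration.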
Record First Episodes
An "episode" is one complete demonstration of a task from start to finish. Start with a simple task like picking up a block and placing it in a target zone.
Each episode is saved with synchronized data: joint states, the commanded actions, frames from each camera, and per-frame timestamps.
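The recording loop itself is a fixed-rate sampler. A minimal sketch, with hypothetical `read_obs`/`read_action` callbacks standing in for your camera and teleop interfaces:

```python
import time

def record_episode(read_obs, read_action, fps=30, max_steps=300):
    """Fixed-rate recording loop: each step stores one synchronized
    (timestamp, observation, action) triple."""
    episode, dt = [], 1.0 / fps
    t_next = time.monotonic()
    for _ in range(max_steps):
        episode.append({
            "timestamp": time.monotonic(),
            "observation": read_obs(),   # joint states + camera frames
            "action": read_action(),     # teleop command at this instant
        })
        t_next += dt
        time.sleep(max(0.0, t_next - time.monotonic()))  # hold the target rate
    return episode
```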
Review and Quality-Score Episodes
Not every episode is usable. Review your recordings, remove failed demonstrations, and compute quality metrics.
Mark failed episodes for removal, then delete them before converting the dataset.
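As one way to score trajectory smoothness, you can use mean squared jerk (the third derivative of joint position); lower is smoother. A sketch with an illustrative threshold, not a LeRobot API:

```python
import numpy as np

def smoothness_score(joint_traj, fps=30):
    """Mean squared jerk of a (T x DoF) joint trajectory; lower is smoother."""
    jerk = np.diff(joint_traj, n=3, axis=0) * fps**3   # finite-difference jerk
    return float(np.mean(jerk**2))

def filter_episodes(episodes, max_jerk=1e4):
    """Keep only successful, smooth demonstrations (threshold illustrative)."""
    return [ep for ep in episodes
            if ep["success"] and smoothness_score(ep["joints"]) < max_jerk]
```

Tune `max_jerk` by inspecting scores on a few episodes you have judged by eye.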
Convert to RLDS Format
RLDS (Reinforcement Learning Datasets) is the standard format for VLA model training. Convert your LeRobot dataset to RLDS for compatibility with OpenVLA, Octo, and RT-2.
The RLDS dataset contains TFRecord files storing, per timestep: observations (camera images plus proprioceptive state), the action, a reward, and episode-boundary flags.
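Sketched as a plain dict (the field names follow the common Open X-Embodiment convention; exact schemas vary by dataset builder), one timestep looks like this:

```python
def to_rlds_step(obs, action, is_first, is_last, language_instruction):
    """One RLDS-style timestep as a plain dict (illustrative field names)."""
    return {
        "observation": {
            "image": obs["external_rgb"],        # H x W x 3 uint8
            "wrist_image": obs["wrist_rgb"],
            "state": obs["joint_positions"],     # proprioceptive state
        },
        "action": action,                        # e.g. 7-DoF delta or joint target
        "language_instruction": language_instruction,
        "is_first": is_first,                    # episode boundary flags
        "is_last": is_last,
        "is_terminal": is_last,
        "reward": 1.0 if is_last else 0.0,       # sparse success reward
    }
```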
Upload to HuggingFace Hub
Push your dataset to HuggingFace Hub for versioning, easy sharing, and direct loading during training.
Use versioned dataset names (e.g. suffixes -v1, -v2). When you collect more data or fix quality issues, push as a new version. This makes it easy to reproduce training runs and track improvements.
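A minimal upload sketch using `huggingface_hub` (the repo id is a hypothetical placeholder; run `huggingface-cli login` first):

```python
def push_dataset(local_dir, repo_id, version="v1"):
    """Upload a dataset folder to HuggingFace Hub as a dataset repo.

    repo_id like "your-username/pick-place-rlds-v1" (hypothetical name);
    version the repo name itself, as suggested above, to keep runs reproducible.
    """
    from huggingface_hub import HfApi  # requires prior `huggingface-cli login`
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id,
                      repo_type="dataset", commit_message=f"dataset {version}")
```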
Run Quality Checks
Before training, run a final set of quality checks to catch issues that could waste GPU hours.
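A sketch of two such checks, consistent frame resolution (the shape-mismatch failure mode from the troubleshooting section) and overall success rate; thresholds are illustrative:

```python
import numpy as np

def check_dataset(episodes, expected_hw=(480, 640), min_success=0.9):
    """Pre-training sanity checks; returns a list of problem descriptions."""
    problems = []
    for i, ep in enumerate(episodes):
        for frame in ep["frames"]:
            if frame.shape[:2] != expected_hw:   # mixed resolutions break RLDS
                problems.append(f"episode {i}: frame shape {frame.shape[:2]}")
                break
    success = np.mean([ep["success"] for ep in episodes])
    if success < min_success:
        problems.append(f"success rate {success:.0%} below {min_success:.0%}")
    return problems
```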
Prepare for VLA Fine-Tuning
Structure your dataset for training by creating proper splits and task descriptions.
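One way to sketch the split step: split at the episode level, not the timestep level, so frames from one demonstration never leak across splits (the 10% validation fraction is illustrative):

```python
import random

def make_splits(episode_ids, val_frac=0.1, seed=0):
    """Deterministic episode-level train/val split."""
    ids = sorted(episode_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_val = max(1, int(len(ids) * val_frac))  # always hold out at least one
    return {"val": ids[:n_val], "train": ids[n_val:]}
```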
Ready to fine-tune a VLA model?
Your dataset is prepared. Continue to the VLA fine-tuning guide to train OpenVLA or pi0 on your data. Expect to invest $150–400 in GPU compute for a typical fine-tuning run.
Data Collection Cost Breakdown
| Cost Component | Per Hour | Notes |
|---|---|---|
| Operator labor | $50–80 | Skilled teleoperator, varies by market |
| Hardware depreciation | $15–30 | Amortized over 2,000 hrs of use |
| Quality review | $20–40 | Episode filtering and scoring |
| Compute and storage | $8–15 | Recording, processing, storage |
| Facility overhead | $25–35 | Lab space, lighting, safety |
| Total | $118–200 | |
Need help collecting data?
SVRC Data Services offers managed robot data collection with professional teleoperators, quality assurance, and RLDS-ready delivery. Starting at $150/hr.
Troubleshooting
Camera feed is black or frozen
Check the camera index with `v4l2-ctl --list-devices`. Try unplugging and replugging the USB cable. If using RealSense, verify with `realsense-viewer`. USB 3.0 ports are required for RealSense cameras.
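As a quick first check from Python, you can list the V4L2 device nodes the kernel has enumerated; if your camera is missing here, it is a cabling or driver problem, not a LeRobot config problem:

```python
import glob

# Linux-only: each attached UVC camera typically exposes /dev/videoN nodes.
devices = sorted(glob.glob("/dev/video*"))
print(devices or "no cameras enumerated")
```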
Robot arm not responding during recording
Verify the serial port with `ls /dev/ttyUSB*`. Check permissions: `sudo chmod 666 /dev/ttyUSB0`. Make sure no other process is using the port (close RViz, other terminals).
Frame drops during recording (inconsistent fps)
Lower camera resolution to 480x360. Close background applications. Use a dedicated USB bus for each camera (check with `lsusb -t`). Record to an SSD, not an HDD.
RLDS conversion fails with shape mismatch
This usually means episodes have different observation dimensions. Check that all cameras used the same resolution across all episodes. Re-record any episodes with mismatched dimensions.
HuggingFace upload times out
Large datasets (50+ GB) can time out on slow connections. Use `huggingface-cli upload` with `--revision` to upload in parts, or compress videos before upload. Consider uploading from a cloud VM with faster bandwidth.
Frequently Asked Questions
How many episodes do I need?
For most single-task manipulation policies (pick and place, pouring, etc.), 50–200 high-quality episodes are enough for ACT or Diffusion Policy. For VLA fine-tuning (OpenVLA, pi0), plan for 300–1,200 episodes depending on task complexity and required generalization.
How much does robot data collection cost?
The fully-loaded cost of robot teleoperation data collection ranges from $118–200 per hour in 2026, depending on operator skill level and hardware. This includes operator time, hardware depreciation, compute, and quality review. SVRC Data Services offers managed collection starting at $150/hr.
What cameras do I need?
At minimum, use one wrist-mounted RGB camera (640x480, 30fps) and one external third-person view camera. For best results, add an RGBD camera like the Intel RealSense D435 for depth data. Many VLA models expect both wrist and external camera views.
What is RLDS format?
RLDS (Reinforcement Learning Datasets) is a standardized format from Google DeepMind for storing robot learning data. It stores episodes as sequences of observations, actions, and rewards in TFRecord files. Most VLA models (RT-2, OpenVLA, Octo) expect RLDS format, making it the de facto standard for interoperability.
How do I measure data quality?
Track three quality metrics: (1) Task success rate — only keep episodes where the task was completed successfully. (2) Trajectory smoothness — filter out jerky or erratic demonstrations. (3) Diversity — vary object positions, orientations, and lighting across episodes. Aim for 90%+ success rate in your final dataset.
Was this tutorial helpful?