How to Collect Robot Training Data
Complete guide to collecting high-quality robot training data via teleoperation — hardware setup, LeRobot recording, RLDS format, quality scoring, and dataset preparation for VLA fine-tuning.
Prerequisites
- A robot arm (OpenArm, SO-100, Aloha, Koch, UR5, or similar)
- At least one RGB camera (wrist-mounted preferred) + one external camera
- A teleoperation input device (leader arm, SpaceMouse, or keyboard)
- ROS2 Humble installed (see our ROS2 setup guide)
- Python 3.10+ with pip
- Ubuntu 22.04 with NVIDIA GPU (for visualization and future training)
What you will build
By the end of this tutorial, you will have a complete data collection pipeline: teleoperation hardware configured, LeRobot recording episodes of your robot performing tasks, quality-scored demonstrations, and a dataset in RLDS format uploaded to HuggingFace Hub — ready for VLA fine-tuning.
Data Collection Pipeline
Teleop input (leader arm / SpaceMouse) → joint states + actions → wrist + external RGB(D) camera streams → record & sync → HuggingFace Hub
Hardware Requirements
A complete data collection setup has four components. Here is what you need and our recommended options:
| Component | Recommended | Budget |
|---|---|---|
| Robot arm | OpenArm ($2,400) or Aloha ($20K+) | SO-100 ($200 DIY kit) |
| Wrist camera | Intel RealSense D405 ($300) | USB webcam ($30) |
| External camera | Intel RealSense D435 ($350) | Logitech C920 ($70) |
| Teleop device | Leader arm (matched pair) | 3Dconnexion SpaceMouse ($130) |
Install LeRobot
LeRobot is Hugging Face's open-source toolkit for robot learning. Install it in a virtual environment to avoid dependency conflicts.
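As a minimal sketch, you can create the isolated environment from Python; the package name `lerobot` matches the PyPI release, but pin versions as your project requires. The install command itself is shown as a comment rather than executed:

```python
# Create an isolated virtual environment for LeRobot so its pinned
# dependencies don't conflict with system packages.
import venv

venv.create("lerobot-env", with_pip=True)
# Then, inside the environment:
#   lerobot-env/bin/pip install lerobot
#   lerobot-env/bin/python -c "import lerobot"   # sanity check
```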
Configure Your Robot in LeRobot
LeRobot uses YAML configuration files to define your robot's properties. Create a config for your specific hardware.
Camera device indices (e.g. `index: 0`, `index: 2`) vary by system. Run `v4l2-ctl --list-devices` to find the correct indices for your cameras. Get this right before recording — wrong camera indices are the most common setup error.
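For illustration, a hypothetical config along these lines (field names vary between LeRobot versions; check the sample configs shipped with the library before copying):

```yaml
# Illustrative sketch only; not a verbatim LeRobot schema.
robot:
  type: so100
  port: /dev/ttyUSB0
cameras:
  wrist:
    index: 0        # find with: v4l2-ctl --list-devices
    fps: 30
    width: 640
    height: 480
  external:
    index: 2
    fps: 30
    width: 640
    height: 480
```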
Set Up Teleoperation Interface
You have two main options for controlling the robot during demonstrations: a leader-follower arm pair (best quality) or a SpaceMouse (more accessible).
Option A: Leader-Follower Arms (recommended)
Option B: SpaceMouse
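A SpaceMouse reports six axes (three translation, three rotation). A minimal sketch of mapping those raw axes to a clipped end-effector velocity command; the scaling and deadband values here are illustrative, not LeRobot defaults:

```python
import numpy as np

def spacemouse_to_twist(axes, max_lin=0.10, max_ang=0.5, deadband=0.05):
    """Map raw 6-axis input in [-1, 1] to a clipped end-effector twist.

    axes = (x, y, z, roll, pitch, yaw); the deadband suppresses sensor drift
    so the arm holds still when the puck is released.
    """
    a = np.asarray(axes, dtype=float)
    a = np.where(np.abs(a) < deadband, 0.0, a)   # ignore tiny drift
    lin = np.clip(a[:3], -1, 1) * max_lin        # m/s
    ang = np.clip(a[3:], -1, 1) * max_ang        # rad/s
    return np.concatenate([lin, ang])
```

Sending velocity commands (rather than absolute poses) keeps the mapping robust when the operator pauses mid-demonstration.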
Record First Episodes
An "episode" is one complete demonstration of a task from start to finish. Start with a simple task like picking up a block and placing it in a target zone.
Each episode is saved with synchronized data: joint states, the commanded actions, frames from each camera, and per-frame timestamps.
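The recording loop itself is a fixed-rate sampler. A minimal sketch, with hypothetical `read_obs`/`read_action` callbacks standing in for your camera and teleop interfaces:

```python
import time

def record_episode(read_obs, read_action, fps=30, max_steps=300):
    """Fixed-rate recording loop: each step stores one synchronized
    (timestamp, observation, action) triple."""
    episode, dt = [], 1.0 / fps
    t_next = time.monotonic()
    for _ in range(max_steps):
        episode.append({
            "timestamp": time.monotonic(),
            "observation": read_obs(),   # joint states + camera frames
            "action": read_action(),     # teleop command at this instant
        })
        t_next += dt
        time.sleep(max(0.0, t_next - time.monotonic()))  # hold the target rate
    return episode
```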
Review and Quality-Score Episodes
Not every episode is usable. Review your recordings, remove failed demonstrations, and compute quality metrics.
Mark failed episodes for removal, then delete them before converting the dataset.
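As one way to score trajectory smoothness, you can use mean squared jerk (the third derivative of joint position); lower is smoother. A sketch with an illustrative threshold, not a LeRobot API:

```python
import numpy as np

def smoothness_score(joint_traj, fps=30):
    """Mean squared jerk of a (T x DoF) joint trajectory; lower is smoother."""
    jerk = np.diff(joint_traj, n=3, axis=0) * fps**3   # finite-difference jerk
    return float(np.mean(jerk**2))

def filter_episodes(episodes, max_jerk=1e4):
    """Keep only successful, smooth demonstrations (threshold illustrative)."""
    return [ep for ep in episodes
            if ep["success"] and smoothness_score(ep["joints"]) < max_jerk]
```

Tune `max_jerk` by inspecting scores on a few episodes you have judged by eye.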
Convert to RLDS Format
RLDS (Reinforcement Learning Datasets) is the standard format for VLA model training. Convert your LeRobot dataset to RLDS for compatibility with OpenVLA, Octo, and RT-2.
The RLDS dataset contains TFRecord files storing, per timestep: observations (camera images plus proprioceptive state), the action, a reward, and episode-boundary flags.
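Sketched as a plain dict (the field names follow the common Open X-Embodiment convention; exact schemas vary by dataset builder), one timestep looks like this:

```python
def to_rlds_step(obs, action, is_first, is_last, language_instruction):
    """One RLDS-style timestep as a plain dict (illustrative field names)."""
    return {
        "observation": {
            "image": obs["external_rgb"],        # H x W x 3 uint8
            "wrist_image": obs["wrist_rgb"],
            "state": obs["joint_positions"],     # proprioceptive state
        },
        "action": action,                        # e.g. 7-DoF delta or joint target
        "language_instruction": language_instruction,
        "is_first": is_first,                    # episode boundary flags
        "is_last": is_last,
        "is_terminal": is_last,
        "reward": 1.0 if is_last else 0.0,       # sparse success reward
    }
```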
Upload to HuggingFace Hub
Push your dataset to HuggingFace Hub for versioning, easy sharing, and direct loading during training.
Use versioned dataset names (e.g. suffixes -v1, -v2). When you collect more data or fix quality issues, push as a new version. This makes it easy to reproduce training runs and track improvements.
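A minimal upload sketch using `huggingface_hub` (the repo id is a hypothetical placeholder; run `huggingface-cli login` first):

```python
def push_dataset(local_dir, repo_id, version="v1"):
    """Upload a dataset folder to HuggingFace Hub as a dataset repo.

    repo_id like "your-username/pick-place-rlds-v1" (hypothetical name);
    version the repo name itself, as suggested above, to keep runs reproducible.
    """
    from huggingface_hub import HfApi  # requires prior `huggingface-cli login`
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id,
                      repo_type="dataset", commit_message=f"dataset {version}")
```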
Run Quality Checks
Before training, run a final set of quality checks to catch issues that could waste GPU hours.
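A sketch of two such checks, consistent frame resolution (the shape-mismatch failure mode from the troubleshooting section) and overall success rate; thresholds are illustrative:

```python
import numpy as np

def check_dataset(episodes, expected_hw=(480, 640), min_success=0.9):
    """Pre-training sanity checks; returns a list of problem descriptions."""
    problems = []
    for i, ep in enumerate(episodes):
        for frame in ep["frames"]:
            if frame.shape[:2] != expected_hw:   # mixed resolutions break RLDS
                problems.append(f"episode {i}: frame shape {frame.shape[:2]}")
                break
    success = np.mean([ep["success"] for ep in episodes])
    if success < min_success:
        problems.append(f"success rate {success:.0%} below {min_success:.0%}")
    return problems
```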
Prepare for VLA Fine-Tuning
Structure your dataset for training by creating proper splits and task descriptions.
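One way to sketch the split step: split at the episode level, not the timestep level, so frames from one demonstration never leak across splits (the 10% validation fraction is illustrative):

```python
import random

def make_splits(episode_ids, val_frac=0.1, seed=0):
    """Deterministic episode-level train/val split."""
    ids = sorted(episode_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_val = max(1, int(len(ids) * val_frac))  # always hold out at least one
    return {"val": ids[:n_val], "train": ids[n_val:]}
```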
Ready to fine-tune a VLA model?
Your dataset is prepared. Continue to the VLA fine-tuning guide to train OpenVLA or pi0 on your data. Expect to invest $150–400 in GPU compute for a typical fine-tuning run.
Data Collection Cost Breakdown
| Cost Component | Per Hour | Notes |
|---|---|---|
| Operator labor | $50–80 | Skilled teleoperator, varies by market |
| Hardware depreciation | $15–30 | Amortized over 2,000 hrs of use |
| Quality review | $20–40 | Episode filtering and scoring |
| Compute and storage | $8–15 | Recording, processing, storage |
| Facility overhead | $25–35 | Lab space, lighting, safety |
| Total | $118–200 | |
Need help collecting data?
SVRC Data Services offers managed robot data collection with professional teleoperators, quality assurance, and RLDS-ready delivery. Starting at $150/hr.
Troubleshooting
Camera feed is black or frozen
Check the camera index with `v4l2-ctl --list-devices`. Try unplugging and replugging the USB cable. If using RealSense, verify with `realsense-viewer`. USB 3.0 ports are required for RealSense cameras.
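As a quick first check from Python, you can list the V4L2 device nodes the kernel has enumerated; if your camera is missing here, it is a cabling or driver problem, not a LeRobot config problem:

```python
import glob

# Linux-only: each attached UVC camera typically exposes /dev/videoN nodes.
devices = sorted(glob.glob("/dev/video*"))
print(devices or "no cameras enumerated")
```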
Robot arm not responding during recording
Verify the serial port with `ls /dev/ttyUSB*`. Check permissions: `sudo chmod 666 /dev/ttyUSB0`. Make sure no other process is using the port (close RViz, other terminals).
Frame drops during recording (inconsistent fps)
Lower camera resolution to 480x360. Close background applications. Use a dedicated USB bus for each camera (check with `lsusb -t`). Record to an SSD, not an HDD.
RLDS conversion fails with shape mismatch
This usually means episodes have different observation dimensions. Check that all cameras used the same resolution across all episodes. Re-record any episodes with mismatched dimensions.
HuggingFace upload times out
Large datasets (50+ GB) can time out on slow connections. Use `huggingface-cli upload` with `--revision` to upload in parts, or compress videos before upload. Consider uploading from a cloud VM with faster bandwidth.
Frequently Asked Questions
How many episodes do I need?
For most single-task manipulation policies (pick and place, pouring, etc.), 50–200 high-quality episodes are enough for ACT or Diffusion Policy. For VLA fine-tuning (OpenVLA, pi0), plan for 300–1,200 episodes depending on task complexity and required generalization.
How much does robot data collection cost?
The fully-loaded cost of robot teleoperation data collection ranges from $118–200 per hour in 2026, depending on operator skill level and hardware. This includes operator time, hardware depreciation, compute, and quality review. SVRC Data Services offers managed collection starting at $150/hr.
What cameras do I need?
At minimum, use one wrist-mounted RGB camera (640x480, 30fps) and one external third-person view camera. For best results, add an RGBD camera like the Intel RealSense D435 for depth data. Many VLA models expect both wrist and external camera views.
What is RLDS format?
RLDS (Reinforcement Learning Datasets) is a standardized format from Google DeepMind for storing robot learning data. It stores episodes as sequences of observations, actions, and rewards in TFRecord files. Most VLA models (RT-2, OpenVLA, Octo) expect RLDS format, making it the de facto standard for interoperability.
How do I measure data quality?
Track three quality metrics: (1) Task success rate — only keep episodes where the task was completed successfully. (2) Trajectory smoothness — filter out jerky or erratic demonstrations. (3) Diversity — vary object positions, orientations, and lighting across episodes. Aim for 90%+ success rate in your final dataset.
Was this tutorial helpful?