What Is Teleoperation?
Robot teleoperation means a human operator controls a robot from a remote interface to accomplish physical tasks. In robot learning, teleoperation serves two purposes: (1) data collection — generating high-quality demonstration episodes for training learned policies, and (2) live remote operation — performing tasks in environments that are hazardous or inaccessible to humans.
The quality of teleoperation data is the single biggest determinant of downstream policy performance. Noisy, jerky, or inconsistent demonstrations produce policies that are noisy, jerky, and inconsistent. A well-engineered teleoperation system is not optional — it is foundational.
System Components
A complete teleoperation system has five subsystems that must work together reliably.
- Operator interface: The device the human uses to command the robot. Options include a VR headset + hand tracking (intuitive but expensive), a leader arm (a lightweight replica of the robot arm that the operator physically moves), or a 6-DOF spacemouse / joystick (simple but limited in expressiveness). Leader arms are the gold standard for dexterous manipulation data collection.
- Video stream: One or more camera feeds from the robot environment transmitted to the operator. Typical setup: one wrist-mounted camera (object-centric view) + one external overview camera. The operator's display must show the feeds with minimal latency.
- Command channel: Transmits the operator's motion commands (joint angles or Cartesian poses) to the robot controller at high frequency (100–1000 Hz). Must be reliable and low-latency — dropped packets cause jerky motion that contaminates training data.
- Safety subsystem: Emergency stop button accessible to operator and local safety observer; software watchdog that stops the robot if the command stream is interrupted for >100 ms; collision detection in the robot controller.
- Data logger: Records all streams synchronously for training: camera images, joint states, commanded actions, force/torque readings, and episode metadata. Synchronization to within 10 ms is sufficient for most policies; sub-millisecond sync requires hardware triggers.
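The software watchdog in the safety subsystem can be sketched as a small monitor that tracks when the last command arrived and trips a stop callback once the stream has been silent past the threshold. A minimal sketch, assuming a hypothetical `stop_callback` (e.g. the robot controller's e-stop hook):

```python
import threading
import time

WATCHDOG_TIMEOUT_S = 0.100  # stop if the command stream is silent >100 ms

class CommandWatchdog:
    """Trips an emergency stop if the command stream is interrupted."""

    def __init__(self, stop_callback, timeout_s=WATCHDOG_TIMEOUT_S):
        self._stop = stop_callback  # hypothetical e-stop hook on the controller
        self._timeout = timeout_s
        self._last_cmd = time.monotonic()
        self._lock = threading.Lock()
        self._tripped = False

    def feed(self):
        """Call on every command packet received from the operator."""
        with self._lock:
            self._last_cmd = time.monotonic()

    def check(self):
        """Call periodically (e.g. in the 1 kHz control loop); trips once."""
        with self._lock:
            silent = time.monotonic() - self._last_cmd
        if silent > self._timeout and not self._tripped:
            self._tripped = True
            self._stop()
        return self._tripped
```

In practice `check()` would run on the robot-side host, so a network dropout between operator and robot still halts motion.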
Leader-Follower Kinematics
The leader-follower architecture is the most common setup for high-quality data collection. The operator physically moves a leader arm (gravity-compensated so it feels light); the follower arm (the real robot) mirrors those motions in real time.
The control loop runs as follows:
- Step 1 — Read leader joints: Joint encoders on the leader arm report angles at 1000 Hz.
- Step 2 — Forward kinematics: Compute the Cartesian pose of the leader end-effector from joint angles using the kinematic model.
- Step 3 — Inverse kinematics: Solve for the follower joint angles that achieve the same end-effector pose in the follower's coordinate frame.
- Step 4 — Send to follower: Command the follower arm's joint position controller. Latency from step 1 to step 4 must be <5 ms for smooth following.
- Step 5 — Log: Record leader joints, follower joints, commanded pose, and all sensor data to disk.
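One iteration of the loop above can be sketched as a single function. The `leader`, `follower`, `kin`, and `logger` objects are hypothetical interfaces standing in for whatever driver and kinematics library a given system uses:

```python
import time

def teleop_step(leader, follower, logger, kin):
    """One iteration of the 1 kHz leader-follower loop (steps 1-5).

    Assumed (hypothetical) interfaces:
    - leader.read_joints()  -> joint angles from the leader's encoders
    - kin.forward(q)        -> end-effector pose (forward kinematics)
    - kin.inverse(pose, seed=q) -> joint solution near the seed configuration
    - follower.command_joints(q) -> joint position controller setpoint
    """
    q_leader = leader.read_joints()                  # Step 1: read encoders
    ee_pose = kin.forward(q_leader)                  # Step 2: forward kinematics
    q_follower = kin.inverse(ee_pose,                # Step 3: inverse kinematics,
                             seed=follower.read_joints())  # seeded for continuity
    follower.command_joints(q_follower)              # Step 4: command the follower
    logger.log(time.monotonic(), q_leader,           # Step 5: log synchronously
               q_follower, ee_pose)
    return q_follower
```

Seeding the IK solve with the follower's current configuration avoids joint-space jumps between redundant solutions, which would appear as discontinuities in the logged actions.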
Gravity compensation on the leader arm is critical: without it, the operator must hold the arm up against gravity, causing fatigue and tense, unnatural motions. Compensation applies a torque at each joint equal and opposite to gravitational load, making the arm feel nearly weightless.
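For intuition, the compensating torques have a closed form for simple kinematic chains. A sketch for a planar two-link arm — an illustrative model, with placeholder masses and link lengths, not any specific leader arm:

```python
from math import cos

G = 9.81  # gravitational acceleration, m/s^2

def gravity_comp_torques(q1, q2, m1=1.0, m2=0.8,
                         l1=0.3, lc1=0.15, lc2=0.12):
    """Compensating joint torques for a planar 2-link arm.

    Angles are measured from horizontal; m* (kg) and l*/lc* (m) are
    placeholder link masses, lengths, and center-of-mass offsets.
    Each torque exactly cancels the gravitational load at that joint,
    so the arm feels nearly weightless to the operator.
    """
    # Joint 2 only carries link 2's weight
    tau2 = m2 * G * lc2 * cos(q1 + q2)
    # Joint 1 carries link 1's weight plus all of link 2
    tau1 = (m1 * G * lc1 * cos(q1)
            + m2 * G * (l1 * cos(q1) + lc2 * cos(q1 + q2)))
    return tau1, tau2
```

With the arm pointing straight up both torques vanish, as expected; real leader arms compute the same quantity from a full rigid-body model of all links.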
Video Streaming Requirements
Latency is the enemy of good teleoperation. When the operator sees the robot's actions with a 500 ms delay, they overcorrect and create oscillatory, unstable motions. Target latency for comfortable teleoperation is under 150 ms glass-to-glass (camera sensor to operator display).
- Resolution: 720p (1280×720) minimum for manipulation tasks; 1080p preferred when detecting small objects.
- Frame rate: 30 fps minimum; 60 fps significantly improves operator performance on fast tasks.
- Protocol: WebRTC for remote operation over the internet — handles NAT traversal and adapts to network conditions. For local lab use, raw UDP or GStreamer over LAN achieves lower latency.
- Compression: H.264 is standard; H.265 offers better quality at same bitrate but higher encoder latency. For data recording (not streaming), save lossless or high-quality JPEG.
Latency Budget Breakdown
| Pipeline Stage | Typical Latency | Optimization Lever |
|---|---|---|
| Camera sensor exposure + readout | 10–30 ms | Higher frame rate, global shutter |
| USB/GigE transfer to host | 1–5 ms | GigE preferred over USB3 for determinism |
| Encoding (H.264 hardware) | 5–20 ms | Hardware encoder (NVENC, V4L2) |
| Network transmission (LAN) | 1–2 ms | Wired ethernet, QoS |
| Network transmission (WAN) | 20–200 ms | CDN edge nodes, WebRTC ICE |
| Decoding on display host | 5–15 ms | Hardware decoder |
| Display refresh latency | 8–16 ms | 120 Hz display |
| Total (LAN target) | 30–90 ms | — |
| Total (WAN target) | 50–300 ms | — |
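The totals in the table can be sanity-checked by summing the per-stage ranges. The numbers below are copied from the LAN rows above:

```python
# (low_ms, high_ms) per pipeline stage, copied from the latency table
LAN_STAGES = {
    "sensor_exposure_readout": (10, 30),
    "host_transfer": (1, 5),
    "encode": (5, 20),
    "network_lan": (1, 2),
    "decode": (5, 15),
    "display_refresh": (8, 16),
}

def total_latency(stages):
    """Sum per-stage (low, high) ranges into a glass-to-glass range."""
    lo = sum(l for l, _ in stages.values())
    hi = sum(h for _, h in stages.values())
    return lo, hi
```

Summing gives 30–88 ms, consistent with the 30–90 ms LAN target row; swapping in the WAN network range reproduces the WAN total the same way.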
Data Recording Format
Each teleoperation session produces a set of episodes. An episode is one continuous attempt at the task from start to finish.
The standard recording format uses HDF5 files — one file per episode — with the following structure:
- Images: Stored as compressed JPEG arrays at 30 Hz. A 2-minute episode with 3 cameras at 30 Hz is 10,800 frames; stored as raw 720p RGB that would be roughly 30 GB, which JPEG compression reduces by well over an order of magnitude.
- Joint states: Logged at 100 Hz: position, velocity, and effort for all joints.
- Actions: The commanded joint positions at 100 Hz (for training, this is the target output).
- Language instruction: A natural-language description of the task, e.g. "pick up the red block and place it in the bin."
- Metadata: Episode ID, operator ID, robot ID, task name, start timestamp, success label, environment conditions (lighting, table height).
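A minimal episode writer following the layout above can be sketched with h5py. The dataset names and the exact group structure here are illustrative choices, not a fixed standard:

```python
import h5py
import numpy as np

def write_episode(path, images, joint_states, actions, instruction, meta):
    """Write one teleoperation episode to an HDF5 file (one file per episode).

    - images: dict of camera name -> (T, H, W, 3) uint8 frame array
    - joint_states: (T, n_joints, 3) position/velocity/effort at 100 Hz
    - actions: (T, n_joints) commanded joint positions at 100 Hz
    - instruction: language instruction string
    - meta: dict of episode metadata (episode_id, operator_id, success, ...)
    """
    with h5py.File(path, "w") as f:
        img_grp = f.create_group("images")
        for cam, frames in images.items():
            img_grp.create_dataset(cam, data=frames, compression="gzip")
        f.create_dataset("joint_states", data=joint_states)
        f.create_dataset("actions", data=actions)
        f.create_dataset("language_instruction", data=instruction)
        for key, value in meta.items():
            f.attrs[key] = value  # scalar metadata lives in file attributes
```

Storing metadata as HDF5 attributes keeps it queryable without loading the (much larger) image datasets.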
Quality Metrics
Not all teleoperation episodes are equal. Track these metrics to maintain data quality across operators and sessions:
- Task success rate per operator: Some operators achieve 80%+ success; others fall below 50%. Train new operators on practice runs first, and only include an operator's data once they sustain a success rate above 70%.
- Trajectory smoothness score: Mean absolute jerk (third derivative of position) across the episode. High jerk indicates hesitation, corrections, or controller instability — all artifacts the policy should not learn.
- Inter-operator consistency: KL divergence between action distributions of different operators on the same task. High divergence means operators are using different strategies — consider standardizing the approach.
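The smoothness score above can be computed with three finite differences. A sketch, assuming trajectories sampled at a fixed rate:

```python
import numpy as np

def mean_abs_jerk(positions, dt):
    """Mean absolute jerk of a joint trajectory.

    positions: (T, n_joints) array sampled every dt seconds.
    np.diff with n=3 approximates the third derivative of position;
    lower scores indicate smoother, more learnable demonstrations.
    """
    jerk = np.diff(positions, n=3, axis=0) / dt ** 3
    return float(np.mean(np.abs(jerk)))
```

Because jerk scales with sampling rate cubed, compare scores only across episodes logged at the same frequency.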