The Teleoperation Data Quality Checklist Every Lab Needs

Teleoperation data is the feedstock for modern robot learning. It is also the single most common place where downstream models silently fail. This is the checklist we run every lab through before they ship a policy.

Published 2026-03-27 by the Silicon Valley Robotics Center research team.

TL;DR. Policy quality is capped by data quality. Data quality is mostly determined by six things: calibration, latency, action smoothing, episode boundary discipline, failure-recovery labeling, and interface choice. Most labs fail on two or three of these and do not know it. Run this checklist every time you start a new task or a new data-collection campaign. Fix issues at the source, not in post-processing.

1. Calibration: the prerequisite

Calibration errors contaminate every downstream learning signal. If your camera extrinsics drift by 2 cm between Tuesday and Thursday, policies trained on the joint dataset will learn a confused object-pose distribution. Our robot camera setup guide covers the full calibration stack.

Calibration checklist

  • Run an extrinsic calibration at the start of each data-collection session.
  • Re-calibrate after any physical bump to the robot or camera mount.
  • Version-control calibration files alongside the recorded episodes.
  • Log calibration residuals; reject sessions where residuals exceed a threshold you have pre-agreed.
  • For bimanual rigs, calibrate both arms against a shared world frame.
  • Calibrate the gripper zero-position after any finger swap.
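
Two of the bullets above — residual gating and version-controlling calibration files — can be mechanized. A minimal sketch, assuming per-point reprojection residuals in pixels; the 0.5 px threshold, function names, and fingerprint scheme are illustrative, not prescribed by the checklist:

```python
import hashlib
import json
import statistics

# Illustrative pre-agreed threshold; set yours per rig and per task.
RESIDUAL_THRESHOLD_PX = 0.5

def session_passes(residuals_px):
    """Return (passed, mean_residual) for a session's reprojection residuals."""
    mean = statistics.mean(residuals_px)
    return mean <= RESIDUAL_THRESHOLD_PX, mean

def calibration_fingerprint(calib):
    """Stable short hash of a calibration dict, stored alongside each episode
    so every recording is traceable to the exact calibration it was made under."""
    blob = json.dumps(calib, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

ok, mean = session_passes([0.31, 0.28, 0.44, 0.37])
calib = {"camera": "wrist_left", "extrinsics": [[1, 0, 0, 0.1], [0, 1, 0, 0.0]]}
calib_id = calibration_fingerprint(calib)
```

The fingerprint, not the filename, is what you attach to episodes: filenames get overwritten, hashes do not.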

2. Latency: the budget you have to set before you start

Teleoperation latency is a first-class data quality variable. Above ~150 ms round-trip, human operators begin to compensate by slowing down and overshooting, which poisons the action distribution. Above ~300 ms, operators cannot reliably perform contact-rich tasks at all. We discuss the networking side in handling internet delay in teleoperation.

Latency checklist

  • Measure end-to-end latency (controller input to robot motion) before every session.
  • Decompose: sensor-to-network, network-to-operator, operator-to-network, network-to-actuator.
  • Set and enforce a session-level latency budget (we recommend <80 ms for dexterous tasks, <150 ms for tabletop).
  • Drop sessions that exceed the budget rather than silently including them.
  • Record timestamped perf counters alongside the demonstration data.
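
The budget-enforcement bullets reduce to a few lines of code. A sketch assuming input and motion timestamps share one monotonic clock in seconds; the function names and the choice to gate on the median are ours, while the 80 ms dexterous budget comes from the checklist:

```python
from statistics import median

DEXTEROUS_BUDGET_S = 0.080  # session-level budget from the checklist above

def round_trip_latency(input_ts, motion_ts):
    """Per-sample latency from controller input to observed robot motion."""
    return [m - i for i, m in zip(input_ts, motion_ts)]

def session_within_budget(latencies, budget=DEXTEROUS_BUDGET_S):
    """Gate on the median so a handful of outliers cannot hide a bad session."""
    return median(latencies) <= budget

lat = round_trip_latency([0.000, 0.010, 0.020], [0.065, 0.078, 0.090])
accept = session_within_budget(lat)
```

Gating on the median rather than the mean keeps a single network hiccup from rejecting an otherwise-clean session; log the full distribution either way.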

3. Action smoothing: the silent corruptor

Raw teleoperation actions are noisy. Operators jitter, controllers have sensor noise, network jitter adds more. If you train a policy on noisy raw actions, it will learn to imitate the noise. If you smooth too aggressively, you will lose the contact micro-adjustments that make skills work.

The default we recommend is a low-pass filter with a cutoff in the 10-20 Hz range, applied to velocities at record time rather than at train time. Record both the raw and smoothed streams so that downstream users can make their own call. Action chunking architectures (see our ACT policy explainer) tolerate smoothing better than frame-by-frame BC.

Action smoothing checklist

  • Record both raw and smoothed action streams.
  • Use a consistent filter across all sessions for one dataset.
  • Log the filter parameters in dataset metadata.
  • Inspect action power spectra; spikes at the operator's input frequency are a red flag.
  • Never apply smoothing on the deploy side without matching it on the record side (or vice versa).
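
To make the record-time recommendation concrete, here is a one-pole low-pass filter sketch. The implementation is a standard exponential smoother, not a specific filter the text mandates; the cutoff default sits in the 10-20 Hz range discussed above, and the raw stream is kept unchanged alongside the output:

```python
import math

def lowpass(raw, fs_hz, cutoff_hz=15.0):
    """One-pole low-pass filter over a velocity stream.

    raw:       list of velocity samples (keep this stream too!)
    fs_hz:     sample rate of the stream
    cutoff_hz: -3 dB cutoff; log this value in dataset metadata
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / fs_hz
    alpha = dt / (rc + dt)
    out = [raw[0]]
    for x in raw[1:]:
        out.append(out[-1] + alpha * (x - out[-1]))
    return out
```

A constant input passes through unchanged, while high-frequency jitter is attenuated; whatever filter you pick, the same parameters must be logged and reused on the deploy side, per the bullet above.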

4. Episode boundaries: the most common labeling mistake

An episode is supposed to be a single coherent attempt at a task, start-to-finish, success-or-failure. Teams routinely let episodes bleed together — an operator stops for a sip of coffee, a supervisor pauses a session to answer a question, two task attempts get joined because nobody hit the "end" button. Every one of these corrupts the episode boundary and makes downstream success-rate metrics unreliable.

The fix is operational, not technical: make the episode-start and episode-end affordances physical (a foot pedal, a big button), require an explicit "success" or "failure" classification at episode end, and review a random 5% of episode boundaries every week.

Episode boundary checklist

  • Physical start/end signal (pedal, button) with a visible indicator.
  • Mandatory success/failure classification at episode end.
  • Per-episode unique ID, attached to raw video for review.
  • Audit protocol: review 5% of randomly sampled boundaries weekly.
  • Reject or flag episodes whose length falls outside an expected distribution.
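
The length-distribution bullet is easy to automate. A sketch using a median-relative band; the 0.5x/2x bounds are illustrative and should be pre-agreed per task, like the residual threshold in section 1:

```python
from statistics import median

def flag_length_outliers(durations_s, low=0.5, high=2.0):
    """Flag episodes whose duration falls outside [low * median, high * median].

    Flagged episodes go to human review, not straight to deletion:
    a too-short episode often means a missed end-button press, and a
    too-long one often means two attempts were joined.
    """
    m = median(durations_s)
    return [not (low * m <= d <= high * m) for d in durations_s]

flags = flag_length_outliers([10.0, 11.0, 9.0, 30.0])
```

Using the median keeps the band stable even when a few corrupted episodes are present in the batch being checked.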

5. Failure recovery labeling

Operators will make mistakes and recover. Those recoveries are some of the most valuable data you can collect — they teach the policy how to get out of bad states. But only if they are labeled. An unlabeled recovery looks to the policy like a bizarre demonstration of "drop the object, then pick it up again," which it will dutifully imitate.

Label three distinct states on every episode: nominal progress, recovery (operator is correcting a mistake), and abort (the episode is being terminated early). Policies trained on this labeling can be conditioned to behave differently in each regime, and evaluators can segment success rates by how many recoveries were required. The common-mistakes catalog is in common mistakes in robot imitation learning.

Failure recovery checklist

  • Define "recovery" explicitly in your operator handbook with examples.
  • Require operators to flag recoveries live (button or voice) rather than in post-processing.
  • Track recovery rate per operator; retrain operators whose rate is an outlier in either direction.
  • Decide whether recoveries are kept in or excluded from each dataset version; document the decision.

6. Interface choice: kinesthetic vs leader-follower vs VR vs glove

The interface choice shapes your data more than almost any other decision. The four dominant options each have a signature:

Kinesthetic teaching

Operator moves the robot's arm by hand in gravity-compensated mode. Best data quality for contact-rich tasks because the operator feels the contact directly. Low throughput and poor ergonomics for long sessions.

Leader-follower (ALOHA-style)

Operator moves a kinematically similar leader arm; the follower mirrors it. Excellent ergonomics, high throughput, well-suited to bimanual. Our ALOHA guide covers the pattern. Tradeoff: limited workspace, sensitive to calibration drift between leader and follower.

VR/AR teleoperation

Operator uses a VR headset and hand controllers. Excellent for long-reach and mobile platforms; intuitive for new operators. Tradeoff: latency is harder to keep in budget, and hand-tracking fidelity is variable. See VR teleoperation companies compared.

Data gloves / hand tracking

Operator's hand motion is captured directly and mapped to a dexterous gripper or hand. Best for dexterous in-hand manipulation; weakest for gross arm motion. Still the most hardware-diverse category of the four.

Interface checklist

  • Choose the interface that matches your task's dominant motion (gross arm vs in-hand dexterous).
  • Stick with one interface per dataset version; mixing interfaces introduces domain shift.
  • Log the interface type, firmware version, and calibration in every episode.
  • Train operators specifically for the chosen interface; on the margin, operator skill variance matters more than interface choice.

We cover operator ergonomics in teleoperation fatigue and ergonomics, and the getting-started guide is here.

7. Operator hygiene

Operator variance is larger than most researchers realize. A dataset collected by five operators has more internal variance than one collected by a single operator over a longer timeline. The sweet spot is usually 2-4 operators per task, rotated weekly, with shared training and a weekly review session. Tactics that help:

  • A written task spec with photos of success and failure states.
  • A "warm-up" set of 10 demonstrations at the start of every session, excluded from the training set.
  • A shared review video every week showing three exemplary and three failed demonstrations.
  • Operator-level metadata in every episode; per-operator success-rate tracking.

8. Data pipeline hygiene

Data quality is a pipeline property, not a session property. Teams that succeed here treat the pipeline as production software.

  • Immutable raw. Never overwrite raw recordings. Derived datasets are built downstream.
  • Version everything. Dataset version, calibration version, filter version, annotation version.
  • Automated validation. Every ingested episode gets a pass/fail check on duration, action range, camera drops, etc.
  • Random sample review. Humans review 1-5% of episodes every week.
  • Deletion discipline. If you decide an episode is bad, mark it, do not delete it. Lineage matters.
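
The automated-validation bullet can be sketched as a single pass/fail function per ingested episode. The field names and thresholds here are illustrative; the point is that every check is named, so a failure report says which check tripped:

```python
def validate_episode(ep):
    """ep: dict with 'duration_s', 'actions' (normalized to [-1, 1]),
    and 'camera_drops'. Returns (passed, per-check results).
    Thresholds are examples; pre-agree yours per task."""
    checks = {
        "duration": 2.0 <= ep["duration_s"] <= 120.0,
        "action_range": all(abs(a) <= 1.0 for a in ep["actions"]),
        "camera_drops": ep["camera_drops"] <= 3,
    }
    return all(checks.values()), checks

passed, report = validate_episode(
    {"duration_s": 14.2, "actions": [0.1, -0.5, 0.9], "camera_drops": 0}
)
```

Per the deletion-discipline bullet, a failed episode is marked with its failing checks and kept, never deleted.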

Our dataset catalog and data services offering are built around this pipeline — useful if you are evaluating build vs outsource.

9. Common pitfalls

  • Mixing interface types in one dataset. Bad domain shift, invisible to loss curves.
  • Silent clock drift. Synchronize cameras and actuators against a single monotonic time source.
  • Uncurated "more data is better" instinct. 500 clean episodes beat 2000 noisy ones.
  • No hold-out. Reserve a test set of operators / scenes / object sets your policy never sees during training.
  • Ignoring the edge cases. Failure recovery data and off-nominal object poses are the most valuable slices.
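
The no-hold-out pitfall is also mechanizable: split by operator (or scene, or object set) rather than by episode, so the test set contains conditions the policy never saw. A sketch assuming episodes carry an `"operator"` field; the field name and split logic are ours:

```python
def operator_holdout(episodes, test_operators):
    """Split episodes so held-out operators never appear in training.
    Swap the key for "scene" or "object_set" to hold out those instead."""
    test_operators = set(test_operators)
    train = [e for e in episodes if e["operator"] not in test_operators]
    test = [e for e in episodes if e["operator"] in test_operators]
    return train, test

train, test = operator_holdout(
    [{"operator": "op_1"}, {"operator": "op_2"}, {"operator": "op_3"}],
    test_operators=["op_3"],
)
```

A random per-episode split leaks operator style into the test set and inflates apparent generalization; this split does not.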

10. Closing note

Every hour spent tightening teleoperation data quality saves roughly ten hours of downstream debugging. The upstream fix is almost always cheaper than the downstream rescue. Treat data collection as a production discipline, not a chore handed to the newest lab member.

If you want a second set of eyes on your pipeline, our data services team audits teleop operations as part of our standard engagements. Hardware, leasing, comparisons, buyer guides, and tutorials round out the references for teams building this capability from scratch. Also worth reading: why teleop beats sim data, scaling data collection teams, and what makes good training data.

Print this. This checklist is meant to live on the wall of your data collection room. Run it every new task, every new operator, every new hardware refresh.

11. Frequently asked questions

How many episodes do we actually need for a working policy?

For a narrow, well-defined single-arm task with clean data, 200-400 episodes is typically enough to fine-tune Octo or OpenVLA to research-demo quality. Bimanual tasks typically need 500-1000 episodes of comparable quality. Above that threshold the marginal return from more episodes shrinks rapidly unless the task distribution itself is broadening. We explore the compute side in scaling VLA training on a budget.

Should we collect data in the deployment environment or in the lab?

Both. Collect the majority of data in the lab, where you control lighting, calibration, and camera placement. Collect a smaller, deliberately varied corpus in the deployment environment — this is where you discover the covariate shift that your lab-only data would have hidden. A 90/10 lab-to-field ratio is a reasonable starting split.

Is VR teleoperation good enough for production data?

For gross manipulation and mobile tasks, yes, with attention to latency and hand-tracking calibration. For dexterous in-hand tasks and contact-rich assembly, leader-follower or glove-based capture is still the higher-quality option in 2026.

How much should a teleop operator be paid?

Market rates in the United States are in the $25-45/hour range for unskilled teleop, $50-90/hour for trained dexterous operators. Cheap labor often produces expensive debugging; we strongly recommend paying at the upper end of the range and investing in training.

Can we use existing public datasets and skip our own collection?

For pre-training, yes — that is what Open X-Embodiment is for. For the final fine-tune on your specific embodiment and task, there is no shortcut. Your policy's behavior will reflect the data it was fine-tuned on.