Why Annotation Matters

A robot teleoperation session produces a trajectory: a time-indexed sequence of joint states, camera frames, and end-effector poses. This is data in the physical sense, but it is not yet training data in the sense that most policy learning algorithms require. Algorithms like ACT, diffusion policy, and behavior cloning need to know which episodes succeeded (to train on) and which failed (to exclude or weight down). Foundation model fine-tuning requires language instructions that describe what the robot was doing. Phase-conditioned policies require segment boundaries. Contact-learning algorithms require timestamps of first contact, stable grasp, and release events.

Annotation is the process of adding this structured information. It is time-consuming, requires human judgment, and is easy to do wrong in ways that degrade policy quality. A dataset with inconsistent success labels will produce a policy that has learned to imitate some failures. A dataset with vague language instructions will produce a policy that cannot generalize to new instructions at inference time.

The Five Annotation Types

  • 1. Task Success / Failure: Binary pass/fail is the minimum. For training purposes, binary is often sufficient, but a 0–100 partial credit score adds information: a grasp that succeeded but was slow and awkward is different from one that was fast and clean. Partial credit scores of 0–30 (clear failure), 31–69 (partial/marginal), 70–100 (success with quality gradient) provide better signal for curriculum learning and data weighting.
  • 2. Language Instruction: "Pick up the red cup" is better than "pick up the object." "Grasp the cylindrical red plastic cup from its body, not the rim, and place it upright on the white tray" is better still for fine-grained tasks. The instruction should specify the task-relevant object attributes (color, shape, material where relevant), the desired grasp strategy if specific, and the goal state. Avoid object-name-only instructions ("pick up the cup") for datasets intended for foundation model fine-tuning — these are too low-information.
  • 3. Segment Labels: Dividing each episode into phases (reach, grasp, transport, place) enables phase-conditioned policies and more targeted data analysis. Minimum four phases for standard pick-place tasks. Assembly tasks may require 8–12 segments. Segment boundaries should be marked at the video frame level, not just the action index.
  • 4. Contact Event Timestamps: For contact-learning tasks (insertion, assembly, surface following), precise timestamps for first contact (gripper touches object), stable grasp (contact force stabilizes above threshold), and release (gripper opens, object free) are essential for learning contact-conditioned behaviors. Manual annotation error on contact timestamps should be <5 frames (167ms at 30fps). Automated detection from F/T sensor data is preferable where available.
  • 5. Quality Scores: Beyond success/failure, per-episode quality metrics inform data weighting during training. Smoothness score (inverse of mean jerk over episode), success confidence (annotator's certainty about the success label), and difficulty estimate (for curriculum learning prioritization) are the three most useful.
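The smoothness score described above (inverse of mean jerk) can be computed directly from the logged position trace with finite differences. A minimal sketch, assuming positions are sampled at a fixed rate; the function name and the normalization into (0, 1] are illustrative choices, not a prescribed formula:

```python
import numpy as np

def smoothness_score(positions, dt):
    """Inverse-mean-jerk smoothness for one episode.

    positions: (T, D) array of joint or end-effector positions
    dt: sampling period in seconds (e.g. 1/30 for 30 fps logs)
    """
    # Third finite difference approximates jerk (d^3 x / dt^3).
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    mean_jerk = np.mean(np.linalg.norm(jerk, axis=1))
    # Map to (0, 1]: 1.0 for a jerk-free trajectory, toward 0 as jerk grows.
    return 1.0 / (1.0 + mean_jerk)
```

A constant-velocity trajectory has zero jerk and scores 1.0; sensor noise or jittery teleoperation pushes the score down.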
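Where F/T sensing is available, the contact event timestamps from item 4 can be extracted automatically by thresholding the normal-force trace. The sketch below is a generic illustration, not SVRC's detector: the thresholds and the consecutive-frame criterion for "stable grasp" are assumed values that would need tuning per gripper and task.

```python
import numpy as np

def detect_contact_events(force_z, contact_thresh=1.0, stable_thresh=5.0,
                          stable_frames=10):
    """Detect first-contact and stable-grasp frame indices from a
    normal-force trace (newtons). Thresholds here are illustrative.

    Returns (first_contact, stable_grasp); either may be None.
    """
    force_z = np.asarray(force_z, dtype=float)
    above = force_z > contact_thresh
    # First contact: first frame where force exceeds the contact threshold.
    first_contact = int(np.argmax(above)) if above.any() else None

    stable_grasp = None
    run = 0
    for i, f in enumerate(force_z):
        run = run + 1 if f > stable_thresh else 0
        # Stable grasp: force held above threshold for N consecutive frames.
        if run >= stable_frames:
            stable_grasp = i - stable_frames + 1
            break
    return first_contact, stable_grasp
```

The consecutive-frame requirement acts as a simple debounce so that transient force spikes during approach are not labeled as grasps.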

Annotation Methods Comparison

| Method | Accuracy | Cost | Throughput | Best For |
|---|---|---|---|---|
| Expert labeler (domain knowledge) | Highest (95%+) | High ($40–60/hr) | 20–40 episodes/hr | Ground truth, gold standard, contact events |
| Trained crowdsource (MT, Scale) | Medium (80–90%) | Medium ($5–15/episode) | 100–500/hr at scale | Success/failure, language, segment labels |
| Naive crowdsource (MTurk open) | Lower (70–80%) | Low ($1–5/episode) | High | Simple success/failure on clear tasks only |
| Trained classifier (CNN/LSTM) | Medium-high (88–93%) | Very low (compute only) | Thousands/hr | Success/failure at scale, auto-annotation |
| Active learning loop | High (improves with data) | Decreasing per label | High after warmup | Large datasets with expert budget constraint |

Inter-Annotator Agreement

Cohen's kappa is the standard measure of inter-annotator agreement; it corrects observed agreement for the agreement expected by chance. For robot annotation:

  • Kappa > 0.8 (strong agreement): The task definition and annotation protocol are clear enough that annotators apply them consistently. This is the target for production annotation pipelines. Most simple success/failure tasks reach this level with proper training.
  • Kappa 0.6–0.8 (moderate agreement): Requires attention. Hold reconciliation meetings where annotators discuss disagreed cases and update the protocol until the definition of edge cases is explicit. Do not ship a dataset with kappa in this range without reconciliation.
  • Kappa < 0.6 (poor agreement): The task definition is ambiguous. Stop annotation, redesign the protocol with clearer success criteria, and re-annotate from scratch. Training on data annotated with kappa < 0.6 produces policies with inconsistent behavior that is extremely difficult to debug.
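For two annotators, kappa is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the chance agreement implied by each annotator's marginal label frequencies. A minimal pure-Python version (equivalent to `sklearn.metrics.cohen_kappa_score` for two raters):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same episodes.

    kappa = (p_o - p_e) / (1 - p_e), undefined when p_e == 1
    (both annotators always emit one identical label).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of episodes labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields 1.0; annotators who agree no more often than their label frequencies predict yield 0.0, which is why raw percent agreement overstates reliability on imbalanced success/failure sets.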

Automated Annotation

Two automated annotation approaches are production-ready:

  • Success Classifier CNN: Fine-tuned ResNet-50 on the final 10 frames of each episode, binary success/failure output. SVRC's internal classifier achieves 92% accuracy on held-out test sets across standard manipulation tasks. Requires 100+ labeled examples per task to train reliably. Use for large datasets after human-labeled training set is established.
  • Segment Detector (HMM on joint velocity + F/T): Hidden Markov Model trained on joint velocity profiles and F/T readings to detect phase boundaries. Works without any visual processing, making it fast and robust to camera issues. Achieves approximately 85% segment boundary accuracy within ±3 frames on standard pick-place tasks.
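To make the HMM segment detector concrete, here is a minimal Viterbi decode over a left-to-right four-state HMM on a single 1-D feature (end-effector speed). This is not SVRC's detector: the Gaussian emission parameters are hand-set for illustration (a production version would fit them with EM and would also consume F/T channels), and the phase names mirror the standard pick-place segmentation.

```python
import numpy as np

PHASES = ["reach", "grasp", "transport", "place"]

def viterbi_segments(speed, means, stds, stay_prob=0.9):
    """Decode per-frame phase indices for a 1-D speed trace using a
    left-to-right HMM: each state may only stay or advance to the next.

    means/stds: per-phase Gaussian emission parameters (hand-set here).
    """
    speed = np.asarray(speed, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    T, K = len(speed), len(means)

    # Log-likelihood of each frame under each phase's Gaussian (up to a
    # constant shared by all states, which Viterbi comparisons ignore).
    ll = -0.5 * ((speed[:, None] - means) / stds) ** 2 - np.log(stds)
    log_stay, log_adv = np.log(stay_prob), np.log(1.0 - stay_prob)

    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0, 0] = ll[0, 0]  # episodes start in "reach"
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1, k] + log_stay
            adv = score[t - 1, k - 1] + log_adv if k > 0 else -np.inf
            back[t, k] = k if stay >= adv else k - 1
            score[t, k] = max(stay, adv) + ll[t, k]

    # Backtrace from the best final state to recover the phase path.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return np.array(path[::-1])
```

Segment boundaries are the frames where the decoded path increments; the left-to-right constraint is what lets two slow phases (grasp and place) share similar emissions without being confused.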

SVRC Annotation Pipeline

SVRC's data collection service includes annotation as standard. All collected episodes receive: binary success/failure label (automated + human review for borderline cases), language instruction label (protocol-defined per task), four-phase segment boundaries (reach/grasp/transport/place), and contact event timestamps where F/T sensors are present. Additional annotation types (partial credit scores, expanded segment sets, quality scores) are available as add-ons.

All annotation is accompanied by inter-annotator agreement metrics (kappa scores per annotation type) and documented reconciliation records for kappa < 0.8 cases. See our data services page for full annotation specifications.