Robot Data Annotation: How to Label Robot Demonstrations for Training

Annotation is the least glamorous part of robot learning and the most consequential. A dataset of 500 well-annotated demonstrations will train a better policy than 2,000 poorly labeled ones. Here is what annotation means for robot data and how to do it right.

What Annotation Means for Robot Data

Unlike image classification, where annotation means drawing boxes or clicking labels, robot demonstration annotation is richer and more structured. A single robot episode — typically 20–200 seconds of manipulation — needs labels at multiple levels: a success/failure flag, a language description of the task, boundaries between semantically distinct phases, and markers for any frames that should be excluded from training due to hardware errors or operator mistakes.
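These levels can be captured in a single per-episode record. The following is a minimal sketch of such a schema; the field names and `Segment` structure are illustrative assumptions, not a standard format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    # One semantically distinct phase: inclusive start frame, exclusive end frame.
    start_frame: int
    end_frame: int
    label: str  # e.g. "grasp cup"

@dataclass
class EpisodeAnnotation:
    episode_id: str
    success: bool                    # binary success flag
    task_name: str                   # short task name, e.g. "cup placement"
    instruction: str                 # natural language instruction
    segments: List[Segment] = field(default_factory=list)
    # Frame ranges to drop from training (hardware errors, operator mistakes).
    excluded_frames: List[Tuple[int, int]] = field(default_factory=list)

ann = EpisodeAnnotation(
    episode_id="ep_0042",
    success=True,
    task_name="cup placement",
    instruction="pick up the white cup and place it on the blue plate",
    segments=[Segment(0, 55, "reach cup"), Segment(55, 90, "grasp cup")],
)
```

Keeping all annotation levels in one record per episode makes the later quality checks (success consistency, segment coverage) straightforward to automate.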

Annotation is typically done by human reviewers watching video replays of recorded episodes alongside plots of joint states and gripper aperture. Good annotation tools display synchronized video from multiple cameras simultaneously, making it easy to judge success from perspectives the robot's own cameras might not capture clearly.

Success Flags: The Most Important Annotation

Every episode in a robot training dataset must be labeled with a binary success flag: did the robot complete the task successfully? This sounds simple, but success criteria must be defined precisely before annotation begins. "Place the cup on the plate" requires a specification: does the cup need to be upright, does the handle orientation matter, how much positional error is acceptable? Annotators applying different implicit standards to the same dataset create noisy labels that degrade training performance.

Write a one-page success specification document before annotation begins, with example images of success and failure cases. Use this document to calibrate annotators. Measure inter-annotator agreement on a shared subset of episodes — if agreement is below 90%, your success criteria need clarification. SVRC's annotation pipeline requires explicit success criteria documents and inter-annotator agreement checks before any dataset is marked ready for training.
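The agreement check above is easy to automate. A minimal sketch of simple percent agreement on a shared subset (the function name and the 90% threshold interpretation are ours; more robust statistics like Cohen's kappa also correct for chance agreement):

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of episodes where two annotators assigned the same success flag."""
    assert len(labels_a) == len(labels_b), "annotators must label the same subset"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two annotators label the same 10 episodes; they disagree on one.
annotator_1 = [True, True, False, True, False, True, True, True, True, True]
annotator_2 = [True, True, False, True, True,  True, True, True, True, True]

agreement = percent_agreement(annotator_1, annotator_2)  # 0.9
```

If agreement lands below your threshold, revisit the specification document and re-calibrate before labeling the full dataset.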

Language Labels

Language annotations attach natural language descriptions to episodes or episode segments. These are required for training language-conditioned policies — policies that follow instructions like "pick up the red block" rather than having the task hardcoded. Language annotations also enable compatibility with vision-language-action (VLA) models and allow datasets to be searched and filtered by task description.

Write language annotations at two levels of specificity: a short task name ("cup placement") and a natural language instruction ("pick up the white cup and place it on the blue plate"). The instruction should describe what a human observer sees happening, not the robot's internal state. If your task involves task variations — different objects, different target locations — each variation should have a corresponding instruction that distinguishes it from the others.
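One common way to keep per-variation instructions distinct is to generate them from the variation parameters. A minimal sketch, where the template and parameter names are hypothetical:

```python
def make_instruction(obj_color, obj_name, target_color, target_name):
    """Generate a distinguishing instruction for one task variation."""
    return (f"pick up the {obj_color} {obj_name} "
            f"and place it on the {target_color} {target_name}")

# Each (object, target) variation gets its own instruction.
variations = [
    ("white", "cup", "blue", "plate"),
    ("red", "cup", "blue", "plate"),
]
instructions = [make_instruction(*v) for v in variations]
```

Templated instructions guarantee that every variation is distinguishable in language, which is what a language-conditioned policy needs to tell them apart at inference time.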

Task Segmentation

For long-horizon tasks involving multiple sequential sub-tasks, segmentation labels mark the boundaries between phases. A table-setting task might be segmented into: reach cup, grasp cup, transport cup, place cup, release cup. Segmentation enables hierarchical policy training, sub-task-level success metrics, and selective data augmentation. It also enables surgical debugging: if a policy fails during transport but succeeds during grasping, segmentation labels let you measure sub-task success rates and target data collection effort where it is needed most.

Segmentation annotation is more expensive than success flagging and not always necessary. Prioritize segmentation for tasks with three or more semantically distinct phases, or when you plan to use a hierarchical policy architecture.
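With segment labels in place, the sub-task success rates mentioned above reduce to simple counting. A sketch, assuming each episode carries a list of (segment label, segment success) pairs — the input shape is an assumption for illustration:

```python
from collections import defaultdict

def subtask_success_rates(episodes):
    """Compute per-sub-task success rates from segmented episodes.

    episodes: list of dicts like
        {"segments": [("grasp cup", True), ("transport cup", False)]}
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [successes, total]
    for ep in episodes:
        for label, ok in ep["segments"]:
            counts[label][1] += 1
            if ok:
                counts[label][0] += 1
    return {label: s / t for label, (s, t) in counts.items()}

episodes = [
    {"segments": [("grasp cup", True), ("transport cup", True)]},
    {"segments": [("grasp cup", True), ("transport cup", False)]},
]
rates = subtask_success_rates(episodes)
# grasp succeeds in 2/2 episodes, transport in 1/2 -> collect more transport data
```

A table of these rates tells you which phase to target with additional demonstrations, rather than collecting more full episodes indiscriminately.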

Annotation Tools and Quality Standards

Common annotation tools for robot data include Label Studio (open source, supports video and time-series data), CVAT (computer vision annotation tool, good for bounding box overlays), and custom episode browsers built with Gradio or Streamlit. SVRC's data platform includes a built-in episode annotation interface accessible through the web app, supporting success flags, language labels, and frame-level exclusion marking.

Quality standards matter more than quantity. SVRC applies a three-stage quality gate to all datasets: operator self-annotation immediately after recording, secondary review by a trained annotator, and automated consistency checks comparing annotations against joint state statistics (e.g., episodes marked success where the gripper never closed are flagged for re-review).
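The gripper example above is the kind of consistency check that is cheap to automate. A minimal sketch — the aperture field, units, and threshold are illustrative assumptions, not SVRC's actual pipeline:

```python
def flag_inconsistent_episodes(episodes, close_threshold=0.2):
    """Flag success-labeled episodes where the gripper apparently never closed.

    Assumes each episode dict has a "gripper_aperture" time series in [0, 1],
    where values below close_threshold indicate a closed gripper.
    """
    flagged = []
    for ep in episodes:
        if ep["success"] and min(ep["gripper_aperture"]) > close_threshold:
            flagged.append(ep["episode_id"])  # send back for human re-review
    return flagged

episodes = [
    {"episode_id": "ep_001", "success": True,
     "gripper_aperture": [1.0, 0.5, 0.05, 0.05]},  # gripper closed: plausible
    {"episode_id": "ep_002", "success": True,
     "gripper_aperture": [1.0, 0.9, 0.8, 0.9]},    # never closed: suspicious
]
suspect = flag_inconsistent_episodes(episodes)
```

Checks like this do not replace human review; they route the suspicious minority of episodes to a second look instead of re-reviewing everything.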

SVRC's Annotation Pipeline

When you use SVRC's data collection services, annotation is part of the deliverable. Our operators annotate each episode with success flags and language labels during the recording session, and our annotation team performs secondary review before dataset export. You receive a dataset with high-confidence annotations, annotator agreement scores, and a full quality report. For teams bringing their own collected data, SVRC offers annotation-only services and can process existing datasets collected on any supported hardware platform. Contact us to discuss your dataset annotation needs.

Related: Data Services · LeRobot Guide · ACT Policy Explained · Robot Policy Generalization