How We Think About Real-World Evaluation

Robot learning teams rarely fail because they “don’t have enough data.”
They fail because the data they have is not learnable in the way imitation learning, offline RL, and large policy-training pipelines actually assume.

In other words: they have logs, not a dataset.

This problem is visible across the robotics tooling ecosystem. Episode-based dataset standards exist precisely because many robotics datasets lose the sequence of interaction, encode semantics inconsistently, or differ in small ways that introduce subtle, hard-to-debug failures.

A learning-ready dataset is therefore an engineered artifact with a clear contract:

  • Structure — explicit episodes, steps, and success/failure semantics

  • Timing and calibration — synchronized, geometrically interpretable multimodal streams

  • Action meaning — actions recorded in the same frames and control modes used at training and deployment

  • Coverage — diversity, failures, and recoveries that prevent offline learning collapse

  • QA and reproducibility — automated checks, documentation, and versioning so teams can iterate safely

This document is written for robotics startups and applied RL teams who need data they can actually train on. It also doubles as a buyer’s guide and SOW template for procuring learning-ready robot data.

What “learning-ready” means

When we say learning-ready, we mean:

A modeling team can plug the dataset into a modern training stack and obtain trustworthy training and evaluation signals without reverse-engineering the recording process.

Robotics data is sequential decision-making data. The atomic unit is not an image or annotation—it is an episode of interaction: observation → action → outcome, repeated over time.

Well-designed datasets preserve:

  • interaction order

  • step boundaries

  • termination semantics

  • task meaning

Learning-ready data is not just stored. It is packaged, validated, and documented so downstream users do not accidentally train on misaligned timestamps, ambiguous actions, or broken calibration assumptions.

Core pillars of learning-ready robot data

Below are the pillars used to evaluate whether a dataset will support imitation learning, offline RL, and large multi-task policy training.

1. Structure and episode semantics

Why it matters

Sequential learners consume trajectories. If episode boundaries, resets, or termination semantics are ambiguous, you cannot reliably compute returns, segment behavior, or evaluate policies consistently.

Common failure modes

  • Episodes defined inconsistently across operators

  • Implicit resets that are not logged

  • Success/failure semantics kept in notes instead of fields

Mitigation

  • Explicit episode contract: start condition, termination condition, success criteria

  • Enforce it operationally (in tooling), not post-hoc

Validation

  • Exactly one episode start and one termination per episode

  • Monotonic step indices

  • Reset feasibility checks (nothing stuck, robot can be reset)
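
A minimal sketch of these validation checks, assuming episodes are available as dicts of step records with illustrative is_first / is_last / step_index fields (names will differ per schema):

```python
# A hypothetical structural validator; field names (is_first, is_last,
# step_index, episode_id) are illustrative and should match your schema.

def validate_structure(episode: dict) -> list[str]:
    """Return a list of structural problems found in one episode."""
    steps = episode["steps"]
    problems = []
    if not steps:
        return ["episode has no steps"]

    # Exactly one episode start and exactly one termination marker.
    starts = sum(1 for s in steps if s.get("is_first"))
    ends = sum(1 for s in steps if s.get("is_last"))
    if starts != 1:
        problems.append(f"expected exactly 1 start marker, found {starts}")
    if ends != 1:
        problems.append(f"expected exactly 1 termination marker, found {ends}")

    # Step indices must be contiguous and strictly increasing.
    indices = [s["step_index"] for s in steps]
    if indices != list(range(indices[0], indices[0] + len(indices))):
        problems.append("step indices are not contiguous and monotonic")

    return problems
```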

2. Timing and multimodal synchronization

Why it matters

For robot learning, time alignment is supervision. Vision, proprioception, and actions must correspond to the same physical moment—or a precisely defined offset.

“Close enough” alignment often produces policies that train but plateau due to silent mispairing.

Common failure modes

  • Camera buffering lag relative to control

  • Mixed clocks across machines

  • Dropped frames that shift alignment

  • Missing camera feeds mid-episode

Mitigation

  • Timestamp streams at the source and at logging

  • Enforce stream-presence assertions (expected camera count per step)

  • Cross-modal sanity checks (e.g., commanded vs observed velocity alignment)
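
As one concrete form of timing check, the sketch below pairs each control timestamp with its nearest camera frame and reports the worst-case gap. It assumes all timestamps are in seconds on a shared clock; the names are hypothetical.

```python
# Illustrative timing check: for every control step, the nearest camera frame
# must lie within a tolerance. Assumes timestamps in seconds on a shared clock.
import numpy as np

def max_pairing_skew(control_ts: np.ndarray, camera_ts: np.ndarray) -> float:
    """Worst-case gap (seconds) between each control timestamp and its nearest frame."""
    camera_ts = np.sort(camera_ts)
    idx = np.clip(np.searchsorted(camera_ts, control_ts), 1, len(camera_ts) - 1)
    gap = np.minimum(np.abs(control_ts - camera_ts[idx - 1]),
                     np.abs(control_ts - camera_ts[idx]))
    return float(gap.max())

# Example gate: reject the episode if skew exceeds roughly half a control period.
# assert max_pairing_skew(control_ts, camera_ts) < 0.5 / control_rate_hz
```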

3. Calibration and geometry

Why it matters

If models depend on geometry—depth, multi-view, hand-eye transforms—calibration is not metadata. It is required to interpret the dataset meaningfully.

Common failure modes

  • Camera extrinsics drift between sessions

  • Calibration present but not tied to specific episodes

  • No way to assess calibration quality

Mitigation

  • Recalibrate after any camera movement

  • Store calibration per session or episode

  • Track reprojection error and quality metrics

Schema expectation

  • Camera intrinsics

  • Camera-to-robot extrinsics

  • Calibration version and quality metrics
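
A sketch of what this schema expectation can look like in practice, with illustrative field names and one calibration record per session:

```python
# Illustrative calibration record stored per session (or per episode if
# cameras can move between episodes). Field names are examples only.
from dataclasses import dataclass, field

@dataclass
class CameraCalibration:
    camera_id: str
    intrinsics: list[list[float]]               # 3x3 K matrix
    distortion: list[float]                     # lens distortion coefficients
    extrinsics_cam_to_robot: list[list[float]]  # 4x4 homogeneous transform
    calibration_version: str                    # e.g. "2025-06-01_board_v3"
    mean_reprojection_error_px: float           # quality metric from the calibration fit
    calibrated_at: str                          # ISO-8601 timestamp

@dataclass
class SessionCalibration:
    session_id: str
    cameras: list[CameraCalibration] = field(default_factory=list)
```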

4. Actions and control semantics

Why it matters

Actions are only useful as labels if they mean the same thing at training time and at deployment.

Two identically shaped vectors can represent completely different behaviors depending on:

  • control mode

  • coordinate frame

  • absolute vs delta semantics

Common failure modes

  • Logging commanded actions while execution is filtered or saturated

  • Mixing absolute and relative actions

  • Unspecified coordinate frames

Mitigation

  • Record control mode and frame explicitly

  • Record executed state alongside commands

  • Provide deterministic conversion utilities

  • Replay actions to sanity-check behavior
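
The sketch below shows one way to record control semantics explicitly next to the data, plus a deterministic delta-to-absolute conversion for positional actions. The mode and frame labels are assumptions, not a standard; use whatever your controller actually exposes.

```python
# Illustrative action specification recorded alongside the dataset, plus a
# deterministic conversion utility for positional delta actions.
import numpy as np

ACTION_SPEC = {
    "control_mode": "cartesian_position",  # vs "joint_position", "cartesian_velocity", ...
    "frame": "robot_base",                 # vs "camera", "end_effector", ...
    "is_delta": True,                      # increments vs absolute targets
    "units": {"linear": "m", "angular": "rad"},
}

def delta_to_absolute(prev_target: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Convert a positional delta action into an absolute target.

    Only valid when is_delta is True and both vectors share the same frame;
    orientation deltas need proper rotation composition, not addition.
    """
    return prev_target + delta
```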

5. Coverage and diversity

Why it matters

Offline learning fails when the dataset does not cover the state-action regions required at deployment. Extrapolation error cannot be corrected without new data.

Coverage must include

  • Tasks and goals

  • Scenes and backgrounds

  • Object placements

  • Viewpoints

  • Workspace locations

Common failure modes

  • Over-collection of “comfortable” tasks

  • Resets that converge to the same placements

  • Over-clean environments that don’t generalize

Mitigation

  • Track coverage metrics during collection

  • Enforce quotas across tasks and scenes

  • Hold out scenes/tasks for evaluation
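
A minimal sketch of quota tracking during collection, assuming each episode's metadata carries task_id and scene_id fields and that the planned (task, scene) bins are enumerated up front:

```python
# Sketch of coverage tracking during collection; field names are illustrative.
from collections import Counter

def under_covered_bins(episodes: list[dict],
                       planned_bins: list[tuple[str, str]],
                       quota_per_bin: int) -> list[tuple[str, str]]:
    """Return (task_id, scene_id) bins with fewer episodes than the quota."""
    counts = Counter((ep["task_id"], ep["scene_id"]) for ep in episodes)
    return [b for b in planned_bins if counts.get(b, 0) < quota_per_bin]

# Reviewing this list daily lets operators be routed toward under-covered
# tasks and scenes before the imbalance becomes expensive to fix.
```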

6. Failure, recovery, and non-success episodes

Why it matters

Robust manipulation depends on recovery. Failure trajectories are not noise—they are supervision.

Common failure modes

  • Failed episodes discarded

  • Failures present but unlabeled

  • Human recovery actions not captured

Mitigation

  • Record success indicator and failure type

  • Track recovery attempts and re-grasp counts

  • Include perturbations in evaluation
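
A sketch of structured outcome labels kept as fields rather than free-text notes; the failure taxonomy below is an example to adapt, not a standard:

```python
# Illustrative structured outcome labels for an episode.
from dataclasses import dataclass
from enum import Enum

class FailureType(Enum):
    NONE = "none"
    MISSED_GRASP = "missed_grasp"
    GRASP_SLIP = "grasp_slip"
    COLLISION = "collision"
    TIMEOUT = "timeout"
    OPERATOR_ABORT = "operator_abort"

@dataclass
class EpisodeOutcome:
    success: bool
    failure_type: FailureType = FailureType.NONE
    recovery_attempts: int = 0   # e.g. number of re-grasps within the episode
```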

7. Human inputs and teleoperation

Why it matters

Human demonstrations are often the fastest way to generate high-quality trajectories for complex manipulation. Operator behavior shapes the learned policy distribution.

Common failure modes

  • Operator identity not tracked

  • Control mappings change without logging

  • Human corrections not marked

Mitigation

  • Record operator/session identifiers (pseudonymized)

  • Log teleop device and mapping version

  • Mark human interventions explicitly
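
A sketch of teleoperation provenance fields; identifiers are pseudonymized and all names are illustrative:

```python
# Illustrative teleoperation provenance record stored per session.
from dataclasses import dataclass

@dataclass
class TeleopSession:
    session_id: str
    operator_id: str      # pseudonym such as "op_07", never a real name
    teleop_device: str    # e.g. "spacemouse", "vr_controller", "leader_arm"
    mapping_version: str  # version of the device-to-robot control mapping

# At the step level, mark corrections explicitly, for example:
# step["is_human_intervention"] = True   # human took over from the policy
```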

8. QA, documentation, and reproducibility

Why it matters

Robotics learning pipelines are vulnerable to silent corruption. Small inconsistencies can invalidate training without obvious errors.

Mitigation

  • Pre-session checks (sensors online, calibration freshness)

  • In-session monitoring (stream presence, latency)

  • Post-session validation (timestamps, alignment, units)

  • Versioned dataset releases with changelogs

  • Dataset datasheet describing intended and unintended use
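
These gates compose naturally into a post-session QA runner. The sketch below assumes each check is a function that takes an episode dict and returns a list of problem strings, like the hypothetical validators sketched earlier.

```python
# Sketch of a post-session QA runner; check names here are hypothetical.
import json

def run_qa(episodes: list[dict], checks: list) -> dict:
    report = {"n_episodes": len(episodes), "failures": []}
    for ep in episodes:
        for check in checks:
            problems = check(ep)
            if problems:
                report["failures"].append({
                    "episode_id": ep.get("episode_id"),
                    "check": check.__name__,
                    "problems": problems,
                })
    report["passed"] = not report["failures"]
    return report

# Gate the release on report["passed"] and ship the JSON report with the
# versioned release notes, e.g.:
# print(json.dumps(run_qa(episodes, [validate_structure]), indent=2))
```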

Formats and tooling

A dataset can be scientifically sound and still unusable if teams cannot load it efficiently or interpret it consistently.

Common formats (with tradeoffs)

  • Episode-based (RLDS / TFDS). Best for: large-scale sequential learning. Strengths: explicit semantics, standardized tooling. Weaknesses: requires schema discipline.

  • HDF5 (trajectory-based). Best for: compact Python workflows. Strengths: single artifact, easy inspection. Weaknesses: can become monolithic.

  • Parquet + MP4. Best for: very large datasets. Strengths: efficient streaming and analytics. Weaknesses: requires careful indexing.

  • ROS bags / MCAP. Best for: robotics-native logging. Strengths: strong tooling for ops/debug. Weaknesses: logs-first, needs semantic enrichment.

  • Raw folders. Best for: forensics only. Strengths: easy to generate. Weaknesses: high integration debt.
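
For the episode-based option, loading follows the RLDS convention of episodes that contain a nested steps dataset. The sketch below uses tensorflow_datasets with a placeholder dataset name; the exact observation and action keys depend on each dataset's own schema.

```python
# Loading an RLDS-style, episode-based dataset via tensorflow_datasets.
import tensorflow_datasets as tfds

ds = tfds.load("my_robot_dataset", split="train")  # placeholder name

for episode in ds.take(1):
    for step in episode["steps"]:      # nested tf.data.Dataset of steps
        observation = step["observation"]
        action = step["action"]
        is_terminal = step["is_terminal"]
```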

Recommended delivery package

Best practice is two-tier delivery:

  1. Training tier — learning-ready dataset in the target format with schema, loader, and normalization stats

  2. Raw tier — original sensor streams and calibration artifacts for reprocessing

Even teams that only want “training-ready” today eventually need raw data.
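
For the training tier's normalization stats, a minimal sketch of the kind of per-dimension action statistics worth shipping with the dataset so every consumer normalizes identically (filename and keys are illustrative):

```python
# Sketch of the "normalization stats" deliverable computed once over the
# training tier.
import json
import numpy as np

def action_stats(actions: np.ndarray) -> dict:
    """actions: (N, action_dim) array stacked over the whole training split."""
    return {
        "mean": actions.mean(axis=0).tolist(),
        "std": (actions.std(axis=0) + 1e-8).tolist(),
        "min": actions.min(axis=0).tolist(),
        "max": actions.max(axis=0).tolist(),
    }

# with open("action_stats.json", "w") as f:
#     json.dump(action_stats(all_actions), f, indent=2)
```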

Checklist and SOW language

Minimal episode-level metadata (recommended)

  • Dataset version

  • Episode ID

  • Robot ID

  • Operator ID (pseudonymized)

  • Task ID and/or language instruction

  • Scene ID

  • Success indicator

  • Failure type (if any)

  • Calibration reference

  • Modalities recorded

  • Per-stream sampling rates
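
One illustrative episode-level metadata record covering these fields; every key and value below is an example, not a prescription:

```python
# Example episode-level metadata record; keys and values are illustrative.
episode_metadata = {
    "dataset_version": "1.2.0",
    "episode_id": "ep_000342",
    "robot_id": "arm_A",
    "operator_id": "op_07",                     # pseudonymized
    "task_id": "put_mug_on_shelf",
    "language_instruction": "put the mug on the shelf",
    "scene_id": "kitchen_counter_03",
    "success": True,
    "failure_type": None,
    "calibration_ref": "calib_2025-06-01_arm_A",
    "modalities": ["rgb_wrist", "rgb_overhead", "joint_states", "actions"],
    "sampling_rates_hz": {"rgb_wrist": 30, "rgb_overhead": 30,
                          "joint_states": 100, "actions": 10},
}
```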

Minimal step-level fields

  • Step markers (start / end / terminal)

  • Timestamp

  • Robot state (joint, EE pose)

  • Observations (vision, depth, force if available)

  • Action (type, frame, value)

  • Optional annotations (contacts, interventions)
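
A sketch of a step-level record matching this field list; shapes, frames, and key names are assumptions to be pinned down in the dataset schema and its documentation:

```python
# Illustrative step-level record for a manipulation dataset.
from typing import Optional, TypedDict
import numpy as np

class Step(TypedDict, total=False):
    is_first: bool
    is_last: bool
    is_terminal: bool
    timestamp: float                 # seconds, shared clock
    joint_positions: np.ndarray      # (n_joints,)
    ee_pose: np.ndarray              # (7,) position + quaternion in robot base frame
    rgb: dict[str, np.ndarray]       # camera_id -> (H, W, 3) uint8
    depth: Optional[np.ndarray]      # (H, W), if available
    wrench: Optional[np.ndarray]     # (6,) force/torque, if available
    action: np.ndarray               # interpreted per the recorded action spec
    is_human_intervention: bool      # optional annotation
```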

Buyer’s checklist (compact)

  • Is the dataset episodic with explicit step markers? Prevents temporal bugs.

  • How are timestamps synchronized? Misalignment destroys learnability.

  • Is calibration tied to episodes/sessions? Geometry drifts over time.

  • What do the action vectors mean exactly? Action ambiguity breaks training.

  • Are failures and recoveries included? Required for robustness.

  • What QA gates exist? Prevents unusable data.

  • Is a loader provided? Reduces integration cost.

  • Is there versioning and documentation? Enables reproducibility.

Sample SOW paragraph (copy-paste)

Scope. Vendor will design and collect a learning-ready, episode-based robotic interaction dataset for the following tasks: [TASK LIST]. Data will be recorded on [HARDWARE SETUP] with synchronized multimodal sensing including vision, robot state, and control commands, along with per-episode metadata (task, scene, operator/session ID, success/failure labels). Vendor will deliver (1) a training-ready dataset in [FORMAT] with documented schema and reference loader, and (2) raw sensor streams for reprocessing. Vendor will implement QA gates including stream presence checks, calibration validation, and timestamp integrity checks, and will provide a QA report and versioned release notes.

How we approach learning-ready data

We focus on real-world interaction data for learning-based robotics systems, especially manipulation and contact-rich tasks.

What we deliver:

  • Multimodal, synchronized datasets

  • Human-in-the-loop teleoperation workflows

  • Task-driven dataset design up front

  • Continuous QA during collection

  • Transparent documentation and limitations

  • Delivery formats aligned with ML training pipelines

The real value is not added at “press record.”
It is added by integrating dataset design, operations, and ML-facing packaging so teams can train quickly and iterate safely.

TL;DR for CTOs

Learning-ready robot data is not “more logs.”

It is an engineered dataset contract—episodes, timing, calibration, action semantics, coverage, QA, and documentation—that prevents silent training failures and makes iteration reproducible.

If a vendor cannot show you schemas, loaders, QA gates, and versioned documentation, you are buying integration debt, not data.
