How We Think About Real-World Evaluation
Robot learning teams rarely fail because they “don’t have enough data.”
They fail because the data they have is not learnable in the way imitation learning, offline RL, and large policy-training pipelines actually assume.
In other words: they have logs, not a dataset.
This problem is visible across the robotics tooling ecosystem. Episode-based dataset standards exist precisely because many robotics datasets lose the sequence of interaction, encode semantics inconsistently, or differ in small ways that introduce subtle, hard-to-debug failures.
A learning-ready dataset is therefore an engineered artifact with a clear contract:
Structure — explicit episodes, steps, and success/failure semantics
Timing and calibration — synchronized, geometrically interpretable multimodal streams
Action meaning — actions recorded in the same frames and control modes used at training and deployment
Coverage — diversity, failures, and recoveries that prevent offline learning collapse
QA and reproducibility — automated checks, documentation, and versioning so teams can iterate safely
This document is written for robotics startups and applied RL teams who need data they can actually train on. It also doubles as a buyer’s guide and SOW template for procuring learning-ready robot data.
What “learning-ready” means
When we say learning-ready, we mean:
A modeling team can plug the dataset into a modern training stack and obtain trustworthy training and evaluation signals without reverse-engineering the recording process.
Robotics data is sequential decision-making data. The atomic unit is not an image or annotation—it is an episode of interaction: observation → action → outcome, repeated over time.
Well-designed datasets preserve:
interaction order
step boundaries
termination semantics
task meaning
Learning-ready data is not just stored. It is packaged, validated, and documented so downstream users do not accidentally train on misaligned timestamps, ambiguous actions, or broken calibration assumptions.
Core pillars of learning-ready robot data
Below are the pillars used to evaluate whether a dataset will support imitation learning, offline RL, and large multi-task policy training.
1. Structure and episode semantics
Why it matters
Sequential learners consume trajectories. If episode boundaries, resets, or termination semantics are ambiguous, you cannot reliably compute returns, segment behavior, or evaluate policies consistently.
Common failure modes
Episodes defined inconsistently across operators
Implicit resets that are not logged
Success/failure semantics kept in notes instead of fields
Mitigation
Explicit episode contract: start condition, termination condition, success criteria
Enforce it operationally (in tooling), not post-hoc
Validation
Exactly one episode start and one termination per episode
Monotonic step indices
Reset feasibility checks (nothing stuck, robot can be reset)
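As a concrete sketch, the checks above can run at ingest time. This assumes episodes are stored as a list of step dicts with is_first/is_last markers, a step_index field, and a success field on the terminal step; the field names are illustrative, not a fixed standard.

```python
def validate_episode_structure(steps):
    """Return human-readable structural violations for one episode (a list of step dicts)."""
    errors = []
    if not steps:
        return ["episode has no steps"]

    # Exactly one episode start and exactly one termination marker.
    n_first = sum(bool(s.get("is_first")) for s in steps)
    n_last = sum(bool(s.get("is_last")) for s in steps)
    if n_first != 1:
        errors.append(f"expected exactly 1 start marker, found {n_first}")
    if n_last != 1:
        errors.append(f"expected exactly 1 termination marker, found {n_last}")

    # Step indices must be contiguous and monotonic; gaps usually mean dropped data.
    indices = [s.get("step_index") for s in steps]
    if None in indices or indices != list(range(indices[0], indices[0] + len(indices))):
        errors.append("step indices are missing, non-contiguous, or out of order")

    # Success/failure must live in a field, not in operator notes.
    if "success" not in steps[-1]:
        errors.append("terminal step has no explicit success field")

    return errors
```

Running a check like this per episode before anything is written to the training tier means a malformed episode is quarantined rather than silently included.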
2. Timing and multi-modal synchronization
Why it matters
For robot learning, time alignment is supervision. Vision, proprioception, and actions must correspond to the same physical moment—or a precisely defined offset.
“Close enough” alignment often produces policies that train but plateau due to silent mispairing.
Common failure modes
Camera buffering lag relative to control
Mixed clocks across machines
Dropped frames that shift alignment
Missing camera feeds mid-episode
Mitigation
Timestamp streams at the source and at logging
Enforce stream-presence assertions (expected camera count per step)
Cross-modal sanity checks (e.g., commanded vs observed velocity alignment)
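A minimal per-step timing check along these lines, assuming each step carries per-stream capture timestamps on a shared clock; the stream names and the 20 ms skew budget are illustrative choices, not requirements.

```python
def check_step_timing(step, expected_cameras=("wrist_cam", "overhead_cam"), max_skew_s=0.02):
    """Flag missing camera streams and cross-modal timestamp skew for a single step.

    Assumes each step carries per-stream capture timestamps in seconds, e.g.
    step["timestamps"] = {"wrist_cam": t0, "overhead_cam": t1, "robot_state": t2}.
    """
    issues = []
    stamps = step.get("timestamps", {})

    # Stream-presence assertion: every expected camera must have produced a frame.
    for cam in expected_cameras:
        if cam not in stamps:
            issues.append(f"missing stream: {cam}")

    # Cross-modal skew: all streams in a step should refer to (nearly) the same moment.
    if len(stamps) >= 2:
        skew = max(stamps.values()) - min(stamps.values())
        if skew > max_skew_s:
            issues.append(f"cross-stream skew {skew * 1e3:.1f} ms exceeds {max_skew_s * 1e3:.0f} ms budget")

    return issues
```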
3. Calibration and geometry
Why it matters
If models depend on geometry—depth, multi-view, hand-eye transforms—calibration is not metadata. It is required to interpret the dataset meaningfully.
Common failure modes
Camera extrinsics drift between sessions
Calibration present but not tied to specific episodes
No way to assess calibration quality
Mitigation
Recalibrate after any camera movement
Store calibration per session or episode
Track reprojection error and quality metrics
Schema expectation
Camera intrinsics
Camera-to-robot extrinsics
Calibration version and quality metrics
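One way to make that expectation concrete is a per-session calibration record that episodes reference by version. The fields below are an illustrative sketch, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class CameraCalibration:
    """Per-session calibration record referenced by episode metadata (illustrative fields)."""
    camera_id: str
    calibration_version: str            # e.g. a content hash or "2024-05-13-a"
    intrinsics: list                    # 3x3 K matrix, row-major
    distortion: list                    # lens distortion coefficients
    extrinsics_cam_to_base: list        # 4x4 camera-to-robot-base transform, row-major
    mean_reprojection_error_px: float   # quality metric recorded at calibration time
    calibrated_at: str                  # ISO-8601 timestamp
    notes: str = ""
```

Episodes then carry only the calibration reference, so every image can be traced to the exact calibration that was valid when it was recorded.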
4. Actions and control semantics
Why it matters
Actions are only labels if they mean the same thing at training and deployment.
Two identically shaped vectors can represent completely different behaviors depending on:
control mode
coordinate frame
absolute vs delta semantics
Common failure modes
Logging commanded actions while the executed motion is filtered or saturated downstream
Mixing absolute and relative actions
Unspecified coordinate frames
Mitigation
Record control mode and frame explicitly
Record executed state alongside commands
Provide deterministic conversion utilities
Replay actions to sanity-check behavior
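A sketch of what a deterministic conversion utility can look like, assuming end-effector position actions recorded with explicit control_mode, frame, and space fields (the field names are illustrative):

```python
import numpy as np

def delta_to_absolute(ee_position, action):
    """Convert a logged action to an absolute end-effector position target.

    Assumes `action` is a dict recorded with explicit semantics, e.g.
    {"control_mode": "ee_position", "frame": "base", "space": "delta", "value": [dx, dy, dz]}.
    """
    if action.get("control_mode") != "ee_position" or action.get("frame") != "base":
        raise ValueError(f"unsupported action semantics: {action}")

    value = np.asarray(action["value"], dtype=np.float64)
    if action.get("space") == "delta":
        return np.asarray(ee_position, dtype=np.float64) + value
    if action.get("space") == "absolute":
        return value
    raise ValueError(f"unknown action space: {action.get('space')!r}")
```

The key property is that the conversion fails loudly instead of guessing: nothing downstream should have to infer whether a vector was a delta or an absolute target.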
5. Coverage and diversity
Why it matters
Offline learning fails when the dataset does not cover the state-action regions required at deployment. Extrapolation error cannot be corrected without new data.
Coverage must include
Tasks and goals
Scenes and backgrounds
Object placements
Viewpoints
Workspace locations
Common failure modes
Over-collection of “comfortable” tasks
Resets that converge to the same placements
Overly clean environments that don’t reflect deployment conditions
Mitigation
Track coverage metrics during collection
Enforce quotas across tasks and scenes
Hold out scenes/tasks for evaluation
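Coverage tracking does not need heavy tooling. A minimal sketch that counts episodes per (task, scene) cell and flags cells below quota, assuming those IDs exist in episode metadata:

```python
from collections import Counter

def coverage_report(episodes, quota_per_cell=50):
    """Count episodes per (task_id, scene_id) cell and flag cells below quota.

    Assumes each episode's metadata carries task_id and scene_id fields.
    """
    counts = Counter((ep["task_id"], ep["scene_id"]) for ep in episodes)
    under_quota = {cell: n for cell, n in counts.items() if n < quota_per_cell}
    return counts, under_quota
```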
6. Failure, recovery, and non-success episodes
Why it matters
Robust manipulation depends on recovery. Failure trajectories are not noise—they are supervision.
Common failure modes
Failed episodes discarded
Failures present but unlabeled
Human recovery actions not captured
Mitigation
Record success indicator and failure type
Track recovery attempts and re-grasp counts
Include perturbations in evaluation
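A small example of outcome labeling, with an illustrative failure taxonomy and per-episode recovery counters; the categories should be tailored to the actual task list.

```python
from enum import Enum

class FailureType(str, Enum):
    """Example failure taxonomy; adjust categories to the tasks being collected."""
    NONE = "none"                    # successful episode
    GRASP_SLIP = "grasp_slip"
    OBJECT_DROPPED = "object_dropped"
    COLLISION = "collision"
    TIMEOUT = "timeout"
    OPERATOR_ABORT = "operator_abort"

# Outcome fields recorded on every episode, including failures.
episode_outcome = {
    "success": False,
    "failure_type": FailureType.GRASP_SLIP.value,
    "recovery_attempts": 2,   # how many times the operator recovered and retried
    "regrasp_count": 3,
}
```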
7. Human inputs and teleoperation
Why it matters
Human demonstrations are often the fastest way to generate high-quality trajectories for complex manipulation. Operator behavior shapes the learned policy distribution.
Common failure modes
Operator identity not tracked
Control mappings change without logging
Human corrections not marked
Mitigation
Record operator/session identifiers (pseudonymized)
Log teleop device and mapping version
Mark human interventions explicitly
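An illustrative session record and step-level flag covering these points; the field names are assumptions, not a standard.

```python
teleop_session = {
    "operator_id": "op_031",          # pseudonymized identifier, never a real name
    "session_id": "2024-06-02_A",
    "teleop_device": "spacemouse",    # e.g. spacemouse, VR controller, leader arm
    "control_mapping_version": "v3",  # bump whenever gains or axis mappings change
}

# Step-level flag so human corrections are separable from autonomous execution.
step_annotation = {"human_intervention": True, "intervention_reason": "regrasp correction"}
```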
8. QA, documentation, and reproducibility
Why it matters
Robotics learning pipelines are vulnerable to silent corruption. Small inconsistencies can invalidate training without obvious errors.
Mitigation
Pre-session checks (sensors online, calibration freshness)
In-session monitoring (stream presence, latency)
Post-session validation (timestamps, alignment, units)
Versioned dataset releases with changelogs
Dataset datasheet describing intended and unintended use
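A sketch of how those stages can be wired as explicit gates; the check methods here are hypothetical placeholders for whatever the collection stack exposes, and any failed check blocks the release.

```python
def run_qa_gates(session):
    """Run pre-, in-, and post-session checks; return (passed, report).

    `session` is assumed to expose the checks described above; every check
    returns a list of issue strings.
    """
    report = {
        "pre_session": session.check_sensors_online() + session.check_calibration_freshness(),
        "in_session": session.check_stream_presence() + session.check_latency_budget(),
        "post_session": session.check_timestamp_integrity() + session.check_units_and_frames(),
    }
    passed = all(not issues for issues in report.values())
    return passed, report
```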
Formats and tooling
A dataset can be scientifically sound and still unusable if teams cannot load it efficiently or interpret it consistently.
Common formats (with tradeoffs)
Episode-based (RLDS / TFDS). Best for: large-scale sequential learning. Strengths: explicit semantics, standardized tooling. Weaknesses: requires schema discipline.
HDF5 (trajectory-based). Best for: compact Python workflows. Strengths: single artifact, easy inspection. Weaknesses: can become monolithic.
Parquet + MP4. Best for: very large datasets. Strengths: efficient streaming and analytics. Weaknesses: requires careful indexing.
ROS bags / MCAP. Best for: robotics-native logging. Strengths: strong tooling for ops/debug. Weaknesses: logs-first, needs semantic enrichment.
Raw folders. Best for: forensics only. Strengths: easy to generate. Weaknesses: high integration debt.
Recommended delivery package
Best practice is two-tier delivery:
Training tier — learning-ready dataset in the target format with schema, loader, and normalization stats
Raw tier — original sensor streams and calibration artifacts for reprocessing
Even teams that only want the training tier today eventually need the raw tier for reprocessing.
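As an illustration of what a reference loader means in practice, a minimal loader for an HDF5-style training tier is sketched below; the file layout and dataset names are assumptions and should follow the delivered schema.

```python
import h5py
import numpy as np

def iter_episodes(path):
    """Yield (metadata, steps) pairs from an HDF5 training-tier file.

    Assumes a layout like /episode_0000/{actions, observations/ee_pose, timestamps}
    with episode metadata stored as HDF5 attributes.
    """
    with h5py.File(path, "r") as f:
        for name in sorted(f.keys()):
            group = f[name]
            metadata = dict(group.attrs)
            steps = {
                "actions": np.asarray(group["actions"]),
                "ee_pose": np.asarray(group["observations/ee_pose"]),
                "timestamps": np.asarray(group["timestamps"]),
            }
            yield metadata, steps
```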
Checklist and SOW language
Minimal episode-level metadata (recommended)
Dataset version
Episode ID
Robot ID
Operator ID (pseudonymized)
Task ID and/or language instruction
Scene ID
Success indicator
Failure type (if any)
Calibration reference
Modalities recorded
Per-stream sampling rates
Minimal step-level fields
Step markers (start / end / terminal)
Timestamp
Robot state (joint, EE pose)
Observations (vision, depth, force if available)
Action (type, frame, value)
Optional annotations (contacts, interventions)
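The two lists above, expressed as typed records; this is a sketch, and the field names and types are illustrative rather than a delivered schema.

```python
from typing import Optional, TypedDict

class EpisodeMetadata(TypedDict, total=False):
    dataset_version: str
    episode_id: str
    robot_id: str
    operator_id: str              # pseudonymized
    task_id: str
    language_instruction: Optional[str]
    scene_id: str
    success: bool
    failure_type: Optional[str]
    calibration_reference: str    # version of the calibration record in effect
    modalities: list              # e.g. ["rgb", "depth", "joint_state", "wrench"]
    sampling_rates_hz: dict       # stream name -> sampling rate

class Step(TypedDict, total=False):
    # Markers and timing
    is_first: bool
    is_last: bool
    is_terminal: bool
    timestamp: float              # seconds, shared clock across streams

    # Robot state
    joint_positions: list         # radians, in the documented joint order
    ee_pose: list                 # xyz + quaternion, base frame

    # Observations
    rgb: dict                     # camera_id -> image reference or array
    depth: Optional[dict]
    wrench: Optional[list]        # force/torque, if available

    # Action with explicit semantics
    action_value: list
    action_space: str             # "delta" or "absolute"
    action_frame: str             # e.g. "base", "ee"
    control_mode: str             # e.g. "ee_position", "joint_velocity"

    # Optional annotations
    contact: Optional[bool]
    human_intervention: Optional[bool]
```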
Buyer’s checklist (compact)
Each question is followed by why it matters.
Is the dataset episodic with explicit step markers? Prevents temporal bugs.
How are timestamps synchronized? Misalignment destroys learnability.
Is calibration tied to episodes/sessions? Geometry drifts over time.
What do the action vectors mean exactly? Action ambiguity breaks training.
Are failures and recoveries included? Required for robustness.
What QA gates exist? Prevents unusable data.
Is a loader provided? Reduces integration cost.
Is there versioning and documentation? Enables reproducibility.
Sample SOW paragraph (copy-paste)
Scope. Vendor will design and collect a learning-ready, episode-based robotic interaction dataset for the following tasks: [TASK LIST]. Data will be recorded on [HARDWARE SETUP] with synchronized multimodal sensing including vision, robot state, and control commands, along with per-episode metadata (task, scene, operator/session ID, success/failure labels). Vendor will deliver (1) a training-ready dataset in [FORMAT] with documented schema and reference loader, and (2) raw sensor streams for reprocessing. Vendor will implement QA gates including stream presence checks, calibration validation, and timestamp integrity checks, and will provide a QA report and versioned release notes.
How we approach learning-ready data
We focus on real-world interaction data for learning-based robotics systems, especially manipulation and contact-rich tasks.
What we deliver:
Multimodal, synchronized datasets
Human-in-the-loop teleoperation workflows
Task-driven dataset design up front
Continuous QA during collection
Transparent documentation and limitations
Delivery formats aligned with ML training pipelines
The real value is not added at “press record.”
It comes from integrating dataset design, operations, and ML-facing packaging so teams can train quickly and iterate safely.
TL;DR for CTOs
Learning-ready robot data is not “more logs.”
It is an engineered dataset contract—episodes, timing, calibration, action semantics, coverage, QA, and documentation—that prevents silent training failures and makes iteration reproducible.
If a vendor cannot show you schemas, loaders, QA gates, and versioned documentation, you are buying integration debt, not data.