Why Data Format Matters More Than You Think
Robot training data format is not a detail you can defer. The format you choose determines which training frameworks you can use without conversion, which community datasets you can mix in, how easily you can inspect and debug episodes, and whether you can collaborate with other labs or commercial providers.
The robotics data ecosystem has three dominant formats, each associated with a different training framework community: HDF5 (ACT, ALOHA, most custom research pipelines), RLDS/TFRecord (Octo, Open X-Embodiment, TensorFlow-based pipelines), and LeRobot Parquet (Hugging Face LeRobot, growing fast). You will eventually need all three. The question is which to use as your primary storage format.
Our recommendation: use HDF5 as your source-of-truth collection format and convert to RLDS or LeRobot on demand. HDF5 is the most inspection-friendly format, has mature Python tooling, and can represent arbitrary episode structures without schema rigidity.
HDF5: Structure, Strengths, and Conventions
HDF5 (Hierarchical Data Format 5) stores data in a hierarchical filesystem-like structure. Each episode is a group; within each episode, datasets hold time-indexed arrays.
- Typical ALOHA/ACT HDF5 structure:
  - /episode_0/observations/images/cam_high — uint8 [T×480×640×3]
  - /episode_0/observations/images/cam_wrist_left — uint8 [T×480×640×3]
  - /episode_0/observations/qpos — float32 [T×14]
  - /episode_0/observations/qvel — float32 [T×14]
  - /episode_0/action — float32 [T×14]
- Chunking strategy: Chunk along the time axis (chunk size of 1 step for random access, or 32 steps for sequential-read efficiency). Always chunk — an unchunked HDF5 dataset is stored as a single contiguous block, so reading even one frame pulls the entire image array, making episode streaming impractical.
- Compression: Use GZIP level 1 or LZF for image data. LZF is 3–5× faster to compress/decompress than GZIP at similar ratios for camera data. Use GZIP level 4 for joint trajectories (higher compression ratio, speed not critical).
- Metadata attributes: Store episode metadata as HDF5 group attributes, e.g. episode_0.attrs['success'], episode_0.attrs['task'], episode_0.attrs['robot'], episode_0.attrs['operator_id'].
HDF5's weaknesses: no built-in versioning, no cloud streaming (must download full file to read), and schema incompatibilities between labs require manual normalization. These are manageable in practice — most episodes are <200 MB and schemas within a project are consistent.
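The layout, chunking, compression, and attribute conventions above can be sketched with h5py. The file name, camera set, and T=50 are placeholder values; shapes and settings follow the text.

```python
# Sketch of the HDF5 episode layout described above, using h5py.
import h5py
import numpy as np

T = 50  # timesteps in this example episode (placeholder)

with h5py.File("episode_0.hdf5", "w") as f:
    ep = f.create_group("episode_0")
    obs = ep.create_group("observations")
    imgs = obs.create_group("images")
    # Camera data: chunk along time (1 frame per chunk) and use LZF.
    imgs.create_dataset(
        "cam_high",
        data=np.zeros((T, 480, 640, 3), dtype=np.uint8),
        chunks=(1, 480, 640, 3),
        compression="lzf",
    )
    # Joint trajectories: GZIP level 4, 32-step chunks for sequential reads.
    obs.create_dataset(
        "qpos",
        data=np.zeros((T, 14), dtype=np.float32),
        chunks=(32, 14),
        compression="gzip",
        compression_opts=4,
    )
    ep.create_dataset("action", data=np.zeros((T, 14), dtype=np.float32))
    # Episode metadata as group attributes.
    ep.attrs["success"] = True
    ep.attrs["task"] = "pick_cube"

# Inspection: walk the hierarchy without loading array data.
with h5py.File("episode_0.hdf5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```

The final `visititems` walk is why HDF5 is called inspection-friendly here: the whole episode structure is visible in a few lines without decoding any image data.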
RLDS / TFRecord: Open X-Embodiment and Octo
RLDS (Reinforcement Learning Datasets) is the format used by the Open X-Embodiment dataset (22 robot types, 527K episodes) and the Octo and ORCA generalist policies. It serializes data as TFRecord files, processed via TensorFlow Datasets (TFDS) pipelines.
- Structure: Each dataset is a TFDS DatasetBuilder with a defined features schema. Each episode contains a steps field holding a nested sequence of steps; each step has observation, action, reward, and termination flags.
- Standard RLDS step structure:
  - observation: {image: uint8[H×W×C], state: float32[D]}
  - action: float32[D]
  - reward: float32
  - discount: float32
  - is_terminal: bool
  - is_last: bool
  - language_instruction: string
- Used by: Octo (open-source generalist policy from Berkeley), ORCA, RT-2 evaluation scripts, Open X-Embodiment data mix scripts.
- Strengths: Standardized schema enables mixing datasets across robots; efficient streaming via tf.data; community scale (50+ datasets available).
- Weaknesses: TensorFlow dependency is a burden if your training framework is PyTorch; schema is rigid — adding custom sensor types requires custom DatasetBuilder; inspection requires TF tooling.
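The step schema above can be illustrated in plain Python, without a TensorFlow dependency. Real RLDS data lives in TFRecord files read through tf.data/TFDS; this sketch only shows the per-step structure and the episode-to-step flattening that data-mix pipelines perform. All values and the instruction string are placeholders.

```python
# Plain-Python sketch of the RLDS step structure described above.
# Not the TFDS API — just the schema, with placeholder data.

def make_step(image, state, action, *, is_last=False, is_terminal=False,
              instruction="pick up the cube"):
    """Build one RLDS-style step dict following the standard schema."""
    return {
        "observation": {"image": image, "state": state},
        "action": action,
        "reward": 0.0,
        "discount": 1.0,
        "is_terminal": is_terminal,
        "is_last": is_last,
        "language_instruction": instruction,
    }

# A 3-step episode: only the final step sets is_last / is_terminal.
episode = [
    make_step([[0]], [0.0], [0.1]),
    make_step([[0]], [0.1], [0.2]),
    make_step([[0]], [0.2], [0.0], is_last=True, is_terminal=True),
]

# Data-mix pipelines flatten many episodes into one stream of steps.
stream = [step for ep in [episode] for step in ep]
print(len(stream), stream[-1]["is_last"])  # → 3 True
```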
LeRobot / Parquet: The Hugging Face Ecosystem
LeRobot (Hugging Face) uses Parquet files — a columnar data format common in data engineering — to store robot episodes. This enables zero-code upload to Hugging Face Hub and instant episode visualization via the built-in web viewer.
- Structure: Each dataset is a Hugging Face Dataset backed by Parquet shards. Episodes are indexed by episode_index; each row is a timestep with columns for each observation key, the action, and metadata.
- Video storage: LeRobot stores camera data as MP4 videos (not per-frame arrays) indexed by frame number. This reduces storage 5–10× vs. uncompressed HDF5 but adds decode latency during training.
- Metadata: A dataset_card YAML stores task name, robot type, FPS, features schema, and statistics (mean/std per action dimension).
- Strengths: One-command upload to Hugging Face; web viewer at hf.co/datasets/org/dataset; growing public dataset library (300+ datasets as of 2025); native ACT and Diffusion Policy training support in the lerobot library.
- Weaknesses: MP4 video encoding is lossy — not suitable as a source-of-truth format for contact-sensitive tasks; Parquet is not ideal for variable-length episodes (padding required); schema changes require dataset rebuild.
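The row-per-timestep layout can be illustrated without the lerobot library: each episode is flattened into rows keyed by episode_index and frame_index. Column names here follow LeRobot's conventions but should be treated as illustrative; the data values are placeholders.

```python
# Sketch: flattening an episode (dict of per-timestep lists) into
# LeRobot-style rows, one row per timestep, with episode/frame indices.
# Column names are illustrative; check the lerobot docs for exact keys.

def episode_to_rows(episode_index, episode):
    """Turn one episode into a list of columnar rows."""
    n = len(episode["action"])
    rows = []
    for t in range(n):
        rows.append({
            "episode_index": episode_index,
            "frame_index": t,
            "observation.state": episode["observation.state"][t],
            "action": episode["action"][t],
        })
    return rows

episode = {
    "observation.state": [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]],
    "action": [[0.1, 0.0], [0.1, 0.1], [0.0, 0.0]],
}
rows = episode_to_rows(0, episode)
print(len(rows), rows[1]["frame_index"])  # → 3 1
```

This row orientation is what makes the format Parquet-friendly: every column has a fixed type across all rows, at the cost of the padding issues noted above for variable-length episodes.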
ACT/ALOHA Format Details
The ACT paper and ALOHA hardware use a specific HDF5 variant that has become a de facto standard for leader-follower data:
- /observations/images/<camera_name> — uint8 [T×H×W×3]
- /observations/qpos — float32 [T×DOF] — actual arm joint positions
- /observations/qvel — float32 [T×DOF] — actual arm joint velocities
- /observations/effort — float32 [T×DOF] — motor current/torque estimates
- /action — float32 [T×DOF] — leader arm joint positions (the supervision signal)
- Key difference from generic HDF5: action is the leader arm's joint positions, not a Cartesian EEF pose. ACT operates in joint space. If your teleop system records Cartesian EEF poses, convert them to joint space via IK before training ACT.
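A cheap sanity check worth running on ACT-convention episodes: action and qpos must agree in both episode length and DOF, since the action is leader-arm joint positions supervising the follower's joint state. This stdlib-only sketch uses nested dicts standing in for HDF5 groups; the helper name and example shapes are made up for illustration.

```python
# Hypothetical validator for the ACT/ALOHA shape convention.
# Nested dicts stand in for HDF5 groups here.

def check_act_episode(ep):
    """Verify the ACT convention: /action is [T×DOF] joint positions
    matching /observations/qpos's [T×DOF]."""
    qpos = ep["observations"]["qpos"]
    action = ep["action"]
    assert len(action) == len(qpos), "action and qpos must share length T"
    dof = len(qpos[0])
    assert all(len(a) == dof for a in action), "action DOF must match qpos DOF"
    return len(qpos), dof

# A tiny placeholder episode: T=3, DOF=14 (bimanual ALOHA).
ep = {
    "observations": {"qpos": [[0.0] * 14] * 3},
    "action": [[0.0] * 14] * 3,
}
print(check_act_episode(ep))  # → (3, 14)
```

A mismatch here usually means the teleop pipeline logged Cartesian actions, which is exactly the case that needs IK conversion before ACT training.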
Conversion Tools
| Conversion | Tool / Command | Notes |
|---|---|---|
| HDF5 → LeRobot | lerobot.scripts.push_dataset_to_hub | Supports ALOHA HDF5 natively; custom datasets need adapter |
| HDF5 → RLDS | rlds_creator or custom DatasetBuilder | 2–4 hours to write a custom builder; schema mapping required |
| RLDS → LeRobot | lerobot.scripts.convert_dataset --dataset-type=rlds | Works for OXE datasets; custom RLDS may need field mapping |
| LeRobot → HDF5 | Custom script (lerobot provides dataset.hf_dataset access) | 30 min to write; use huggingface_hub to stream |
| ACT HDF5 → LeRobot | lerobot.scripts.convert_dataset --dataset-type=aloha | Native support; specify camera names |
| Custom → All | SVRC Platform export | Upload once, export to any format via UI |
The most reliable conversion path is: collect in HDF5 → upload to SVRC Platform → export in target format. The platform handles schema normalization, statistics computation, and format-specific encoding (MP4 for LeRobot, TFRecord for RLDS) automatically.
Best Practices and Recommendations
- Use HDF5 as your source of truth. It is the most inspection-friendly format and easiest to repair if an episode is corrupted. Never use LeRobot MP4 or TFRecord as primary storage.
- Include a data manifest file. Store a JSON file alongside your HDF5 files with: robot model, arm serial numbers, date, task name, operator IDs, success rate, and schema version. This becomes invaluable 6 months later.
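A manifest along these lines takes minutes to add. The field names and values below are a suggestion, not a fixed standard; adapt them to your lab's conventions.

```python
# Sketch of a data manifest written next to the HDF5 files.
# All field names and values are illustrative placeholders.
import json

manifest = {
    "schema_version": 1,
    "robot": "aloha_2",
    "arm_serials": ["L-0012", "R-0013"],
    "date": "2025-01-15",
    "task": "fold_towel",
    "operator_ids": ["op_03"],
    "num_episodes": 120,
    "success_rate": 0.87,
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Later, tooling can load and check it before training.
with open("manifest.json") as f:
    loaded = json.load(f)
print(loaded["task"])  # → fold_towel
```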
- Version your schema. Add a schema_version attribute to every HDF5 file root. When you change sensor configurations, bump the version and document the change.
- Store action in joint space, not Cartesian. Most modern policies (ACT, Diffusion Policy, π₀) operate in joint space. Storing Cartesian EEF poses as action requires an IK solve at training time — error-prone and computationally expensive.
- Do not compress images at collection time. Store raw uint8 RGB. Apply compression (GZIP, JPEG, or LZF) only in the final archived HDF5 after episode validation. Re-encoding later is trivial; recovering quality from lossy encoding is impossible.
- SVRC Platform exports all formats on demand. Upload your HDF5 to platform.roboticscenter.ai once and download as HDF5, LeRobot Parquet, RLDS TFRecord, or CSV without re-uploading.