RoboMimic: Complete Guide to the Robot Imitation Learning Benchmark (2026)

RoboMimic is a widely used benchmark for evaluating imitation learning and offline RL algorithms on robotic manipulation tasks. If you are developing or evaluating a new policy learning method, readers will often expect to see results on RoboMimic tasks. This guide covers the datasets, algorithms, how to run experiments, and how to collect new RoboMimic-compatible data using SVRC hardware.

What Is RoboMimic?

RoboMimic is a framework and benchmark for robotic learning from demonstrations, developed at Stanford University by Ajay Mandlekar et al. and published at CoRL 2021. It provides three things in one package: a curated collection of demonstration datasets with varying operator quality, a set of imitation learning and offline RL algorithm implementations, and a standardized evaluation protocol that makes cross-paper comparisons meaningful.

Before RoboMimic, comparing imitation learning papers was difficult because every group collected their own proprietary datasets with different operators, different quality standards, and different evaluation protocols. A paper claiming 80% success rate might be using easy demonstrations from experts, while another claiming 60% used messy demonstrations from novices — the numbers were not comparable. RoboMimic solved this by providing a fixed public dataset that everyone trains on, making benchmark numbers directly comparable across papers and implementations.

RoboMimic uses the robosuite simulation environment (built on MuJoCo) for evaluation, which means you can reproduce results exactly without physical hardware — though the framework supports physical robot transfer as well. SVRC uses RoboMimic as the primary benchmark for evaluating algorithm performance before deploying policies to physical hardware.

The RoboMimic Datasets

This guide covers four RoboMimic simulation tasks, each with multiple dataset variants collected from operators of different skill levels (the benchmark also includes a fifth task, Tool Hang, which ships with PH data only). This multi-quality structure is the benchmark's most important design decision: it lets you test how robust your algorithm is to dataset quality, which matters in practice because real-world data collection produces mixed-quality demonstrations.

Tasks

Task	Description	Difficulty	Demo Count
Lift	Pick a small cube off a table and raise it above a height threshold	Easy	200 (PH), 300 (MH)
Can	Pick a soda can from a bin and place it in a designated target region	Medium	200 (PH), 300 (MH)
Square	Pick a square nut and fit it onto a peg; requires precise insertion alignment	Hard	200 (PH), 300 (MH)
Transport	Two-arm task: pick a hammer from a closed bin, transfer between arms, place in target bin	Very Hard	200 (PH), 300 (MH)

Dataset Variants

Each task ships with dataset variants that differ in who, or what, produced the demonstrations:

PH (Proficient-Human): 200 demonstrations from a single highly skilled operator who completes every trial successfully with consistent, smooth motions. This is the easiest dataset to learn from: low variance, high quality.
MH (Multi-Human): 300 demonstrations from 6 operators of varying skill levels, grouped as 2 "worse", 2 "okay", and 2 "better" operators with 50 demonstrations each. The operators use different strategies and differ in proficiency. This is the most realistic dataset, because real-world collections always have mixed quality.
MH subsets: The worse/okay/better groupings are stored as filter keys inside each MH file, letting you study how quality filters affect performance.
MG (Machine-Generated): Rollouts from reinforcement learning agent checkpoints saved at different stages of training, released for Lift and Can, for studying learning from suboptimal non-human data.

Each variant is available with low-dim (state) observations and with image observations, so visual policies are evaluated on the same demonstrations.

For most research purposes, the MH dataset is the most informative benchmark because it tests both the algorithm's ability to cope with lower-quality demonstrations and its ability to learn from multimodal data. Algorithms that score high on PH but much lower on MH are brittle to demonstration quality and are unlikely to hold up under real-world collection conditions.

File Format and Download

RoboMimic datasets are stored in HDF5 format with a standard schema. Each file is named by task and operator quality:

# HDF5 file structure for a single RoboMimic demo file
lift_ph_low_dim.hdf5
  /data/
    demo_0/
      /obs/
        robot0_eef_pos          # (T, 3) end-effector position
        robot0_eef_quat         # (T, 4) end-effector orientation
        robot0_gripper_qpos     # (T, 2) gripper joint positions
        object                  # (T, 10) for Lift: object pose + relative position; size varies by task
      /actions                  # (T, 7) delta EEF pos + ori + gripper
      /rewards                  # (T,) per-step rewards (sparse 1.0 at success)
      /dones                    # (T,) episode termination flags
    demo_1/
      ...
  /mask/
    train                       # demo names (e.g. demo_0) in the train split
    valid                       # demo names in the validation split
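
The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of RoboMimic: `summarize_demos`, `split_names`, and `summarize_file` are hypothetical helper names, and the code assumes that mask entries are demo names stored as byte strings and that a successful demo reaches a reward of 1.0. The helpers are duck-typed, so they accept either an open `h5py.File` or plain nested dicts with the same shape.

```python
def summarize_demos(f):
    """Summarize each demo under /data (works on h5py.File or nested dicts)."""
    summary = {}
    for name in f["data"].keys():
        demo = f["data"][name]
        summary[name] = {
            "steps": len(demo["actions"]),                  # T, the trajectory length
            "obs_keys": sorted(demo["obs"].keys()),         # e.g. robot0_eef_pos, object
            "success": bool(max(demo["rewards"]) >= 1.0),   # assumes sparse 1.0 at success
        }
    return summary


def split_names(f, split):
    """Return the entries of /mask/<split> as plain strings."""
    return [k.decode() if isinstance(k, bytes) else str(k) for k in f["mask"][split]]


def summarize_file(path):
    """Open a real dataset file; h5py is third-party, so it is imported lazily."""
    import h5py
    with h5py.File(path, "r") as f:
        return summarize_demos(f)


# Tiny in-memory stand-in with the same nesting as a dataset file.
fake = {
    "data": {
        "demo_0": {
            "obs": {"robot0_eef_pos": [[0.0, 0.0, 0.0]] * 3, "object": [[0.0] * 10] * 3},
            "actions": [[0.0] * 7] * 3,
            "rewards": [0.0, 0.0, 1.0],
            "dones": [0, 0, 1],
        }
    },
    "mask": {"train": [b"demo_0"], "valid": []},
}
print(summarize_demos(fake))
print(split_names(fake, "train"))
```

On a downloaded file, `summarize_file("/path/to/low_dim.hdf5")` would return the same kind of summary, provided `h5py` is installed.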

Download the datasets from the official RoboMimic website at https://robomimic.github.io/docs/datasets/robomimic_v0.1.html. Files range from 50MB (lift_ph_low_dim) to 14GB (transport_mh_image). The low-dim variants (state observations only) train much faster and are sufficient for algorithm development; the image variants are required for visual policy evaluation.

Algorithms: BC, BC-RNN, HBC, and IRIS

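This guide focuses on four of RoboMimic's reference baselines. The framework also ships offline RL algorithms such as BCQ and CQL, which are not covered here. Understanding what each algorithm does and when it works is essential before choosing which to train.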

BC (Behavioral Cloning)

The simplest baseline: a feedforward MLP policy that minimizes MSE between predicted and demonstrated actions. Observation: current robot state (and/or image). Action: delta end-effector pose. No temporal context beyond the current frame. BC performs reasonably on Lift (easy task, low multimodality) but degrades significantly on Can and Square because it has no memory of recent trajectory history and cannot handle demonstration variability.

When it works: Tasks with low variance demonstrations, short horizon (single grasp), no multi-step coordination requirements.

When it fails: Any task where the robot needs to remember what it just did (common in multi-step tasks), or where demonstrations vary in strategy.
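
The strategy-variation failure can be reproduced in a few lines. This is a toy sketch under stated assumptions, not RoboMimic code: the "policy" is a single scalar prediction for one observation, and the demonstrations split evenly between two strategies (actions -1 and +1). Minimizing MSE drives the prediction to the mean, an action no demonstrator ever took.

```python
def mse(pred, targets):
    """Mean squared error between one predicted action and the demonstrated actions."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


def fit_constant_action(targets, lr=0.1, steps=200):
    """Gradient descent on the MSE objective for a single scalar prediction."""
    pred = 0.3  # arbitrary starting point
    for _ in range(steps):
        grad = sum(2 * (pred - t) for t in targets) / len(targets)
        pred -= lr * grad
    return pred


# Half the operators go left (-1.0), half go right (+1.0) from the same observation.
demo_actions = [-1.0] * 50 + [1.0] * 50
pred = fit_constant_action(demo_actions)
print("learned action:", round(pred, 3))
print("remaining MSE:", round(mse(pred, demo_actions), 3))
```

The learned action sits near zero, halfway between the two strategies, and the loss cannot drop below 1.0, which is the mode-averaging behavior that hurts plain BC on mixed-strategy data.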

BC-RNN (Behavioral Cloning with Recurrent Networks)

BC with an LSTM backbone instead of an MLP. The LSTM maintains a hidden state across timesteps, giving the policy temporal memory. This significantly improves performance on multi-step tasks like Can and Transport where knowing the history of arm movements is essential for deciding what to do next. BC-RNN with 2-layer LSTM (400 hidden units per layer) is the strongest standard BC baseline and the most commonly reported number in imitation learning papers.

When it works: Multi-step tasks, tasks with temporal dependencies, tasks where the robot needs to remember previous contact states.

When it fails: Tasks with high multimodality (multiple valid strategies) — the LSTM averages across strategies in its hidden state, leading to mode-averaging artifacts on the MH dataset.
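
To make "temporal memory" concrete, here is a toy recurrent rollout. It is a sketch only: the cell is a single hand-weighted tanh unit, not the LSTM that BC-RNN uses, and the weights are picked by hand rather than learned. The point is that two episodes ending in the same observation produce different actions when the history differs, while a feedforward policy cannot tell them apart.

```python
import math


def recurrent_rollout(observations, w_x=1.0, w_h=0.9, w_out=1.0):
    """Carry a hidden state across timesteps; the action depends on the history."""
    h = 0.0
    actions = []
    for x in observations:
        h = math.tanh(w_h * h + w_x * x)  # hidden state mixes the past with the present
        actions.append(w_out * h)
    return actions


def feedforward_rollout(observations, w_x=1.0):
    """No hidden state; the action depends on the current observation only."""
    return [math.tanh(w_x * x) for x in observations]


# Two hypothetical episodes that end in the same observation (0.5).
history_a = [0.0, 0.0, 0.5]
history_b = [1.0, 1.0, 0.5]

print("feedforward:", feedforward_rollout(history_a)[-1], feedforward_rollout(history_b)[-1])
print("recurrent:  ", recurrent_rollout(history_a)[-1], recurrent_rollout(history_b)[-1])
```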

HBC (Hierarchical Behavioral Cloning)

A two-level hierarchical policy: a high-level "planner" predicts subgoal states (e.g., intermediate configurations of the robot), and a low-level "controller" learns to reach each subgoal. The hierarchy provides an inductive structure that helps on long-horizon tasks — the planner breaks the task into stages, and the controller handles the fine-grained motor control within each stage. HBC typically outperforms BC-RNN on Transport and Square (long-horizon tasks) but requires more careful hyperparameter tuning, particularly the subgoal horizon length.
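
The planner/controller split and the role of the subgoal horizon can be sketched on a 1-D toy problem. Everything here is illustrative: `plan_subgoal` and `control` are hand-written stand-ins for the learned networks, and `subgoal_horizon` is a hypothetical parameter name for how many low-level steps run between replans.

```python
def clip(x, limit):
    return max(-limit, min(limit, x))


def plan_subgoal(state, goal, max_jump=1.0):
    """High-level stand-in: propose a subgoal at most max_jump from the current state."""
    return state + clip(goal - state, max_jump)


def control(state, subgoal, max_step=0.25):
    """Low-level stand-in: take one bounded step toward the current subgoal."""
    return clip(subgoal - state, max_step)


def run_hierarchy(state, goal, subgoal_horizon=4, max_steps=40):
    """Replan every subgoal_horizon steps; the controller runs at every step."""
    subgoals = []
    for t in range(max_steps):
        if t % subgoal_horizon == 0:
            subgoal = plan_subgoal(state, goal)
            subgoals.append(subgoal)
        state += control(state, subgoal)
        if abs(goal - state) < 1e-9:
            break
    return state, subgoals


final_state, subgoals = run_hierarchy(state=0.0, goal=3.0)
print(final_state, subgoals)
```

With a horizon that is too short the controller never reaches each subgoal before the planner moves on; with one that is too long it idles at a subgoal it has already reached. That trade-off is why the subgoal horizon needs tuning.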

IRIS (Implicit Reinforcement without Interaction at Scale)

IRIS extends the hierarchical approach in HBC with value learning. A conditional VAE proposes candidate subgoals, a value function learned from the offline data scores them, and the highest-valued subgoal is handed to a goal-conditioned low-level controller. Selecting subgoals by value lets the policy favor the better demonstrations in a dataset rather than imitating all of them equally. In the reference results later in this guide, IRIS posts the strongest numbers of the four algorithms, particularly on MH (mixed human quality) datasets where this selection step provides robustness to demonstration variability.
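
One ingredient of IRIS, as described in the IRIS paper, is choosing among several candidate subgoals using a learned value estimate. The snippet below is a toy sketch of that selection step only: the candidates and `toy_value` are made up for illustration, standing in for samples from a learned proposal model and a learned value function.

```python
def select_subgoal(candidates, value_fn):
    """Pick the candidate subgoal with the highest estimated value."""
    return max(candidates, key=value_fn)


def toy_value(subgoal, lift_height=0.8):
    """Hypothetical value estimate: higher when the subgoal is near the lift height."""
    return -abs(lift_height - subgoal)


# Candidate subgoals: 1-D gripper heights in metres (illustrative values).
candidates = [0.2, 0.5, 0.75, 1.1]
best = select_subgoal(candidates, toy_value)
print(best)
```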

Adding External Algorithms to RoboMimic

RoboMimic's framework makes it straightforward to add your own algorithm as a new "algo" class. Diffusion policy, ACT, and most recent imitation learning methods have been adapted to use the RoboMimic data format and evaluation protocol. The key integration requirement is subclassing RoboMimic's Algo base class and implementing its training methods and get_action(obs_dict); at evaluation time the framework wraps the trained algo in a RolloutPolicy, which feeds it standardized observation dicts and passes the returned delta end-effector actions to the environment. SVRC maintains RoboMimic wrappers for both diffusion policy and ACT in our internal platform.
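
The shape of that contract (observation dict in, 7-dimensional action out) can be sketched without the framework. `ReachPolicy` below is a hypothetical stand-in, not a RoboMimic class: it does not subclass anything from the library, and the conventions it assumes (actions normalized to [-1, 1], a gripper command of -1 meaning "open") should be checked against your environment.

```python
def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


class ReachPolicy:
    """Hypothetical policy exposing get_action(obs_dict) -> 7-dim delta action."""

    def __init__(self, target_pos, gain=2.0):
        self.target_pos = target_pos  # desired end-effector position (x, y, z)
        self.gain = gain              # proportional gain on the position error

    def reset(self):
        """Clear per-episode state; this policy is stateless, so nothing to do."""

    def get_action(self, obs_dict):
        eef = obs_dict["robot0_eef_pos"]
        dpos = [clip(self.gain * (t - p)) for t, p in zip(self.target_pos, eef)]
        drot = [0.0, 0.0, 0.0]        # no rotation command in this sketch
        gripper = [-1.0]              # assumed to mean "open"
        return dpos + drot + gripper


policy = ReachPolicy(target_pos=[0.5, 0.0, 1.0])
policy.reset()
action = policy.get_action({"robot0_eef_pos": [0.0, 0.0, 1.0]})
print(action)
```

A real integration would put the same logic inside the library's algo class so that the training loop and rollout code can call it.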

How to Run RoboMimic Experiments

# Install RoboMimic
pip install robomimic

# Or from source for the latest version
git clone https://github.com/ARISE-Initiative/robomimic.git
cd robomimic && pip install -e .

# Install robosuite (simulation environment)
pip install robosuite

# Download a dataset (example: Lift, proficient human, low-dim observations)
python robomimic/scripts/download_datasets.py \
  --tasks lift \
  --dataset_types ph \
  --hdf5_types low_dim

# Generate the paper's training configs for all tasks and algorithms
# (JSON files are written under robomimic/exps/paper/)
python robomimic/scripts/generate_paper_configs.py \
  --output_dir /tmp/robomimic_experiments

# Train using one of the generated configs
python robomimic/scripts/train.py \
  --config robomimic/exps/paper/core/lift/ph/low_dim/bc_rnn.json

# Evaluate a trained checkpoint
python robomimic/scripts/run_trained_agent.py \
  --agent /path/to/checkpoint.pth \
  --n_rollouts 50 \
  --video_path /tmp/eval_video.mp4

Training BC-RNN on Lift PH low-dim takes approximately 30 minutes on a single RTX 3090. Training on image observations takes 4–6 hours for the same task. Transport MH image is the longest training run — typically 24–48 hours depending on GPU memory (larger batch size = faster training).

Key Configuration Parameters

Parameter	Default	Notes
`train.num_epochs`	2000	2000 epochs × 100 gradient steps per epoch (`experiment.epoch_every_n_steps`) = 200k gradient steps; sufficient for low-dim tasks
`train.batch_size`	100	Increase to 256 or 512 for image tasks if VRAM allows
`algo.rnn.hidden_dim`	400	For BC-RNN; larger values help on complex tasks but slow training
`observation.encoder.rgb.core_kwargs.feature_dimension`	64	For image tasks; feature dimension from ResNet encoder
`experiment.rollout.n`	50	Number of evaluation rollouts — 50 is the standard; do not reduce below 20
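
The dotted parameter names in the table correspond to nested keys in the JSON config passed to train.py. A minimal sketch of applying overrides before writing a config back out, assuming that nesting; `set_by_path` is a hypothetical helper, and the starting dict is a cut-down illustration rather than a complete RoboMimic config:

```python
import json


def set_by_path(config, dotted_key, value):
    """Set config["a"]["b"]["c"] = value given the dotted key "a.b.c"."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # create intermediate levels if missing
    node[leaf] = value
    return config


# Cut-down illustration of a config; a generated file has many more keys.
config = {
    "train": {"num_epochs": 2000, "batch_size": 100},
    "experiment": {"rollout": {"n": 50}},
}

set_by_path(config, "train.batch_size", 256)
set_by_path(config, "algo.rnn.hidden_dim", 400)
print(json.dumps(config, indent=2))
```

In practice you would load a generated config with `json.load`, apply the overrides, and save it under a new name so the original stays available for comparison.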

Reference Benchmark Results

The following success rates are from the original RoboMimic paper (Mandlekar et al., 2021) plus subsequent community updates. These are the numbers your algorithm should be measured against.

Task / Split	BC	BC-RNN	HBC	IRIS
Lift PH (low-dim)	100%	100%	100%	100%
Lift MH (low-dim)	98%	100%	100%	100%
Can PH (low-dim)	92%	100%	100%	100%
Can MH (low-dim)	12%	48%	36%	74%
Square PH (low-dim)	70%	80%	100%	98%
Square MH (low-dim)	0%	2%	10%	50%
Transport PH (low-dim)	0%	22%	88%	98%
Transport MH (low-dim)	0%	6%	22%	58%

The most instructive comparison is Can MH: BC drops from 92% (PH) to 12% (MH), while IRIS drops only from 100% to 74%. This gap — the "MH degradation" — is the key benchmark signal. An algorithm that maintains performance on MH is learning to filter low-quality demonstrations and handle multimodal data, which directly predicts real-world performance where you cannot guarantee operator quality.
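
The MH degradation is simple arithmetic on the table, and it helps to put a rollout-noise estimate next to it. The sketch below copies the Can and Square rows from the table above and assumes each figure comes from 50 independent rollouts, so the binomial standard error applies; `mh_degradation` and `standard_error` are hypothetical helper names.

```python
import math

# Success rates (%) copied from the low-dim table in this guide.
RESULTS = {
    "Can": {
        "PH": {"BC": 92, "BC-RNN": 100, "HBC": 100, "IRIS": 100},
        "MH": {"BC": 12, "BC-RNN": 48, "HBC": 36, "IRIS": 74},
    },
    "Square": {
        "PH": {"BC": 70, "BC-RNN": 80, "HBC": 100, "IRIS": 98},
        "MH": {"BC": 0, "BC-RNN": 2, "HBC": 10, "IRIS": 50},
    },
}


def mh_degradation(task, algo):
    """Percentage points lost when moving from PH to MH data."""
    return RESULTS[task]["PH"][algo] - RESULTS[task]["MH"][algo]


def standard_error(success_pct, n_rollouts=50):
    """Binomial standard error of a success rate, in percentage points."""
    p = success_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_rollouts)


for algo in ("BC", "BC-RNN", "HBC", "IRIS"):
    drop = mh_degradation("Can", algo)
    se = standard_error(RESULTS["Can"]["MH"][algo])
    print(f"Can {algo}: drop {drop} points, MH standard error {se:.1f} points")
```

With 50 rollouts, a rate near 50% carries a standard error of about 7 points, so differences of a few points between algorithms are within noise.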

SVRC Hardware Compatibility

RoboMimic's simulation tasks use a Franka Panda arm (7-DOF) in robosuite. For physical robot transfer, SVRC supports RoboMimic-format data collection on the OpenArm Base and AgileX PiPER arm, both of which can be driven through the same 7-dimensional delta end-effector action space (a 6-DoF pose delta plus one gripper command) as the RoboMimic reference implementation.
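
When logging on a physical arm, recorded end-effector positions have to be turned into delta actions of this shape. The following is a simplified sketch under stated assumptions: it handles position only (rotation deltas are left at zero), it assumes actions are normalized to [-1, 1], and `max_delta=0.05` metres per step is an assumed scaling that must match your controller configuration. `position_deltas_to_actions` is a hypothetical helper.

```python
def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


def position_deltas_to_actions(eef_positions, gripper_cmds, max_delta=0.05):
    """Convert consecutive end-effector positions (metres) into 7-dim delta actions."""
    actions = []
    steps = zip(eef_positions, eef_positions[1:])
    for (prev, nxt), grip in zip(steps, gripper_cmds):
        dpos = [clip((b - a) / max_delta) for a, b in zip(prev, nxt)]
        drot = [0.0, 0.0, 0.0]  # rotation deltas omitted in this sketch
        actions.append(dpos + drot + [grip])
    return actions


# Three logged positions give two actions; the second move exceeds max_delta and is clipped.
positions = [[0.0, 0.0, 0.0], [0.025, 0.0, 0.0], [0.125, 0.0, 0.0]]
gripper = [-1.0, 1.0]
actions = position_deltas_to_actions(positions, gripper)
for a in actions:
    print(a)
```

Clipped steps mean the logged motion was faster than the action space can express at that control rate; such segments are worth flagging before training.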

The sim-to-real gap for RoboMimic-trained policies is significant: policies trained on simulated RoboMimic data transfer to physical robots at roughly 60–75% of their sim success rate, depending on the task. Precision insertion tasks (Square) transfer worst; pick-and-place tasks (Can, Lift) transfer best. SVRC's approach is to use RoboMimic sim experiments for algorithm selection and hyperparameter search, then collect physical demonstrations for the final deployment policy.

Collecting Custom RoboMimic-Compatible Datasets

If you want to benchmark your algorithm on a custom task (not one of the default tasks), RoboMimic's data collection framework lets you create new tasks in robosuite and collect demonstrations via SpaceMouse teleoperation. The HDF5 schema is documented, and you can also convert physical robot demonstrations to RoboMimic format.
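
Before converting physical demonstrations, it is worth checking that each trajectory is internally consistent with the HDF5 layout described earlier. A minimal sketch, assuming each demo holds `obs`, `actions`, `rewards`, and `dones` with a shared length T and 7-dimensional actions; `validate_demo` is a hypothetical helper that accepts nested dicts or an open h5py group:

```python
def validate_demo(demo, action_dim=7):
    """Return a list of problems for one demo; an empty list means it looks consistent."""
    problems = []
    n_steps = len(demo["actions"])
    for key in ("rewards", "dones"):
        if len(demo[key]) != n_steps:
            problems.append(f"{key}: {len(demo[key])} steps, expected {n_steps}")
    for name, values in demo["obs"].items():
        if len(values) != n_steps:
            problems.append(f"obs/{name}: {len(values)} steps, expected {n_steps}")
    if any(len(a) != action_dim for a in demo["actions"]):
        problems.append(f"actions: expected {action_dim} values per step")
    return problems


good_demo = {
    "obs": {"robot0_eef_pos": [[0.0, 0.0, 0.0]] * 4},
    "actions": [[0.0] * 7] * 4,
    "rewards": [0.0, 0.0, 0.0, 1.0],
    "dones": [0, 0, 0, 1],
}
bad_demo = dict(good_demo, rewards=[0.0, 1.0])  # reward log truncated

print(validate_demo(good_demo))
print(validate_demo(bad_demo))
```

A full conversion also needs the file-level metadata that the training scripts read, so treat this check as a first pass rather than a guarantee of compatibility.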

SVRC's data collection service supports custom RoboMimic task creation and demonstration collection. We have run custom RoboMimic benchmark tasks for pharmaceutical lab clients (precision liquid handling, sample sorting), electronics assembly clients (connector insertion, PCB handling), and academic labs (custom manipulation scenarios for algorithm development). Datasets are delivered in HDF5 format compatible with the RoboMimic training scripts with no format conversion required.

The standard SVRC dataset package for a custom RoboMimic task includes: 200 PH demonstrations from a single trained operator, 200 MH demonstrations from 3 operators of different skill levels (to enable the MH benchmark), and a custom robosuite environment definition that matches your physical task geometry. Turnaround is typically 2–3 weeks. Contact us to discuss your benchmark requirements.