RoboMimic: Complete Guide to the Robot Imitation Learning Benchmark (2026)
RoboMimic is a widely used benchmark for evaluating imitation learning and offline RL algorithms on robotic manipulation tasks. If you are developing or evaluating a new policy learning method, reporting results on RoboMimic makes them directly comparable with published baselines. This guide covers the datasets, algorithms, how to run experiments, and how to collect new RoboMimic-compatible data using SVRC hardware.
What Is RoboMimic?
RoboMimic is a framework and benchmark for robotic learning from demonstrations, developed at Stanford University by Ajay Mandlekar et al. and published at CoRL 2021. It provides three things in one package: a curated collection of demonstration datasets with varying operator quality, a set of imitation learning and offline RL algorithm implementations, and a standardized evaluation protocol that makes cross-paper comparisons meaningful.
Before RoboMimic, comparing imitation learning papers was difficult because every group collected their own proprietary datasets with different operators, different quality standards, and different evaluation protocols. A paper claiming 80% success rate might be using easy demonstrations from experts, while another claiming 60% used messy demonstrations from novices — the numbers were not comparable. RoboMimic solved this by providing a fixed public dataset that everyone trains on, making benchmark numbers directly comparable across papers and implementations.
RoboMimic uses the robosuite simulation environment (built on MuJoCo) for evaluation, which means you can reproduce results exactly without physical hardware — though the framework supports physical robot transfer as well. SVRC uses RoboMimic as the primary benchmark for evaluating algorithm performance before deploying policies to physical hardware.
The RoboMimic Datasets
This guide covers four RoboMimic simulation tasks, each with multiple dataset variants collected from operators of different skill levels. This multi-quality structure is the benchmark's most important design decision: it lets you test how robust your algorithm is to dataset quality, which matters enormously in practice because real-world data collection produces mixed-quality demonstrations.
Tasks
| Task | Description | Difficulty | Demo Count |
|---|---|---|---|
| Lift | Pick a small cube off a table and raise it above a height threshold | Easy | 200 (PH), 300 (MH) |
| Can | Pick a soda can from a bin and place it in a designated target region | Medium | 200 (PH), 300 (MH) |
| Square | Pick a square nut and fit it onto a peg; requires precise insertion alignment | Hard | 200 (PH), 300 (MH) |
| Transport | Two-arm task: pick a hammer from a closed bin, transfer between arms, place in target bin | Very Hard | 200 (PH), 300 (MH) |
Dataset Variants
Each task has four dataset variants that reflect different levels of operator consistency:
- PH (Proficient-Human): Demonstrations from a single highly skilled operator who completes every trial successfully with consistent, smooth motions. This is the easiest dataset to learn from — low variance, high quality.
- MH (Multi-Human): Demonstrations from 6 operators of varying skill levels: 2 "worse" operators, 2 "okay" operators, and 2 "better" operators, each contributing 50 demonstrations. The operators use different strategies and have different success rates. This is the most realistic dataset, because real-world collections always have mixed quality.
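- MH-subsets: The worse/okay/better subsets of MH, selectable on their own so you can study how quality filters affect performance. In the released files these are typically exposed as filter keys inside the MH dataset rather than as separate downloads.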
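- Image variants: The same human demonstrations stored with rendered camera image observations rather than state-based observations only, for evaluating visual policies. In dataset names this appears as the image suffix (as opposed to low_dim), not as a separate operator split.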
For most research purposes, the MH dataset is the most informative benchmark because it tests both the algorithm's ability to filter noisy demonstrations and its ability to learn from multimodal data. Algorithms that score high on PH but much lower on MH have an overfitting problem with clean data — they do not generalize to real-world collection conditions.
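To make the quality subsets concrete, here is a minimal sketch of selecting demonstrations by subset. It assumes the subsets are stored as lists of demo names under a `mask` group next to a `data` group; the key names (`better`, `worse`) and the toy contents are illustrative, so check the keys in your own file. The function accepts any nested mapping, which lets a plain dict stand in for an opened HDF5 file.

```python
def demos_for_filter_key(dataset, filter_key):
    """Return the demo names listed under mask/<filter_key>.

    `dataset` is any nested mapping with the assumed layout
    {"mask": {key: [demo names]}, "data": {demo name: ...}}.
    HDF5 readers usually return names as bytes, so decode them.
    """
    names = []
    for raw in dataset["mask"][filter_key]:
        name = raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)
        if name not in dataset["data"]:
            raise KeyError(f"{name} is in mask/{filter_key} but missing from data")
        names.append(name)
    return names


# Stand-in for an opened MH file (hypothetical contents).
toy = {
    "data": {"demo_0": {}, "demo_1": {}, "demo_2": {}},
    "mask": {"better": [b"demo_0", b"demo_2"], "worse": [b"demo_1"]},
}
better = demos_for_filter_key(toy, "better")
print(better)
```

Training on `better` only versus the full MH set is one simple way to measure how much an algorithm depends on quality filtering.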
File Format and Download
RoboMimic datasets are stored in HDF5 format with a standard schema. Each file is identified by task, operator split, and observation type, for example lift_ph_low_dim or transport_mh_image.
Download the datasets from the official RoboMimic website at https://robomimic.github.io/docs/datasets/robomimic_v0.1.html. Files range from 50 MB (lift_ph_low_dim) to 14 GB (transport_mh_image). The low-dim variants (state observations only) train much faster and are sufficient for algorithm development; the image variants are required for visual policy evaluation.
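Before training, it helps to sanity-check a downloaded file. The sketch below counts demonstrations and timesteps, assuming each demo sits under a `data` group and holds an `actions` array with one row per timestep. The function name and the toy data are hypothetical, and a plain dict is used here in place of a real file.

```python
def summarize_demos(data_group):
    """Count demos and transitions in a mapping of demo name -> demo.

    Each demo is assumed to hold an "actions" sequence with one row per
    timestep, which is how the per-demo length is measured here.
    """
    lengths = {name: len(demo["actions"]) for name, demo in data_group.items()}
    total = sum(lengths.values())
    return {
        "num_demos": len(lengths),
        "total_steps": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "shortest": min(lengths, key=lengths.get) if lengths else None,
    }


# Two toy demos with 7-value actions (hypothetical lengths).
toy_data = {
    "demo_0": {"actions": [[0.0] * 7] * 48},
    "demo_1": {"actions": [[0.0] * 7] * 52},
}
summary = summarize_demos(toy_data)
print(summary)
```

With h5py installed, passing `h5py.File(path, "r")["data"]` to the same function should work on a real file, provided the file follows the assumed layout.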
Algorithms: BC, BC-RNN, HBC, and IRIS
RoboMimic implements four algorithms as reference baselines. Understanding what each does and when each works is essential before choosing which to train.
BC (Behavioral Cloning)
The simplest baseline: a feedforward MLP policy that minimizes MSE between predicted and demonstrated actions. Observation: current robot state (and/or image). Action: delta end-effector pose. No temporal context beyond the current frame. BC performs reasonably on Lift (easy task, low multimodality) but degrades significantly on Can and Square because it has no memory of recent trajectory history and cannot handle demonstration variability.
When it works: Tasks with low variance demonstrations, short horizon (single grasp), no multi-step coordination requirements.
When it fails: Any task where the robot needs to remember what it just did (common in multi-step tasks), or where demonstrations vary in strategy.
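The behavior described above can be reproduced in a toy setting. This sketch is not RoboMimic's BC implementation: it fits a one-dimensional linear policy by gradient descent on mean squared error, with hypothetical data. The second call shows the failure case, where two operators choose opposite actions for the same observation and the MSE-optimal policy averages them.

```python
def train_linear_bc(observations, actions, lr=0.1, epochs=200):
    """Fit action = w * obs + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(observations)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for obs, act in zip(observations, actions):
            err = (w * obs + b) - act
            grad_w += 2.0 * err * obs / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Consistent demonstrations: a hypothetical expert with action = 0.5 * obs - 0.2.
obs = [i / 10.0 for i in range(-10, 11)]
acts = [0.5 * o - 0.2 for o in obs]
w, b = train_linear_bc(obs, acts)

# Conflicting demonstrations: same observation, opposite actions (+1 and -1).
w_mixed, b_mixed = train_linear_bc([0.0, 0.0], [1.0, -1.0])
averaged_action = w_mixed * 0.0 + b_mixed  # lands between the two modes
```

The averaged action matches neither operator, which is the same mechanism that hurts plain BC on multi-strategy data.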
BC-RNN (Behavioral Cloning with Recurrent Networks)
BC with an LSTM backbone instead of an MLP. The LSTM maintains a hidden state across timesteps, giving the policy temporal memory. This significantly improves performance on multi-step tasks like Can and Transport where knowing the history of arm movements is essential for deciding what to do next. BC-RNN with 2-layer LSTM (400 hidden units per layer) is the strongest standard BC baseline and the most commonly reported number in imitation learning papers.
When it works: Multi-step tasks, tasks with temporal dependencies, tasks where the robot needs to remember previous contact states.
When it fails: Tasks with high multimodality (multiple valid strategies) — the LSTM averages across strategies in its hidden state, leading to mode-averaging artifacts on the MH dataset.
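A recurrent policy is trained on short sequences rather than single frames. The sketch below shows one way to cut a demonstration into fixed-length windows. The window length and the padding rule (repeating the final step) are assumptions for illustration, not RoboMimic's exact sampler.

```python
def sequence_windows(trajectory, seq_length):
    """Split one demo into overlapping windows of `seq_length` steps.

    Each window is a sequence a recurrent policy could be unrolled over
    during training. Demos shorter than seq_length are padded by
    repeating the final step (one common choice, assumed here).
    """
    if seq_length < 1:
        raise ValueError("seq_length must be at least 1")
    if not trajectory:
        return []
    padded = list(trajectory)
    while len(padded) < seq_length:
        padded.append(padded[-1])
    last_start = len(padded) - seq_length
    return [padded[i:i + seq_length] for i in range(last_start + 1)]


# A five-step toy demo of (observation, action) pairs.
steps = [(f"obs_{t}", f"act_{t}") for t in range(5)]
windows = sequence_windows(steps, seq_length=3)
```

At evaluation time the hidden state is carried across timesteps instead, so the policy sees the whole episode history up to the current step.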
HBC (Hierarchical Behavioral Cloning)
A two-level hierarchical policy: a high-level "planner" predicts subgoal states (e.g., intermediate configurations of the robot), and a low-level "controller" learns to reach each subgoal. The hierarchy provides an inductive structure that helps on long-horizon tasks — the planner breaks the task into stages, and the controller handles the fine-grained motor control within each stage. HBC typically outperforms BC-RNN on Transport and Square (long-horizon tasks) but requires more careful hyperparameter tuning, particularly the subgoal horizon length.
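The planner/controller split can be illustrated with a toy one-dimensional reaching task. Everything in this sketch is hypothetical (hand-written planner and controller, no learning); it only shows the control flow, with the planner re-proposing a subgoal every `subgoal_horizon` steps and the controller taking bounded steps toward it.

```python
def planner(position, goal, subgoal_horizon, max_step):
    """High level: propose a subgoal reachable within `subgoal_horizon` steps."""
    reach = subgoal_horizon * max_step
    delta = max(-reach, min(reach, goal - position))
    return position + delta


def controller(position, subgoal, max_step):
    """Low level: one bounded step toward the current subgoal."""
    return max(-max_step, min(max_step, subgoal - position))


def run_hierarchical(start, goal, subgoal_horizon=5, max_step=0.1, max_steps=100):
    """Roll out the two-level policy and count how often the planner ran."""
    position, subgoal, replans = start, start, 0
    for t in range(max_steps):
        if abs(goal - position) < 1e-6:
            break
        if t % subgoal_horizon == 0:
            subgoal = planner(position, goal, subgoal_horizon, max_step)
            replans += 1
        position += controller(position, subgoal, max_step)
    return position, replans


final_position, replans = run_hierarchical(start=0.0, goal=2.0)
```

A subgoal horizon that is too short makes the planner do the controller's job; one that is too long asks the controller to reach targets it rarely saw in the data, which is why this parameter needs tuning.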
IRIS (Implicit Reinforcement without Interaction at Scale)
IRIS extends HBC with improved subgoal representation learning. Instead of predicting raw robot state subgoals, IRIS learns a latent representation of workspace states that is more abstract and easier for the planner to work with. It uses a VAE to learn the subgoal latent space from the demonstration data. IRIS achieves the strongest results of the four reference algorithms on the full RoboMimic benchmark, particularly on MH (mixed human quality) datasets where its subgoal structure provides robustness to demonstration variability.
Adding External Algorithms to RoboMimic
RoboMimic's framework makes it straightforward to add your own algorithm as a new "algo" class. Diffusion policy, ACT, and most recent imitation learning methods have been adapted to use the RoboMimic data format and evaluation protocol. The key integration requirement is implementing the RolloutPolicy interface — a get_action(obs) method that takes a standardized observation dict and returns a delta end-effector action. SVRC maintains RoboMimic wrappers for both diffusion policy and ACT in our internal platform.
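As a rough illustration of that integration point, the sketch below wraps an external model behind a `get_action(obs)` method. The class name, constructor arguments, and observation keys are hypothetical and this is not RoboMimic's own RolloutPolicy class; it only shows the shape of the adapter, namely flattening an observation dict in a fixed key order and returning a clipped 7-value action.

```python
class ExternalPolicyAdapter:
    """Hypothetical adapter exposing get_action(obs) for an external model.

    `model_fn` is any callable mapping a flat feature list to an action;
    `obs_keys` fixes the order in which dict entries are flattened.
    """

    def __init__(self, model_fn, obs_keys, action_dim=7, clip=1.0):
        self.model_fn = model_fn
        self.obs_keys = list(obs_keys)
        self.action_dim = action_dim
        self.clip = clip

    def get_action(self, obs):
        features = []
        for key in self.obs_keys:
            features.extend(float(x) for x in obs[key])
        action = [float(a) for a in self.model_fn(features)]
        if len(action) != self.action_dim:
            raise ValueError(f"expected {self.action_dim} values, got {len(action)}")
        # Keep every component inside the normalized action range.
        return [max(-self.clip, min(self.clip, a)) for a in action]


def toy_model(features):
    """Stand-in model: scaled position command, no rotation, gripper closed."""
    return [2.0 * f for f in features[:3]] + [0.0, 0.0, 0.0] + [-1.0]


policy = ExternalPolicyAdapter(toy_model, obs_keys=["robot0_eef_pos", "object"])
action = policy.get_action({"robot0_eef_pos": [0.1, 0.2, 0.9], "object": [0.5]})
```

Fixing the key order in the adapter matters because the external model sees a flat vector and has no other way to know which entry is which.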
How to Run RoboMimic Experiments
Training BC-RNN on Lift PH low-dim takes approximately 30 minutes on a single RTX 3090. Training on image observations takes 4–6 hours for the same task. Transport MH image is the longest training run, typically 24–48 hours depending on the GPU: more memory allows a larger batch size, which shortens training.
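A training run is usually driven by a JSON config passed to the training script. The sketch below expands dotted parameter names into nested JSON and assembles a launch command. The dataset path is a placeholder, and both the key names and the assumption that a partial config is accepted should be checked against a template config from your RoboMimic version. The command is only printed, not executed.

```python
import json
import os
import tempfile


def build_config(overrides):
    """Expand dotted keys such as "train.batch_size" into a nested dict."""
    config = {}
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


def train_command(config_path, script="robomimic/scripts/train.py"):
    """Return the argv list for launching training with a config file."""
    return ["python", script, "--config", config_path]


overrides = {
    "algo_name": "bc",
    "train.data": "datasets/lift/ph/low_dim.hdf5",  # placeholder path
    "train.num_epochs": 2000,
    "train.batch_size": 100,
    "algo.rnn.enabled": True,  # assumed switch for the recurrent variant
    "algo.rnn.hidden_dim": 400,
    "experiment.rollout.n": 50,
}
config = build_config(overrides)
config_path = os.path.join(tempfile.mkdtemp(), "bc_rnn_lift_ph.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

command = train_command(config_path)
print(" ".join(command))
```

Keeping overrides as a flat dict makes hyperparameter sweeps easy to express: each sweep point is one small dict merged into the base overrides.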
Key Configuration Parameters
| Parameter | Default | Notes |
|---|---|---|
| train.num_epochs | 2000 | Gradient steps = epochs × steps per epoch (not demos × epochs); at the default 100 steps per epoch (experiment.epoch_every_n_steps), 2000 epochs is ~200k gradient steps; sufficient for low-dim tasks |
| train.batch_size | 100 | Increase to 256 or 512 for image tasks if VRAM allows |
| algo.rnn.hidden_dim | 400 | For BC-RNN; larger values help on complex tasks but slow training |
| observation.encoder.core_kwargs.feature_dim | 64 | For image tasks; feature dimension from ResNet encoder |
| experiment.rollout.n | 50 | Number of evaluation rollouts; 50 is the standard, do not reduce below 20 |
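The rollout count matters because a success rate measured over a few episodes is a noisy estimate. As an illustration, the sketch below computes a Wilson score interval for a binomial success rate. The interval method is a choice made here rather than part of the RoboMimic protocol, and the success counts are hypothetical.

```python
import math


def success_rate_interval(successes, rollouts, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for a success rate."""
    if rollouts <= 0 or not 0 <= successes <= rollouts:
        raise ValueError("need 0 <= successes <= rollouts and rollouts > 0")
    p = successes / rollouts
    denom = 1.0 + z * z / rollouts
    center = (p + z * z / (2.0 * rollouts)) / denom
    spread = math.sqrt(p * (1.0 - p) / rollouts + z * z / (4.0 * rollouts ** 2))
    half_width = z * spread / denom
    return center - half_width, center + half_width


# Similar observed success rates at two rollout budgets.
low_50, high_50 = success_rate_interval(37, 50)
low_20, high_20 = success_rate_interval(15, 20)
print(f"50 rollouts: [{low_50:.2f}, {high_50:.2f}]")
print(f"20 rollouts: [{low_20:.2f}, {high_20:.2f}]")
```

The interval at 20 rollouts is noticeably wider than at 50, so small differences between algorithms are hard to trust at low rollout counts.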
Reference Benchmark Results
The following success rates are from the original RoboMimic paper (Mandlekar et al., 2021) plus subsequent community updates. These are the numbers your algorithm should be measured against.
| Task / Split | BC | BC-RNN | HBC | IRIS |
|---|---|---|---|---|
| Lift PH (low-dim) | 100% | 100% | 100% | 100% |
| Lift MH (low-dim) | 98% | 100% | 100% | 100% |
| Can PH (low-dim) | 92% | 100% | 100% | 100% |
| Can MH (low-dim) | 12% | 48% | 36% | 74% |
| Square PH (low-dim) | 70% | 80% | 100% | 98% |
| Square MH (low-dim) | 0% | 2% | 10% | 50% |
| Transport PH (low-dim) | 0% | 22% | 88% | 98% |
| Transport MH (low-dim) | 0% | 6% | 22% | 58% |
The most instructive comparison is Can MH: BC drops from 92% (PH) to 12% (MH), while IRIS drops only from 100% to 74%. This gap — the "MH degradation" — is the key benchmark signal. An algorithm that maintains performance on MH is learning to filter low-quality demonstrations and handle multimodal data, which directly predicts real-world performance where you cannot guarantee operator quality.
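The MH degradation can be computed directly from the table. The sketch below uses the Can (low-dim) row values listed above and reports the drop in percentage points for each algorithm; the function and variable names are illustrative.

```python
# (PH success %, MH success %) for Can low-dim, copied from the table above.
CAN_LOW_DIM = {
    "BC": (92, 12),
    "BC-RNN": (100, 48),
    "HBC": (100, 36),
    "IRIS": (100, 74),
}


def mh_degradation(results):
    """Return the PH-to-MH drop in percentage points for each algorithm."""
    return {algo: ph - mh for algo, (ph, mh) in results.items()}


drops = mh_degradation(CAN_LOW_DIM)
most_robust = min(drops, key=drops.get)
for algo, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{algo}: {drop} points")
```

Reporting the drop alongside the raw MH number separates an algorithm that is weak everywhere from one that is specifically sensitive to mixed-quality data.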
SVRC Hardware Compatibility
RoboMimic's simulation tasks use a 7-DOF Franka Panda arm in robosuite. For physical robot transfer, SVRC supports RoboMimic-format data collection on the OpenArm Base and AgileX PiPER arm, both of which can be driven with the same 7-dimensional delta end-effector action (a 6-DoF pose delta plus a gripper command) as the RoboMimic reference implementation.
The sim-to-real gap in RoboMimic is well understood: policies trained on simulated RoboMimic data transfer to physical robots at roughly 60–75% of their sim success rate, depending on the task. Precision insertion tasks (Square) transfer worst; pick-and-place tasks (Can, Lift) transfer best. SVRC's approach is to use RoboMimic sim experiments for algorithm selection and hyperparameter search, then collect physical demonstrations for the final deployment policy.
Collecting Custom RoboMimic-Compatible Datasets
If you want to benchmark your algorithm on a custom task (not one of the four default tasks), RoboMimic's data collection framework lets you create new tasks in robosuite and collect demonstrations via SpaceMouse teleoperation. The HDF5 schema is documented, and you can also convert physical robot demonstrations to RoboMimic format.
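When converting physical demonstrations, most format problems come from arrays whose lengths disagree. The sketch below is a hypothetical pre-write check. It assumes each demo is a dict with `actions`, an `obs` dict, and optional `rewards` and `dones`, all with one entry per timestep; the field names should be adapted to the schema you target.

```python
def validate_demo(demo, action_dim=7):
    """Return a list of problems; an empty list means the demo looks consistent."""
    problems = []
    actions = demo.get("actions", [])
    steps = len(actions)
    if steps == 0:
        problems.append("demo has no actions")
    for t, action in enumerate(actions):
        if len(action) != action_dim:
            problems.append(f"action {t} has {len(action)} values, expected {action_dim}")
            break  # one report is enough to flag the demo
    for key, values in demo.get("obs", {}).items():
        if len(values) != steps:
            problems.append(f"obs/{key} has {len(values)} steps, expected {steps}")
    for key in ("rewards", "dones"):
        if key in demo and len(demo[key]) != steps:
            problems.append(f"{key} has {len(demo[key])} steps, expected {steps}")
    return problems


# Toy demos with hypothetical observation keys.
good_demo = {
    "actions": [[0.0] * 7 for _ in range(3)],
    "obs": {"eef_pos": [[0.0, 0.0, 0.0]] * 3},
    "rewards": [0.0, 0.0, 1.0],
    "dones": [0, 0, 1],
}
bad_demo = {"actions": [[0.0] * 6 for _ in range(3)], "obs": {"eef_pos": [[0.0] * 3] * 2}}

good_report = validate_demo(good_demo)
bad_report = validate_demo(bad_demo)
```

Running a check like this over every demo before writing the HDF5 file catches dropped frames and mismatched action dimensions early, when they are still cheap to fix.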
SVRC's data collection service supports custom RoboMimic task creation and demonstration collection. We have run custom RoboMimic benchmark tasks for pharmaceutical lab clients (precision liquid handling, sample sorting), electronics assembly clients (connector insertion, PCB handling), and academic labs (custom manipulation scenarios for algorithm development). Datasets are delivered in HDF5 format compatible with the RoboMimic training scripts with no format conversion required.
The standard SVRC dataset package for a custom RoboMimic task includes: 200 PH demonstrations from a single trained operator, 200 MH demonstrations from 3 operators of different skill levels (to enable the MH benchmark), and a custom robosuite environment definition that matches your physical task geometry. Turnaround is typically 2–3 weeks. Contact us to discuss your benchmark requirements.