BridgeData vs RoboMimic: Manipulation Dataset Showdown

BridgeData and RoboMimic show up in almost every imitation-learning paper, but they are not interchangeable. BridgeData V2 is a large real-world dataset on a low-cost WidowX robot, aimed at scaling up diverse real demos. RoboMimic is a curated simulated suite on Lift, Can, Square, Transport, and ToolHang tasks, designed to answer rigorous questions about proficient-human, multi-human, and machine-generated demonstration quality. This guide makes the tradeoffs explicit.

TL;DR

  • Real-world: BridgeData V2 — real WidowX demonstrations across 24 environments.
  • Simulation: RoboMimic — Robosuite tasks with clean reset conditions and PH / MH / MG splits.
  • Scale: BridgeData V2 is much larger in raw count (~60K trajectories).
  • Controlled study: RoboMimic wins — its PH / MH / MG split is still the canonical way to study demonstration heterogeneity.
  • Hardware accessible: BridgeData needs only a low-cost WidowX 250 to reproduce; RoboMimic needs no hardware at all (sim-only).

Comparison table

| Attribute | BridgeData V2 | RoboMimic |
| --- | --- | --- |
| Released | 2023 (UC Berkeley RAIL) | 2021 (ARISE Initiative, Stanford) |
| Domain | Real-world | Simulation (Robosuite) + optional real |
| Trajectories | ~60,000 | ~200 (PH) / ~300 (MH) per task, plus machine-generated (MG) rollouts |
| Tasks | 13 manipulation skill categories across 24 environments | Lift, Can, Square, Transport, ToolHang |
| Embodiment | WidowX 250 (6-DoF low-cost arm) | Sim: Franka Panda (Robosuite also supports Sawyer and others) |
| Language conditioning | Yes (natural language instructions) | Not required; tasks are defined by state |
| Demonstration splits | One combined split | PH (proficient-human), MH (multi-human), MG (machine-generated) |
| Approx. size on disk | ~400 GB | ~30-60 GB depending on modalities |
| License | CC-BY-4.0 | MIT |
| Primary use case | Training real-world VLA / language-conditioned policies | Controlled imitation-learning ablations and policy research |
| Paper | arXiv:2308.12952 | arXiv:2108.03298 |

What BridgeData V2 is good for

BridgeData V2 is the successor to the original Bridge dataset (Ebert et al. 2022). It was built on an explicit bet: that useful generalist policies are better trained on large real-world datasets with modest individual quality than on small, clean simulator datasets. The dataset contains roughly 60,000 trajectories — mostly teleoperated, with a minority collected by scripted policies — on a WidowX 250 arm across 24 distinct tabletop environments (kitchens, toy scenes, drawer scenes, sink scenes, and so on). Each trajectory carries a natural-language label describing the intended skill.

The 13 skill categories cover pick-and-place, pushing, folding, flipping, tool use, and other primitives. Because all trajectories share the same low-cost hardware, the action space and camera rig are consistent — which makes BridgeData one of the cleanest real corpora for training language-conditioned visuomotor policies. It has become a standard component in Open X-Embodiment training mixes, and it is frequently used as the "real-world anchor" when co-training with simulation data.
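To make the trajectory structure concrete, here is a minimal sketch of filtering BridgeData-style episodes by their language label. The dict schema below is illustrative only — the actual BridgeData V2 release ships as RLDS/TFDS shards with a richer schema, and the field names here are assumptions:

```python
# Sketch: filter BridgeData-style trajectories by language instruction.
# The dict layout is a hypothetical simplification of the real schema.

def filter_by_instruction(trajectories, keyword):
    """Keep trajectories whose language label mentions `keyword`."""
    return [t for t in trajectories if keyword in t["language_instruction"].lower()]

# Toy episodes mimicking the per-trajectory fields BridgeData carries:
# a natural-language label and 7-D actions (6-DoF pose delta + gripper).
trajs = [
    {"language_instruction": "Put the carrot in the sink", "actions": [[0.0] * 7]},
    {"language_instruction": "Open the drawer", "actions": [[0.0] * 7]},
    {"language_instruction": "Put the spoon in the drawer", "actions": [[0.0] * 7]},
]

drawer_trajs = filter_by_instruction(trajs, "drawer")
print(len(drawer_trajs))  # 2
```

In practice this kind of instruction-level filtering is how papers carve task-specific fine-tuning subsets out of the full corpus.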

What RoboMimic is good for

RoboMimic is a research framework first and a dataset second. Its central contribution was a careful empirical study: how does imitation-learning policy performance depend on the quality and heterogeneity of demonstrations? The project introduced three demonstration splits that subsequent IL papers routinely borrow. PH (proficient-human) contains clean demos from a single expert operator. MH (multi-human) contains demos from six operators with varying skill levels — the hard distribution. MG (machine-generated) contains demos rolled out by an RL policy, useful for studying learning from suboptimal data.

The splits cover five Robosuite tasks of increasing difficulty — Lift (trivial), Can (pick-and-place with pose variation), Square (square-nut assembly onto a peg), Transport (bimanual handoff), and ToolHang (long-horizon, high-precision) — though not uniformly: PH exists for all five, MH for all but ToolHang, and MG only for Lift and Can. The resulting benchmark is the reason we have a shared vocabulary for reasoning about BC, BC-RNN, BC-Transformer, IRIS, and diffusion-policy performance across demonstration regimes.
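The task/split availability can be encoded as a small lookup table. The split coverage below reflects the original RoboMimic study (lowercase task names are an assumed naming convention, not necessarily the release's file layout):

```python
# Which demonstration splits exist for which RoboMimic task, per the
# original study: PH covers all five tasks, MH skips ToolHang, and MG
# was generated only for Lift and Can.
SPLITS = {
    "lift":      ["ph", "mh", "mg"],
    "can":       ["ph", "mh", "mg"],
    "square":    ["ph", "mh"],
    "transport": ["ph", "mh"],
    "tool_hang": ["ph"],
}

def available(task, split):
    """Return True if the given task/split combination was released."""
    return split in SPLITS.get(task, [])

print(available("can", "mg"))        # True
print(available("tool_hang", "mg"))  # False
```

Checking availability up front avoids silently benchmarking on a split that does not exist for your task.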

Tradeoffs in 2026

Choose BridgeData if you are training a real-world language-conditioned policy and want a large real demonstration corpus on consistent hardware. It is also the right starting point if you are building a WidowX-based product or pilot — the policies fine-tune cleanly because the distribution matches your target embodiment. BridgeData is not the right tool for controlled ablation studies; the long tail of collection conditions is real-world-messy.

Choose RoboMimic if you want rigorous, repeatable comparisons between policy architectures or training recipes. The PH / MH / MG split is still the easiest way to demonstrate that a new method handles heterogeneous demos better than BC. It is also the right tool when your reviewer asks for a simulator benchmark and you do not want the noise of real-world variance.

Use both if you are building a generalist system: pretrain on OXE (which includes BridgeData), then run RoboMimic as a diagnostic for how your policy degrades under suboptimal demos, and finally fine-tune on target task data.

Hardware and infrastructure

BridgeData requires a WidowX 250 or compatible arm if you want to reproduce or extend the dataset. WidowX arms are low-cost (~$3K) and available through our store, which is part of why the dataset is so reproducible. RoboMimic requires no hardware — it runs entirely in Robosuite / MuJoCo. Both can be consumed through the LeRobot ecosystem.

If you are extending either dataset with your own demonstrations, our teleoperation data services can replicate the capture protocol. We have run BridgeData-style collection campaigns and RoboMimic-style simulated data generation for enterprise customers.

License reality

Both datasets are permissively licensed (CC-BY for BridgeData, MIT for RoboMimic). You can train commercial policies on either, distribute derived checkpoints, and build products. BridgeData requires attribution to the Berkeley RAIL lab; RoboMimic requires attribution to the ARISE Initiative. Check before shipping.

The PH / MH / MG story in detail

It is worth spending a paragraph on why RoboMimic's three-split design is so influential. Before RoboMimic, imitation-learning papers typically trained on a single homogeneous demonstration set and reported a single success rate. That told you almost nothing about how the method would behave in the real world, where demonstrations come from multiple operators with different skill levels and occasionally from partially trained policies. PH, MH, and MG respectively test the clean case, the heterogeneous human case, and the suboptimal case. Most modern imitation-learning methods — BC-RNN, BC-Transformer, IRIS, ACT, Diffusion Policy, VINN — have been characterized on these splits, which means researchers share a vocabulary about where methods succeed and where they break.

BridgeData does not try to do this. Its split is implicitly "whatever humans collected" — realistic, but not a controlled study. That is a feature, not a bug, but it is why you cannot substitute BridgeData for RoboMimic in an architecture ablation. The two datasets live at different points on the quality-vs-scale curve and answer different questions.

Limitations to be honest about

BridgeData's main limitation is that WidowX is a 6-DoF arm with a simple parallel gripper. Policies trained on BridgeData transfer poorly to 7-DoF arms with more capable grippers unless you retrain the action head. The sensing is also modest — standard RGB logged at a low control rate (~5 Hz).

RoboMimic's limitation is the sim-to-real gap. Policies trained purely on Robosuite demos do not transfer out of the box to real hardware. RoboMimic is best used for research questions about imitation learning, not as a pretraining corpus for deployment. For deployment, you want real data — which is where BridgeData and DROID shine.

Typical reported numbers

On RoboMimic, strong imitation-learning baselines from 2024-2025 (BC-Transformer, diffusion policy, ACT variants) typically achieve near-perfect success on Lift and Can PH, 80-95 percent on Square and Transport PH, and 40-70 percent on ToolHang PH. The MH split drops success by 10-25 absolute points across the board, and the MG split (available for Lift and Can) is dramatically harder still — learning from machine-generated rollouts remains where most methods struggle. Those numbers are diagnostic: they tell you which failure mode your method actually fixes.
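The PH-to-MH degradation is easy to quantify once per-split results are tabulated. The sketch below uses illustrative midpoints of the ranges quoted above, not measurements from any specific paper:

```python
# Sketch: compute absolute PH -> MH success-rate drops per task.
# These numbers are illustrative, chosen from the ranges quoted in
# the surrounding text, not results from any published evaluation.
ph = {"lift": 1.00, "can": 0.98, "square": 0.88, "transport": 0.85}
mh = {"lift": 0.95, "can": 0.90, "square": 0.70, "transport": 0.62}

drops = {task: round(ph[task] - mh[task], 2) for task in mh}
worst = max(drops, key=drops.get)
print(drops)
print(worst)  # the task with the largest heterogeneity penalty
```

Reporting this per-task delta, rather than a single averaged success rate, is exactly the kind of diagnostic the PH/MH design enables.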

On BridgeData, absolute numbers are harder to compare because the task distribution is real-world and the evaluation protocol varies by paper. What we can say is that policies pretrained on Open X-Embodiment (which includes BridgeData) and fine-tuned on task-specific BridgeData subsets routinely hit 50-70 percent success on in-distribution WidowX tasks. That is not a benchmark number in the traditional sense, but it is the metric that matters for deployment.

Co-training recipes

A popular recipe in 2025-2026 is BridgeData + RoboMimic co-training: use BridgeData as the real-world anchor, mix in RoboMimic PH for clean demonstration signal, and co-train a single policy. This works surprisingly well, even though the embodiments differ — the shared manipulation structure transfers. If you are building a research policy for a paper, co-training with a small RoboMimic mix is an easy way to stabilize training on noisy real data. It will not help deployment, but it helps benchmark numbers.
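The recipe above amounts to a weighted batch sampler over two corpora. Here is a minimal sketch; the 80/20 real-to-sim ratio and the placeholder dataset IDs are assumptions for illustration, not a published recipe:

```python
import random

# Sketch of a co-training batch sampler: each batch draws a fixed
# fraction of samples from the real corpus (BridgeData-style) and the
# remainder from the sim corpus (RoboMimic PH-style). The IDs below
# are placeholders; in practice these would be trajectory indices.

def mixed_batch(real, sim, batch_size, real_frac=0.8, rng=None):
    """Sample a shuffled batch with ~real_frac of items from `real`."""
    rng = rng or random.Random(0)
    n_real = int(batch_size * real_frac)
    batch = [rng.choice(real) for _ in range(n_real)]
    batch += [rng.choice(sim) for _ in range(batch_size - n_real)]
    rng.shuffle(batch)
    return batch

real_ids = [("bridge", i) for i in range(1000)]
sim_ids = [("robomimic_ph", i) for i in range(200)]
batch = mixed_batch(real_ids, sim_ids, batch_size=32)
print(sum(1 for src, _ in batch if src == "bridge"))  # 25 of 32
```

Keeping the mixing ratio a single tunable parameter makes it easy to ablate how much clean sim signal actually helps on your real-data task.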

Our recommendation

For a modern robot-learning stack in 2026, we recommend using RoboMimic as a diagnostic benchmark and BridgeData as a training corpus. Run RoboMimic PH/MH/MG to characterize your method, then train or fine-tune on BridgeData (plus OXE and your own data) for deployment. Neither should be used alone. If you need more data than BridgeData provides for your specific scenario, our teleoperation pipelines can collect additional WidowX or Franka demos in the same format so they mix cleanly with your existing training set.

Related resources

Collecting your own BridgeData-style corpus?

We stock WidowX arms and run teleoperation campaigns that match the BridgeData protocol.