BridgeData V2: Berkeley's Scalable Manipulation Corpus

A practitioner's guide to BridgeData V2 — the 60,096-trajectory WidowX dataset from UC Berkeley's RAIL Lab that powers Bridge-trained VLAs, Octo fine-tunes, and the OpenX mix.

TL;DR

| Metric | Value |
| --- | --- |
| Task count | 13 skills across ~100 task variants |
| Robots | WidowX 250 6DOF arm (consumer-grade, ~$6K) |
| Modalities | RGB-D (Intel RealSense), proprioception, gripper state, language instruction |
| License | CC BY 4.0 |
| Size | 60,096 trajectories, ~400 GB raw |
| Environments | 24 scenes including toy kitchens, sinks, and tabletops |

What is BridgeData V2?

BridgeData V2 is the successor to the original 2021 BridgeData release from UC Berkeley's Robotic AI & Learning Lab (RAIL). It was built by Homer Walke, Pranav Atreya, and co-authors under Sergey Levine to answer a deceptively simple question: can a single manipulation policy learn to follow natural language instructions across dozens of environments and skills, without any environment-specific tuning? To answer it, the RAIL team teleoperated a low-cost WidowX 250 6DOF arm across 24 scenes for 60,096 demonstrations, covering pick-and-place, pushing, sweeping, stacking, folding, drawer opening, and several other skills.

Two design decisions made BridgeData V2 disproportionately influential. First, every trajectory is annotated with a crowdsourced natural language instruction — not a fixed task ID — which means policies trained on Bridge can be conditioned on free-form text at inference time. Second, the choice of the $6,000 WidowX arm (rather than a Franka or UR5) kept the hardware cost low enough that dozens of follow-on labs could reproduce the setup, and the data has become a de facto standard for low-cost manipulation research.

BridgeData V2 is now one of the largest individual contributions to the Open X-Embodiment mix, and it is the single most popular fine-tuning target for papers that want to demonstrate real-world results without owning industrial hardware. OpenVLA, Octo, and several Diffusion Policy variants publish BridgeData V2 numbers as their canonical real-world result.

How to download & load

The dataset is distributed as NumPy shards from Berkeley and as RLDS on Google Cloud Storage via the Open X-Embodiment bucket:

```shell
# Raw NumPy shards (~400 GB)
wget https://rail.eecs.berkeley.edu/datasets/bridge_release/data/demos_8_17.zip
unzip demos_8_17.zip
```

```python
# Or stream the RLDS copy from OXE (recommended for VLA training)
# pip install tensorflow_datasets
import tensorflow_datasets as tfds

ds = tfds.load("bridge", data_dir="gs://gresearch/robotics", split="train")
for ep in ds.take(1):
    for step in ep["steps"]:
        print(step["observation"]["image_0"].shape,
              step["observation"]["state"].shape,
              step["action"].shape,
              step["language_instruction"])
```

```shell
# Install the reference training stack
git clone https://github.com/rail-berkeley/bridge_data_v2.git
cd bridge_data_v2 && pip install -e .
```

For fine-tuning OpenVLA or Octo, point the config at the `bridge_orig` dataset in the OpenX mix; the published results are reproducible on a single 8xA100 node.

Common use cases & model pairings

  • Low-cost VLA evaluation. BridgeData V2 is the default real-world fine-tune for OpenVLA and Octo — any new VLA must publish Bridge numbers to be taken seriously.
  • Language-conditioned policies. The free-form instruction annotations make it the go-to dataset for instruction-following research.
  • Generalization studies. Because the same skill is repeated across 24 environments, it is easy to hold out scenes and measure zero-shot scene generalization.
  • Diffusion Policy baselines. The ~4,000 episodes per skill map cleanly onto diffusion and transformer policies without curriculum tricks.

Benchmarks & leaderboards

BridgeData V2 evaluation is typically reported as in-distribution success rate on a held-out set of language instructions, plus out-of-distribution success on a set of novel object / novel scene combinations. OpenVLA reports roughly 70% in-distribution and 40-50% OOD success. See the Papers with Code BridgeData V2 entry and the RAIL Lab project page for the canonical eval protocol.
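The protocol above reduces to a per-split success rate over rollouts. A minimal sketch of the bookkeeping (the trial records here are invented for illustration; real evals use the protocol from the project page):

```python
from collections import defaultdict

# Hypothetical eval log: (split, success) per rollout
trials = [
    ("in_dist", True), ("in_dist", True), ("in_dist", False), ("in_dist", True),
    ("ood", True), ("ood", False), ("ood", False), ("ood", True),
]

counts = defaultdict(lambda: [0, 0])  # split -> [successes, total]
for split, ok in trials:
    counts[split][0] += int(ok)
    counts[split][1] += 1

for split, (succ, total) in counts.items():
    print(f"{split}: {succ / total:.0%} ({succ}/{total})")
# in_dist: 75% (3/4)
# ood: 50% (2/4)
```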

Technical deep dive: schema and action space

BridgeData V2 episodes are stored as per-trajectory directories, each containing a Python pickle of observations plus per-step RGB frames and a language instruction. Observations are recorded at 5 Hz (the WidowX teleoperation loop rate) and include two RGB streams — a fixed third-person scene camera and a wrist-mounted camera — along with Intel RealSense depth (on a subset of scenes), the WidowX 6-dimensional end-effector pose, and a binary gripper state.

Actions are 7-dimensional: Cartesian delta end-effector position (xyz), Cartesian delta orientation (rpy), and gripper open/close. Because the action frequency is only 5 Hz, BridgeData V2 policies are trained with action chunking of 4-8 steps, which aligns nicely with Diffusion Policy's standard chunk size. The low control rate is a blessing for data collection (teleoperators hit higher success rates at slow speeds) and a mild curse for dynamic tasks (the dataset is not suitable for anything requiring sub-100ms reactive control).
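The chunking described above can be sketched in a few lines. This is an illustrative helper, not part of the dataset tooling: given a trajectory's (T, 7) action array, build overlapping windows so the policy predicts the next k actions at each step, padding past the episode end by repeating the final action (a common convention in action-chunked imitation learning).

```python
import numpy as np

def chunk_actions(actions: np.ndarray, k: int = 4) -> np.ndarray:
    """Turn a (T, 7) action sequence into (T, k, 7) overlapping chunks.

    Row t holds actions[t : t + k]; windows that run past the end of the
    episode are padded by repeating the final action.
    """
    T, _ = actions.shape
    padded = np.concatenate([actions, np.repeat(actions[-1:], k - 1, axis=0)], axis=0)
    return np.stack([padded[t : t + k] for t in range(T)], axis=0)

# Toy trajectory: 6 steps of 7-D actions (xyz delta, rpy delta, gripper)
traj = np.arange(6 * 7, dtype=np.float32).reshape(6, 7)
chunks = chunk_actions(traj, k=4)
print(chunks.shape)  # (6, 4, 7)
```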

Language instructions are crowdsourced via Amazon Mechanical Turk. Workers watched each trajectory and wrote a natural-language description of the task, which means instructions include typos, synonyms, and paraphrases. This noise is actually a feature — it forces policies to handle the kind of messy language you see in a real deployment.

Known limitations

  • Single embodiment. All 60K trajectories are on a WidowX 250. Transferring Bridge-trained policies to a Franka or UR5 requires cross-embodiment fine-tuning.
  • Tabletop bias. Scenes are all roughly sink-, stove-, or desk-height tabletops. Floor manipulation, vertical surfaces, and mobile manipulation are out of distribution.
  • Low action rate. 5 Hz is fine for quasi-static manipulation but insufficient for dynamic tasks like pouring or catching.
  • Success labels are heuristic. The teleoperator's judgement was the ground truth for success; a small fraction of episodes labeled as successful are actually partial failures.

FAQ

Do I need a WidowX to use BridgeData V2? Only if you want to do closed-loop real-world evaluation. Many groups train on Bridge and evaluate in simulation or on a different arm via cross-embodiment transfer.

How does BridgeData V2 relate to BridgeData V1? V2 is roughly 4x larger, adds 12 new scenes, re-teleoperates most V1 tasks with a cleaner setup, and ships language annotations on every episode (V1 had them only on a subset).

Can I combine BridgeData V2 with my own WidowX data? Yes — the schema is well-documented, and several third-party training pipelines support drop-in mixing.
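A hedged sketch of the mixing idea (the episode dicts and helper name are hypothetical; real pipelines typically do this with weighted RLDS/tf.data sampling, but the principle is the same): draw each training episode from your own data with some fixed probability and from Bridge otherwise.

```python
import random

def mix_episodes(bridge_eps, own_eps, n, own_weight=0.3, seed=0):
    """Sample n episodes, drawing from own_eps with probability own_weight.

    Sampling is with replacement, mirroring the weighted mixing used by
    multi-dataset training pipelines.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        pool = own_eps if rng.random() < own_weight else bridge_eps
        out.append(rng.choice(pool))
    return out

# Toy episodes: dicts with a source tag standing in for full trajectories
bridge = [{"source": "bridge", "id": i} for i in range(5)]
mine = [{"source": "own", "id": i} for i in range(2)]
batch = mix_episodes(bridge, mine, n=10, own_weight=0.3)
print(sum(e["source"] == "own" for e in batch), "of", len(batch), "from own data")
```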

Related datasets

  • Open X-Embodiment — the larger mix that incorporates BridgeData V2
  • DROID — Franka-based real-world counterpart at similar scale
  • CALVIN — simulation benchmark for long-horizon language-conditioned manipulation
  • Robomimic — single-task imitation learning sibling
  • LIBERO — simulation benchmark for lifelong learning

Need more data? Order custom collection

BridgeData V2 nails WidowX-scale research, but production policies usually need coverage on your robot, your objects, and your scenes. Rent a WidowX or Franka at our Mountain View lab, or commission a Bridge-style custom collection.