BridgeData V2: Berkeley's Scalable Manipulation Corpus
A practitioner's guide to BridgeData V2 — the 60,096-trajectory WidowX dataset from UC Berkeley's RAIL Lab that powers Bridge-trained VLAs, Octo fine-tunes, and the OpenX mix.
TL;DR
| Metric | Value |
|---|---|
| Tasks | 13 skills across ~100 task variants |
| Robots | WidowX 250 6DOF arm (consumer-grade, ~$6K) |
| Modalities | RGB-D (Intel RealSense), proprioception, gripper, language instruction |
| License | CC-BY 4.0 |
| Size | 60,096 trajectories, ~400 GB raw |
| Environments | 24 scenes including toy kitchens, sinks, and tabletops |
What is BridgeData V2?
BridgeData V2 is the successor to the original 2021 BridgeData release from UC Berkeley's Robotic AI & Learning Lab (RAIL). It was built by Homer Walke, Pranav Atreya, and co-authors under Sergey Levine to answer a deceptively simple question: can a single manipulation policy learn to follow natural language instructions across dozens of environments and skills, without any environment-specific tuning? To answer it, the RAIL team teleoperated a low-cost WidowX 250 6DOF arm across 24 scenes for 60,096 demonstrations, covering pick-and-place, pushing, sweeping, stacking, folding, drawer opening, and several other skills.
Two design decisions made BridgeData V2 disproportionately influential. First, every trajectory is annotated with a crowdsourced natural language instruction — not a fixed task ID — which means policies trained on Bridge can be conditioned on free-form text at inference time. Second, the choice of the $6,000 WidowX arm (rather than a Franka or UR5) kept the hardware cost low enough that dozens of follow-on labs could reproduce the setup, and the data has become a de facto standard for low-cost manipulation research.
BridgeData V2 is now one of the largest individual contributions to the Open X-Embodiment mix, and it is the single most popular fine-tuning target for papers that want to demonstrate real-world results without owning industrial hardware. OpenVLA, Octo, and several Diffusion Policy variants publish BridgeData V2 numbers as their canonical real-world result.
How to download & load
The dataset is distributed as NumPy shards from Berkeley and as RLDS on Google Cloud Storage via the Open X-Embodiment bucket:
```shell
# Raw NumPy shards (~400 GB)
wget https://rail.eecs.berkeley.edu/datasets/bridge_release/data/demos_8_17.zip
unzip demos_8_17.zip

# Or stream the RLDS copy from OXE (recommended for VLA training)
pip install tensorflow_datasets
```

```python
import tensorflow_datasets as tfds

# Stream episodes directly from the Open X-Embodiment GCS bucket
ds = tfds.load("bridge", data_dir="gs://gresearch/robotics", split="train")
for ep in ds.take(1):
    for step in ep["steps"]:
        print(step["observation"]["image_0"].shape,
              step["observation"]["state"].shape,
              step["action"].shape,
              step["language_instruction"])
```

```shell
# Install the reference training stack
git clone https://github.com/rail-berkeley/bridge_data_v2.git
cd bridge_data_v2 && pip install -e .
```
To fine-tune OpenVLA or Octo, point the training config at the bridge_orig dataset in the OpenX mix; the published results are reproducible on a single 8xA100 node.
Common use cases & model pairings
- Low-cost VLA evaluation. BridgeData V2 is the default real-world fine-tune for OpenVLA and Octo — any new VLA must publish Bridge numbers to be taken seriously.
- Language-conditioned policies. The free-form instruction annotations make it the go-to dataset for instruction-following research.
- Generalization studies. Because the same skill is repeated across 24 environments, it is easy to hold out scenes and measure zero-shot scene generalization.
- Diffusion Policy baselines. The ~4,000 episodes per skill map cleanly onto diffusion and transformer policies without curriculum tricks.
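The scene-holdout protocol used for generalization studies can be sketched as a split by scene identifier. The snippet below is a minimal illustration with a hypothetical `scene_id` metadata field — the real key depends on your loader, so adapt it to your pipeline:

```python
import random

def scene_holdout_split(episodes, holdout_frac=0.2, seed=0):
    """Split episodes by scene so held-out scenes never appear in training.

    Assumes each episode is a dict with a "scene_id" key (hypothetical
    field name -- check the actual metadata key in your loader).
    """
    scenes = sorted({ep["scene_id"] for ep in episodes})
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_holdout = max(1, int(len(scenes) * holdout_frac))
    holdout = set(scenes[:n_holdout])
    train = [ep for ep in episodes if ep["scene_id"] not in holdout]
    test = [ep for ep in episodes if ep["scene_id"] in holdout]
    return train, test

# Toy example: 20 episodes spread across 4 scenes
eps = [{"scene_id": f"scene_{i % 4}", "idx": i} for i in range(20)]
train_eps, test_eps = scene_holdout_split(eps)
```

Splitting by scene (rather than by episode) is what makes the evaluation a zero-shot scene-generalization test: every held-out episode comes from a scene the policy has never seen.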
Benchmarks & leaderboards
BridgeData V2 evaluation is typically reported as in-distribution success rate on a held-out set of language instructions, plus out-of-distribution success on a set of novel object / novel scene combinations. OpenVLA reports roughly 70% in-distribution and 40-50% OOD success. See the Papers with Code BridgeData V2 entry and the RAIL Lab project page for the canonical eval protocol.
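Because real-robot evals typically run only 10-25 trials per task, raw success percentages carry wide error bars. One common way to quantify this (not part of the canonical protocol, just standard statistics) is a Wilson score interval over the binary outcomes:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binary success rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))

# e.g. 14/20 in-distribution successes -> roughly a 0.48-0.85 interval
lo, hi = wilson_interval(14, 20)
```

At 20 trials, a nominal 70% success rate is statistically indistinguishable from anything between roughly 50% and 85% — worth keeping in mind when comparing leaderboard numbers a few points apart.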
Technical deep dive: schema and action space
BridgeData V2 episodes are stored as per-trajectory directories, each containing a Python pickle of observations plus per-step RGB frames and a language instruction. Observations are recorded at 5 Hz (the WidowX teleoperation loop rate) and include two RGB streams — a fixed over-the-shoulder scene camera and a wrist-mounted camera — along with Intel RealSense depth (on a subset of scenes), the WidowX 6-dimensional end-effector pose, and a binary gripper state.
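The per-trajectory layout can be read with plain `pickle`. Below is a minimal sketch that assumes each directory holds an `obs_dict.pkl` of stacked per-step arrays and a `lang.txt` instruction — verify these file names against your actual download before relying on them. A synthetic stand-in trajectory keeps the example self-contained:

```python
import os
import pickle
import tempfile

import numpy as np

def load_trajectory(traj_dir):
    """Load one raw BridgeData-style trajectory directory.

    Assumed layout (check against the release): obs_dict.pkl holds a
    dict of per-step arrays, lang.txt holds one free-form instruction.
    """
    with open(os.path.join(traj_dir, "obs_dict.pkl"), "rb") as f:
        obs = pickle.load(f)
    with open(os.path.join(traj_dir, "lang.txt")) as f:
        instruction = f.read().strip()
    return obs, instruction

# Synthetic stand-in trajectory (30 steps, 6-D pose + gripper state)
tmp = tempfile.mkdtemp()
fake_obs = {"state": np.zeros((30, 7)), "gripper": np.ones((30, 1))}
with open(os.path.join(tmp, "obs_dict.pkl"), "wb") as f:
    pickle.dump(fake_obs, f)
with open(os.path.join(tmp, "lang.txt"), "w") as f:
    f.write("put the spoon in the sink\n")

obs, instr = load_trajectory(tmp)
```

For training at scale you will almost always prefer the RLDS copy, but the raw pickles are handy for spot-checking individual trajectories.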
Actions are 7-dimensional: Cartesian delta end-effector position (xyz), Cartesian delta orientation (rpy), and gripper open/close. Because the action frequency is only 5 Hz, BridgeData V2 policies are trained with action chunking of 4-8 steps, which aligns nicely with Diffusion Policy's standard chunk size. The low control rate is a blessing for data collection (teleoperators hit higher success rates at slow speeds) and a mild curse for dynamic tasks (the dataset is not suitable for anything requiring sub-100ms reactive control).
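The action chunking described above amounts to slicing the (T, 7) action sequence into overlapping windows. A minimal NumPy sketch:

```python
import numpy as np

def chunk_actions(actions, chunk=4):
    """Turn a (T, 7) action sequence into (T - chunk + 1, chunk, 7)
    overlapping windows -- the target format for chunked policies."""
    T = actions.shape[0]
    idx = np.arange(T - chunk + 1)[:, None] + np.arange(chunk)[None, :]
    return actions[idx]

traj = np.random.randn(50, 7)  # 10 s of actions at 5 Hz
chunks = chunk_actions(traj, chunk=4)
print(chunks.shape)  # (47, 4, 7)
```

At 5 Hz, a chunk of 4 predicts 0.8 s of motion per inference call, which is why the low control rate and chunked prediction pair so naturally.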
Language instructions are crowdsourced via Amazon Mechanical Turk. Workers watched each trajectory and wrote a natural-language description of the task, which means instructions include typos, synonyms, and paraphrases. This noise is actually a feature — it forces policies to handle the kind of messy language you see in a real deployment.
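A light normalization pass before tokenization is common; whether to also correct typos is a training-time choice, and many pipelines deliberately keep them. A minimal sketch that only lowercases and collapses whitespace:

```python
import re

def normalize_instruction(text):
    """Lowercase and collapse whitespace. Deliberately leaves typos and
    paraphrases intact -- that variation is part of the training signal."""
    return re.sub(r"\s+", " ", text.strip().lower())

examples = [
    "Put the  Spoon in the sink",
    "put the spoon in the sink.",
]
print([normalize_instruction(t) for t in examples])
```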
Known limitations
- Single embodiment. All 60K trajectories are on a WidowX 250. Transferring Bridge-trained policies to a Franka or UR5 requires cross-embodiment fine-tuning.
- Tabletop bias. Scenes are all roughly sink-, stove-, or desk-height tabletops. Floor manipulation, vertical surfaces, and mobile manipulation are out of distribution.
- Low action rate. 5 Hz is fine for quasi-static manipulation but insufficient for dynamic tasks like pouring or catching.
- Success labels are heuristic. The teleoperator's judgement was the ground truth for success; a small fraction of episodes labeled as successful are actually partial failures.
FAQ
Do I need a WidowX to use BridgeData V2? Only if you want to do closed-loop real-world evaluation. Many groups train on Bridge and evaluate in simulation or on a different arm via cross-embodiment transfer.
How does BridgeData V2 relate to BridgeData V1? V2 is roughly 4x larger, adds 12 new scenes, re-teleoperates most V1 tasks with a cleaner setup, and ships language annotations on every episode (V1 had them only on a subset).
Can I combine BridgeData V2 with my own WidowX data? Yes — the schema is well-documented, and several third-party training pipelines support drop-in mixing.
Related datasets
- Open X-Embodiment — the larger mix that incorporates BridgeData V2
- DROID — Franka-based real-world counterpart at similar scale
- CALVIN — simulation benchmark for long-horizon language-conditioned manipulation
- Robomimic — single-task imitation learning sibling
- LIBERO — simulation benchmark for lifelong learning