Open X-Embodiment vs DROID: Which Robot Dataset Should You Use?
Open X-Embodiment and DROID are the two largest open robot-learning datasets in 2026. They sit at opposite ends of a design spectrum: Open X-Embodiment is a federated mega-corpus assembled from 33 research labs across 22 different robot types, while DROID is a single-embodiment, single-protocol dataset collected on a Franka Panda across hundreds of real scenes. Which one to use depends on whether you care about scale-for-pretraining or consistency-for-fine-tuning.
TL;DR
- Raw trajectory count: Open X-Embodiment wins — 1M+ episodes vs DROID's ~76K.
- Embodiment diversity: Open X-Embodiment wins — 22 robot types vs DROID's single Franka Panda.
- Scene diversity per trajectory: DROID wins — 564 scenes across 13 institutions with consistent sensors.
- Data consistency: DROID wins by a wide margin — same hardware, same camera rig, same episode format.
- Ease of training a policy: DROID wins — no cross-embodiment action normalization needed.
- Ease of training a foundation model: Open X-Embodiment wins — it is what RT-X was built on.
Head-to-head comparison table
| Attribute | Open X-Embodiment | DROID |
|---|---|---|
| Released | Oct 2023 (Google DeepMind + 33 partners) | Mar 2024 (Stanford / UC Berkeley / TRI + 13 partners) |
| Trajectories | 1,000,000+ episodes | ~76,000 trajectories (~350 hours) |
| Embodiments | 22 robot types (Franka, Google Robot, xArm, WidowX, UR5, etc.) | 1 (Franka Emika Panda with parallel-jaw gripper) |
| Scenes / environments | Mix of labs and uncontrolled environments | 564 scenes across 13 institutions |
| Collection duration | Aggregate of many prior datasets (years) | ~18 months of coordinated collection |
| Cameras | Varies by source dataset | 2x ZED 2 stereo + 1x ZED Mini wrist (consistent) |
| Format | Unified RLDS (TensorFlow Datasets) | RLDS + HDF5; also on HuggingFace / LeRobot |
| Approx. size on disk | Multiple TB (varies by component) | ~1.7 TB raw |
| License | Per-component (mostly CC-BY and Apache-2.0) | MIT + CC-BY-4.0 |
| Reference models | RT-1-X, RT-2-X, OpenVLA, Octo | DROID diffusion policies; used by pi0, OpenVLA fine-tunes |
| Paper | arXiv:2310.08864 | arXiv:2403.12945 |
The design philosophy gap
Open X-Embodiment (OXE) was conceived as a scaling experiment: if we pool every open manipulation dataset into one standardized format, can we train a single policy that transfers across 22 different robots? The answer, as demonstrated by RT-1-X and RT-2-X, was a cautious yes. OXE is therefore best understood as a pretraining corpus, not as a curated benchmark. The trajectories vary wildly in quality, camera rigs, lighting, action frequencies, and gripper conventions; consuming them requires careful normalization through the shared RLDS schema.
DROID took the opposite approach. Rather than aggregating heterogeneous data, DROID standardized a single hardware rig (Franka Panda, two ZED 2 cameras for scene, one ZED Mini mounted on the wrist) and shipped it to 13 research institutions. Over roughly 18 months, those labs collected around 76,000 in-the-wild trajectories in offices, kitchens, labs, hallways, and homes. Every trajectory has the same observation shape, the same action convention, and the same language annotation format. This makes DROID vastly easier to train on than OXE, and it pays off in cleaner fine-tunes.
When to train on which
Pretrain on OXE if you are building a generalist cross-embodiment policy. RT-X and OpenVLA are existence proofs that OXE-scale pretraining unlocks behaviors that no single-embodiment dataset can. If you plan to fine-tune on your own embodiment anyway, OXE gives you the biggest possible prior.
Fine-tune on DROID if your target hardware is a Franka Panda or a near-equivalent 7-DoF arm, or you are training a diffusion policy and want high-quality action labels with consistent gripper semantics. DROID is also the better choice if you care about in-the-wild generalization — its 564 scenes are more visually varied than most OXE components.
Do both if you have the compute. The now-standard recipe is OXE pretrain, DROID fine-tune, then a second fine-tune on your own task demos. This is what pi0 and several 2025 humanoid policies used, and it continues to set the standard in 2026.
Compute required to train
OXE pretraining at OpenVLA scale (7B parameters) originally used 64 A100 GPUs for roughly two weeks. That is a real capital cost — on the order of $100K on public cloud. If you only want to fine-tune a pretrained checkpoint, a single 8xA100 or 8xH100 node for a day or two is typical. DROID fine-tunes can be done on 1-4 GPUs because the dataset is much smaller; many teams train full DROID diffusion policies on a single 8x3090 workstation.
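The "$100K" figure above is back-of-envelope arithmetic, not a quote. A quick sanity check, assuming an illustrative on-demand A100 price of $4.00 per GPU-hour (actual rates vary widely by provider and commitment):

```python
# Back-of-envelope pretraining cost for 64 A100s over two weeks.
# The per-GPU-hour price is an assumption, not a quoted rate.
GPUS = 64
DAYS = 14
PRICE_PER_GPU_HOUR = 4.00  # assumed on-demand A100 rate, USD

gpu_hours = GPUS * DAYS * 24
cost = gpu_hours * PRICE_PER_GPU_HOUR
print(f"{gpu_hours:,} GPU-hours ≈ ${cost:,.0f}")  # 21,504 GPU-hours ≈ $86,016
```

Reserved or spot capacity can cut this substantially, which is why the prose hedges at "on the order of $100K."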
For inference, both datasets produce policies that comfortably run on a consumer GPU. We do not recommend training from scratch on either dataset unless you have a specific research reason; the pretrained checkpoints on HuggingFace are extremely strong starting points.
Licensing: read the fine print
OXE is not a single license. It is a registry of component datasets, each with its own terms. Most components are CC-BY-4.0 or Apache-2.0, which means you can train commercial models on them and ship derivative weights. A handful of components are research-only; if you are building a product, audit the component list against your legal posture. DROID is straightforward — MIT + CC-BY-4.0 on the data, permissive for commercial use, with attribution. If license cleanliness matters for your deployment, DROID is the safer pick.
Data format and tooling
Both datasets use RLDS as the canonical format, and both are mirrored on HuggingFace Hub through the LeRobot ecosystem. If you are starting fresh in 2026, LeRobot is the path of least resistance — it gives you a uniform PyTorch interface over both. If you want the original TFDS-based pipeline, both datasets ship reproducible download scripts. Our data platform ingests both formats and also supports converting your own collected demos into OXE-compatible RLDS.
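Whichever loader you pick, the data you get back follows the RLDS episode shape: a dict with a `steps` sequence, each step carrying an observation, an action, and episode-boundary flags. The sketch below builds a dummy episode in that shape and flattens it into training pairs — the exact field names vary per component dataset, so treat the keys here as illustrative assumptions rather than the canonical schema:

```python
import numpy as np

# Illustrative RLDS-style episode. Key names follow common OXE
# conventions ("observation", "action", "is_first", "is_last",
# "language_instruction") but exact keys differ per component --
# always check the component's feature spec.
def make_dummy_episode(n_steps=5):
    return {
        "steps": [
            {
                "observation": {"image": np.zeros((224, 224, 3), np.uint8)},
                "action": np.zeros(7, np.float32),  # e.g. 6-DoF EE delta + gripper
                "is_first": t == 0,
                "is_last": t == n_steps - 1,
                "language_instruction": "pick up the cup",
            }
            for t in range(n_steps)
        ]
    }

def to_transitions(episode):
    """Flatten an episode into (observation, action) training pairs."""
    return [(s["observation"], s["action"]) for s in episode["steps"]]

episode = make_dummy_episode()
pairs = to_transitions(episode)
print(len(pairs))  # 5
```

A real pipeline would iterate episodes from TFDS or LeRobot instead of the dummy builder, but the flattening step looks the same.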
Gotchas worth knowing
OXE action spaces are not identical. Even within the shared RLDS schema, different components use different action conventions (end-effector vs joint, delta vs absolute, different gripper encodings). The OXE team published a normalization layer, but it is not perfect. Plan to spend real engineering time on preprocessing before you get a clean training run.
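To make the preprocessing burden concrete, here is a minimal sketch of two of the transforms involved: converting absolute end-effector waypoints to per-step deltas, and mapping differing gripper encodings onto a common [0=closed, 1=open] scale. The conventions and the 0.08 m gripper width are illustrative assumptions, not the official OXE normalization layer:

```python
import numpy as np

def absolute_to_delta(ee_positions):
    """Convert absolute EE waypoints to per-step deltas (first step = 0)."""
    ee = np.asarray(ee_positions, dtype=np.float32)
    return np.diff(ee, axis=0, prepend=ee[:1])

def normalize_gripper(raw, convention):
    """Map a component's gripper signal to [0=closed, 1=open]."""
    raw = np.asarray(raw, dtype=np.float32)
    if convention == "open_is_1":
        return raw
    if convention == "open_is_0":        # inverted binary encoding
        return 1.0 - raw
    if convention == "width_metres":     # assumed 0.08 m max opening
        return raw / 0.08
    raise ValueError(f"unknown convention: {convention}")

deltas = absolute_to_delta([[0.0, 0.0, 0.10],
                            [0.0, 0.0, 0.12],
                            [0.0, 0.0, 0.15]])
grip = normalize_gripper([0.0, 1.0], "open_is_0")
```

Multiply this by dozens of components, each needing its own branch, and "real engineering time" is an understatement.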
DROID wrist camera is lower-resolution. The wrist ZED Mini delivers 720p stereo; the scene cameras deliver 1080p. Most modern VLA architectures downsample anyway, but if you are training a high-resolution perception model, be aware of the resolution ceiling.
Both have class imbalance. OXE is dominated by a handful of large components (notably the Google Robot datasets). DROID is dominated by a few institutions that collected the majority of hours. Mix carefully.
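One common way to "mix carefully" is to sample components in proportion to size raised to a power below 1, which upweights small components without ignoring large ones. The exponent below is an illustrative choice, not a published recipe:

```python
import numpy as np

def sampling_weights(component_sizes, alpha=0.5):
    """Per-component sampling probabilities: size**alpha, renormalized.
    alpha=1.0 reproduces raw proportions; alpha=0.0 is uniform."""
    sizes = np.asarray(component_sizes, dtype=np.float64)
    w = sizes ** alpha
    return w / w.sum()

# e.g. one dominant component vs three small ones (episode counts are made up)
weights = sampling_weights([400_000, 25_000, 10_000, 5_000])
```

With alpha=0.5 the dominant component drops from ~91% of samples to ~66%, while the smallest rises from ~1% to ~7%.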
Evaluation: how do policies trained on each compare?
When OpenVLA was released, its ablations included checkpoints trained on OXE-only, DROID-only, and OXE+DROID. On in-distribution DROID tasks, the DROID-only checkpoint was competitive with OXE-only, sometimes better. On out-of-distribution evaluation (unseen scenes or objects), the OXE-only checkpoint generalized further. The combined OXE+DROID checkpoint was strictly better than either. This is the empirical basis for the now-standard "OXE pretrain, DROID fine-tune" recipe, and it has held up in follow-up work through 2025 and into 2026.
The gap is particularly pronounced on long-tail behaviors. DROID alone handles standard pick-and-place reliably but struggles on anything outside its distribution. OXE alone handles a wider variety of tasks but produces jerkier action traces because the underlying action distributions are not normalized across components. Combining the two smooths out both failure modes.
Storage, streaming, and practical download
Downloading the full OXE corpus is a multi-terabyte operation. Most teams do not bother — they stream the components they care about through TFDS or use the HuggingFace mirror with streaming mode. DROID is more manageable at ~1.7 TB raw. If you are working on a smaller rig, HuggingFace LeRobot distributes a preprocessed 100-trajectory DROID subset (DROID-100) that is excellent for prototyping pipelines before you commit to the full pull. For production training, bring the data local — remote streaming adds latency that hurts training throughput.
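Before committing to a full pull, it is worth estimating the transfer time. A simple calculation for the ~1.7 TB DROID raw dump, at an assumed sustained bandwidth (your link will vary, and real transfers rarely sustain line rate):

```python
# Rough download-time estimate for the ~1.7 TB DROID raw dump.
# The 1 Gbit/s sustained bandwidth is an assumption.
TB = 1.7
bandwidth_gbps = 1.0                      # assumed sustained Gbit/s
bits_total = TB * 1e12 * 8
seconds = bits_total / (bandwidth_gbps * 1e9)
print(f"~{seconds / 3600:.1f} hours at {bandwidth_gbps} Gbit/s")  # ~3.8 hours
```

At 1 Gbit/s that is under half a day; the full OXE corpus at several TB can stretch to days, which is why most teams pull only the components they need.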
What we recommend
For most teams building a deployable manipulation policy in 2026, our advice is: start from an OXE-pretrained OpenVLA or pi0 checkpoint, fine-tune on DROID to stabilize action distributions, and then fine-tune again on your own task-specific teleoperation data. If you do not have task-specific data yet, our custom data collection can bootstrap you in a few weeks. If you are still choosing hardware, the Franka Panda is the obvious choice for DROID compatibility, but WidowX arms also work well if you plan to lean on BridgeData — see our BridgeData vs RoboMimic comparison for that path.