RT-X vs OpenVLA: Comparing Foundation Models for Robotics

Google DeepMind's RT-X family kicked off the current era of foundation models for robot control. OpenVLA is the open-weight answer from the research community that made 7B VLAs accessible to every lab. Which should your team build on?

Updated April 2026 · Open weights vs. closed · Both trained on Open X-Embodiment
TL;DR: RT-X (including RT-1-X and RT-2-X) is Google DeepMind's in-house family of action models, trained on Open X-Embodiment. RT-1-X checkpoints were released under Apache 2.0 with JAX/TF code; RT-2-X and successor models are proprietary and unavailable for external deployment. OpenVLA is the 7B, MIT-licensed successor built by the community specifically because RT-2-X was closed, and it outperformed RT-2-X on real-robot evaluations with 7× fewer parameters. For anything commercial, OpenVLA is the realistic choice.

Why this comparison matters

"RT-X" is a family, not a single model, and that confusion kills a lot of procurement conversations. RT-1-X is the ~35M-parameter transformer policy the Open X-Embodiment authors trained and open-sourced to demonstrate cross-embodiment transfer. RT-2-X is the 55B PaLI-X-based VLA that showed large multimodal models can do robot control zero-shot. RT-2-X was never released, and later DeepMind models (Gemini Robotics, RT-H) stayed inside Google. If you are evaluating "RT-X" as a base for your own system, you are really evaluating which RT-X checkpoint is available — usually just RT-1-X — against OpenVLA, which was published end-to-end with weights, code, and adapters.

For a broader view, see our VLA model directory, the RT-X page, or the OpenVLA page.

At-a-glance comparison

| Dimension | RT-X family | OpenVLA |
| --- | --- | --- |
| Parameters | RT-1-X ~35M · RT-2-X 55B (PaLI-X based) | 7B |
| Backbone | RT-1-X: EfficientNet + token learner + transformer; RT-2-X: PaLI-X multimodal LLM | Llama-2 7B + DINOv2 + SigLIP |
| Action head | Discretized action tokens (both variants) | Discretized action tokens |
| Training data | Open X-Embodiment (22 embodiments, ~970K traj) | Open X-Embodiment (~970K traj) |
| Language conditioning | Strong on RT-2-X, moderate on RT-1-X | Strong (LLM-native) |
| Action space | End-effector delta, configurable dim | 7-DOF end-effector delta (extensible) |
| Inference hardware | RT-1-X: single GPU; RT-2-X: multi-GPU inference cluster | Single A100/H100/L40S, bf16 |
| Weights available? | RT-1-X: yes; RT-2-X: no (proprietary) | Yes, on Hugging Face |
| License | RT-1-X: Apache 2.0; RT-2-X: closed | MIT |
| Fine-tuning | RT-1-X fine-tune supported; RT-2-X: no external access | LoRA, QLoRA, and full fine-tune recipes published |
| Paper | Open X-Embodiment paper (2023); RT-2 (arXiv:2307.15818) | Kim et al., CoRL 2024 (arXiv:2406.09246) |
| Code | github.com/google-deepmind/open_x_embodiment | github.com/openvla/openvla |

RT-X: the family that started it all

RT-1-X — the available one

RT-1-X is the Open X-Embodiment re-training of Google's original RT-1 architecture: an EfficientNet visual backbone with FiLM language conditioning, a token learner, and a transformer that outputs discretized actions. It is small (~35M parameters), fast, and, critically, publicly released under Apache 2.0 via the open_x_embodiment repository. If your use case is manipulation with a Franka or Google Robot arm and you want a lightweight baseline, RT-1-X remains a credible starting point even in 2026.
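The token-learner stage mentioned above compresses the visual feature grid into a handful of tokens before the transformer sees them. Below is a minimal numpy sketch of that idea; the weight shapes, grid size, and single linear projection are illustrative assumptions (the real module uses a small learned network), but the mechanism — one spatial-attention map per output token, softmaxed over space — is the same:

```python
import numpy as np

def token_learner(features, w, num_tokens=8):
    """Compress an (H*W, C) feature grid into `num_tokens` tokens.

    A linear map produces one attention-logit map per output token;
    softmax over the spatial axis turns each into a weighting, and each
    token is the weighted average of the input features. Simplified
    sketch: the real module uses a small conv/MLP, not one linear layer.
    """
    logits = features @ w                              # (H*W, num_tokens)
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)      # softmax over space
    return attn.T @ features                           # (num_tokens, C)

rng = np.random.default_rng(0)
feats = rng.standard_normal((9 * 9, 512))  # e.g. a 9x9 backbone feature grid
w = rng.standard_normal((512, 8)) * 0.02
tokens = token_learner(feats, w)
print(tokens.shape)  # (8, 512)
```

The payoff is sequence length: the transformer attends over 8 tokens per frame instead of 81, which is a large part of why RT-1-class models run at real-time rates.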

RT-2-X — the one you cannot have

RT-2-X is a PaLI-X-based 55B VLA that Google trained on the same Open X-Embodiment data. It established the thesis that big VLMs can do robot control, but the weights have never been released. If you read "RT-X performance" in a blog post, check whether they mean the 35M RT-1-X or the 55B RT-2-X — the gap between the two is enormous. RT-2-X is a published reference, not a model you can deploy.

OpenVLA: the open answer

OpenVLA was built explicitly to close the RT-2-X gap for the open-source community. It is 7B parameters: large enough to capture language priors, small enough to serve on one high-end GPU. The OpenVLA paper reports that it outperforms RT-2-X by roughly 16.5% absolute success rate across 29 real-robot evaluation tasks spanning the BridgeData V2 (WidowX) and Google Robot embodiments, despite using 7× fewer parameters; its LIBERO results are reported separately against open baselines such as Octo and Diffusion Policy. It uses a Llama-2 7B LLM with DINOv2 + SigLIP vision encoders, discretizes each action dimension into 256 bins, and is trained on the same Open X-Embodiment corpus RT-X saw.
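The 256-bin action discretization both RT-X and OpenVLA use is easy to picture in code. Here is a hedged sketch; the bounds below are made-up illustrative values (real pipelines derive per-dimension bounds from dataset statistics), but the round trip — continuous action to bin index to bin center — is the core mechanic:

```python
import numpy as np

# Per-dimension bounds would come from dataset statistics; these are
# illustrative values for a 7-DOF end-effector delta action.
LOW = np.array([-0.05] * 6 + [0.0])   # xyz + rpy deltas, then gripper
HIGH = np.array([0.05] * 6 + [1.0])
N_BINS = 256

def discretize(action):
    """Map a continuous action to one of 256 bin indices per dimension."""
    frac = (action - LOW) / (HIGH - LOW)
    return np.clip((frac * N_BINS).astype(int), 0, N_BINS - 1)

def undiscretize(tokens):
    """Map bin indices back to bin-center continuous values."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

a = np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.0, 1.0])
tok = discretize(a)
recovered = undiscretize(tok)
print(np.all(np.abs(recovered - a) <= (HIGH - LOW) / N_BINS))  # True
```

This is why a language model can emit robot actions at all: each action dimension becomes one token from a 256-symbol vocabulary, so control reduces to next-token prediction.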

More importantly for builders: OpenVLA ships with a LoRA fine-tuning recipe that lets a team adapt the policy to a new robot on a single 80 GB A100 in hours rather than days. That deployment path simply does not exist for RT-2-X.
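The reason LoRA makes single-GPU adaptation feasible is arithmetic, not magic: the base weights stay frozen and only a low-rank correction trains. The toy layer below is not OpenVLA's code — it is a self-contained numpy illustration of the low-rank update LoRA applies to each adapted weight matrix:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r).

    Illustrative only. The point: for d=4096 and r=16, the adapter adds
    2*d*r = 131K trainable parameters per layer instead of the full
    d*d = 16.8M, which is why fine-tuning fits on one 80 GB GPU.
    """
    def __init__(self, d, r=16, alpha=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d, d)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d)) * 0.01  # trainable
        self.B = np.zeros((d, r))                    # trainable, init to 0
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus low-rank correction. Because B starts at zero,
        # the adapted layer initially matches the pretrained layer exactly.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d=256)
x = np.ones((1, 256))
print(np.allclose(layer(x), x @ layer.W.T))  # True at initialization
```

In practice you would use a maintained adapter library rather than hand-rolling this, but the memory math above is exactly what makes "one A100, a few hours" plausible for a 7B policy.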

The licensing and access question

This is usually the decisive factor. RT-1-X under Apache 2.0 is commercial-grade but old and small. RT-2-X is not licensable at all. OpenVLA under MIT is the only way to get RT-2-class behavior in a commercially usable package. Teams shipping real robots in 2026 — especially warehouse deployments or lab automation installs that need explicit rights — almost always land on OpenVLA or one of its derivatives.

Hardware footprint

RT-1-X runs on a single GPU at real-time rates and was demonstrated on a range of robots from Franka to Google Robot. It is the lightweight option. OpenVLA needs ~16 GB of VRAM in bfloat16 and typically runs at 5–10 Hz on an A100 without quantization — which is fine for most manipulation but tight for dexterous control. Teams often pair OpenVLA with action chunking and a lower-level impedance controller to hit the required loop rate. RT-2-X, if it were available, would need a multi-GPU inference server — another reason open-source deployments gravitated toward 7B-class models instead.
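The action-chunking pattern mentioned above is simple to state in code: query the slow policy once per chunk of future actions, and let the fast low-level controller drain the chunk at its own rate. The class and rates below are illustrative assumptions, not taken from either codebase:

```python
from collections import deque

class ChunkedController:
    """Bridge a slow policy (e.g. ~5 Hz) to a fast controller (e.g. 50 Hz).

    The policy is queried only when the buffer is empty and returns a
    chunk of future actions; each controller tick pops one action.
    """
    def __init__(self, policy, chunk_size=10):
        self.policy = policy
        self.chunk_size = chunk_size
        self.buffer = deque()

    def step(self, obs):
        if not self.buffer:                       # re-query when drained
            self.buffer.extend(self.policy(obs, self.chunk_size))
        return self.buffer.popleft()

n_queries = 0
def toy_policy(obs, n):
    """Stand-in policy: returns n copies of a constant 7-DOF action."""
    global n_queries
    n_queries += 1
    return [[0.0] * 7] * n

controller = ChunkedController(toy_policy)
calls = [controller.step(obs=None) for _ in range(50)]
print(len(calls), n_queries)  # 50 control steps from 5 policy queries
```

The tradeoff is reactivity: within a chunk the policy cannot respond to new observations, which is why deployments pair this with a lower-level impedance controller that handles fast disturbances.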

When RT-1-X still makes sense

- You need a lightweight (~35M-parameter) baseline that runs at real-time rates on a single GPU.
- Your embodiment is close to the robots RT-1-X was demonstrated on, such as Franka or Google Robot arms.
- Apache 2.0 licensing covers your needs and strong language conditioning is not a requirement.

When OpenVLA is the obvious pick

- You are shipping commercially and need open weights with an explicit MIT license.
- You want strong, LLM-native language conditioning.
- You plan to fine-tune on your own robot data; the published LoRA recipe runs on a single 80 GB A100.
- You can budget one high-end GPU per robot and a 5–10 Hz policy rate is sufficient.

Honest tradeoffs

OpenVLA is not strictly "better" than RT-2-X — DeepMind has continued to push internal models well past RT-2-X, and Google's production robotics stack uses closed descendants. What OpenVLA is, unambiguously, is the best open-weight foundation model in the RT-X lineage. If you need SOTA-at-any-cost and can partner with DeepMind, that is a different conversation. If you are building a product, OpenVLA's MIT weights plus a Supabase-hosted data pipeline plus teleoperation data collection is a realistic stack today.

Benchmarks and evaluation

Both model families publish numbers on LIBERO, Google Robot, and RLBench. OpenVLA's LIBERO numbers are in the paper and reproducible. RT-X's numbers vary by variant — be careful which RT-X row you are quoting. See our benchmarks directory for current suites.

When reading a benchmark claim involving RT-X, always ask three questions: which RT-X variant (RT-1-X or RT-2-X), which embodiment subset of Open X-Embodiment was used for evaluation, and whether the rollout ran on the original real-robot setup (the BridgeData WidowX or the Google Robot) or an external reproduction. The numbers can swing 20 percentage points depending on these choices. OpenVLA evaluations are generally simpler to interpret because the paper is explicit about checkpoint, protocol, and dataset splits.

Fine-tuning and deployment recipes

Downstream teams working with RT-1-X typically fine-tune on narrower task sets within the Open X-Embodiment umbrella — selecting only the trajectories relevant to their target embodiment, then running a short training loop to bring the policy up on their specific hardware. The RT-1-X training code is mature but sparse on production guidance, and most of the downstream knowledge lives in papers rather than recipes.

OpenVLA's fine-tuning story is considerably richer. The official repository ships a LoRA fine-tune example, a full fine-tune example, a dataset conversion tool for custom RLDS data, and reference configurations for several popular robots. The community has added 4-bit quantization, vLLM serving, and integration with LeRobot. For a team that needs to move from "we have teleop data" to "we have a policy running on our robot" in under a month, OpenVLA's tooling is the decisive advantage.
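Whatever conversion tool you use, the end state of "we have teleop data" is the same training unit: one frame, one instruction, one target action per sample. The dict layout below is hypothetical (real RLDS episodes carry far more metadata), but it shows the flattening step every pipeline performs:

```python
def episode_to_samples(episode):
    """Flatten one teleop episode into (image, instruction, action) triples.

    Hypothetical schema for illustration; an actual RLDS conversion
    preserves additional fields, but the supervised training unit is
    the same per-timestep triple.
    """
    return [
        {
            "image": step["image"],
            "instruction": episode["instruction"],
            "action": step["action"],
        }
        for step in episode["steps"]
    ]

demo = {
    "instruction": "pick up the red block",
    "steps": [
        {"image": f"frame_{i}.png", "action": [0.0] * 7} for i in range(3)
    ],
}
samples = episode_to_samples(demo)
print(len(samples))  # 3
```

Once data is in this shape, fine-tuning is ordinary supervised learning: the policy predicts the discretized action tokens for each (image, instruction) pair.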

What DeepMind is shipping next

Google DeepMind has continued to iterate past RT-2-X internally. Gemini Robotics and related projects extend the foundation-model-for-robotics thesis with larger VLMs and tighter coupling to production Google robots. None of these checkpoints have been made available for external deployment, so from a builder's perspective they are worth reading about but not worth building on. OpenVLA, pi0, and the LeRobot-ecosystem models are the realistic forward path for open-source work.

Our recommendation

For almost every practical team today, OpenVLA is the answer. It inherits the RT-X lineage intellectually, matches or beats RT-2-X on public benchmarks, ships with open weights under MIT, and has an active fine-tuning ecosystem. RT-1-X remains a fine tiny baseline, but you are more likely to use Octo or a small Diffusion Policy for that role in 2026. RT-2-X is a paper reference — cite it, do not build on it.