What Makes a Robot Foundation Model

A robot foundation model is a large neural network pre-trained on diverse robot interaction data (and often web data) that can be fine-tuned to new tasks with relatively few demonstrations — analogous to how GPT-4 or CLIP can be fine-tuned for domain-specific NLP or vision tasks.

The critical property is transferability: the model must have learned generalizable representations of objects, scenes, and actions that apply to tasks it was not explicitly trained on. Pre-training on 100,000+ demonstrations across many environments and robot types is the primary mechanism for acquiring this generalizability.

Foundation models are not universally superior to task-specific policies. For a well-scoped, single-task deployment, a well-trained ACT or diffusion policy on 500–1,000 task-specific demonstrations will outperform a fine-tuned foundation model. Foundation models pay off when you need rapid adaptation to many new tasks (>10 new tasks per quarter) or when you have very few demonstrations for a novel task (5–50 shots).
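The break-even logic above can be written as a rough decision rule. The function name and cutoffs below are illustrative, taken from the thresholds quoted in this section, not a standard API:

```python
def choose_policy_type(new_tasks_per_quarter: int, demos_per_task: int) -> str:
    """Heuristic from the trade-offs above: prefer a task-specific policy
    (ACT / diffusion policy) for a well-scoped task with ample demos;
    prefer a fine-tuned foundation model for many new tasks or few shots."""
    if new_tasks_per_quarter > 10 or demos_per_task <= 50:
        return "foundation-model"
    if demos_per_task >= 500:
        return "task-specific"
    return "either"  # gray zone: benchmark both on your task

print(choose_policy_type(1, 800))   # task-specific
print(choose_policy_type(15, 30))   # foundation-model
```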

RT-2: Robotic Transformer 2

RT-2 (Google DeepMind, 2023) is the most widely cited robot foundation model. It fine-tunes a large vision-language model (PaLI-X, 55B parameters) to output robot actions tokenized as text, co-training on web image-text data and robot trajectory data.

  • Architecture: PaLI-X VLM backbone with action tokens appended to the vocabulary. Actions are discretized into 256 bins per dimension and decoded as text tokens.
  • Training data: ~130K robot episodes from RT-1 dataset (Google robot kitchen tasks) plus massive web text/image data. The web data is the key — it provides the semantic understanding of objects and scenes.
  • Key results: RT-2 achieves 62% success on novel objects and 55% on novel backgrounds with zero-shot generalization, vs. RT-1's 32% and 28% respectively. It can respond to natural language instructions like "pick up the object that can be used to put out a fire."
  • Inference cost: Running 55B parameters requires a multi-GPU server. Not deployable on edge hardware. Inference latency 1–3 seconds — acceptable only for slower robot control loops.
  • Limitations: Closed weights (Google internal); not directly fine-tunable by external teams. The follow-up RT-2-X, trained on the Open X-Embodiment dataset, extends the approach across robot embodiments.
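The tokenized-action scheme in the architecture bullet can be sketched as a discretize/decode round trip. The 256-bin count matches the text; the action bounds and helper names below are illustrative assumptions:

```python
import numpy as np

N_BINS = 256  # per-dimension discretization used by RT-2-style tokenization

def tokenize_action(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer bin indices (one token
    per dimension). `low`/`high` are per-dimension action bounds."""
    action = np.clip(action, low, high)
    norm = (action - low) / (high - low)               # -> [0, 1]
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=N_BINS):
    """Invert tokenization to the center of each bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)

low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
a = np.array([0.3, -0.7])
recovered = detokenize_action(tokenize_action(a, low, high), low, high)
# Reconstruction error is bounded by half a bin width: (high - low) / (2 * 256)
assert np.all(np.abs(recovered - a) <= (high - low) / (2 * N_BINS))
```

The half-bin error bound is exactly the discretization artifact that continuous action heads (see π0 below in this document's terms: flow matching) avoid.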

OpenVLA

OpenVLA (Stanford, UC Berkeley, and collaborators, 2024) is the leading open-weight robot foundation model as of 2025. It pairs a 7B-parameter LLaMA-2 language model with a fused DINOv2 + SigLIP vision encoder, trained on the Open X-Embodiment dataset.

  • Architecture: Prismatic VLM (LLaMA-2 7B + fused DINOv2 and SigLIP vision encoders). Action output via tokenized discretization (256 bins). Input: 224×224 image + natural language instruction.
  • Training data: ~970K robot episodes curated from Open X-Embodiment, a collection of over 1 million trajectories spanning 22 robot embodiments contributed by 21 research institutions.
  • Open weights: Model weights published on HuggingFace (openvla/openvla-7b). Fine-tunable with standard LoRA or full fine-tuning on consumer hardware.
  • Fine-tuning cost: LoRA fine-tuning on 1,000 task-specific demonstrations takes approximately 4–8 hours on a single A100 GPU (~$500–$800 in cloud compute). Full fine-tuning requires 4–8× A100s.
  • Inference speed: ~500 ms per action on A100 (7B params). Quantized INT4 version runs at ~100–200 ms. Not suitable for >2 Hz control without optimization.
  • Performance: Matches or exceeds RT-2 on BridgeV2 benchmark tasks. Achieves 56–65% success on unseen tasks with zero additional fine-tuning.
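The LoRA fine-tuning mentioned above trains only a low-rank update on top of frozen pretrained weights, which is why it fits in a few single-GPU hours. A minimal NumPy sketch of the standard LoRA formulation (shapes and the α/r scaling follow the original LoRA recipe; nothing here is OpenVLA-internal code):

```python
import numpy as np

d_out, d_in, r, alpha = 64, 128, 8, 16   # rank r << min(d_out, d_in)

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, init 0

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapted model starts identical to the base.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameter count vs. full fine-tuning of this one matrix:
trainable, total = A.size + B.size, W.size
print(f"trainable params: {trainable} / {total}")
```

At realistic transformer dimensions (d ≈ 4096, r ≈ 8–32) the trainable fraction drops well below 1%, which is what keeps the quoted $500–$800 single-A100 budget plausible.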

Octo

Octo (Berkeley, 2024) takes a different architectural approach: a smaller transformer model with a diffusion policy action head, trained entirely on robot data (no web data). This makes it faster to fine-tune and deploy.

  • Architecture: Transformer observation encoder (images + proprioception + language) + diffusion action head. Total parameters ~93M — dramatically smaller than VLM-based approaches.
  • Training data: 800,000 robot demonstrations from Open X-Embodiment (filtered for quality). 9 robot types.
  • Inference speed: ~30–100 ms per action on a standard GPU. Deployable on NVIDIA Jetson AGX Orin (275 TOPS) for edge inference.
  • Fine-tuning: 1,000 demonstrations fine-tuned in <1 hour on a single A100. Much faster iteration cycle than VLM-based models.
  • Limitations: Does not understand natural language semantics as well as VLM-based models. Worse at zero-shot novel object generalization. Better for scenarios where you have task-specific fine-tuning data.
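Octo's diffusion action head produces actions by iterative denoising rather than token decoding. The structural sketch below uses a DDPM-style sampling loop with a stand-in noise predictor; the step count, noise schedule, and `eps_model` are illustrative assumptions, not Octo's trained network:

```python
import numpy as np

ACTION_DIM, N_STEPS = 7, 16   # e.g. a 7-DOF arm; a handful of denoising steps

def eps_model(a_t, t, obs_embedding):
    """Stand-in for the transformer-conditioned noise predictor. A real
    head is trained to predict the noise that was added at step t."""
    return 0.1 * a_t + 0.01 * t + 0.05 * obs_embedding  # toy fixed map

def sample_action(obs_embedding, n_steps=N_STEPS, rng=None):
    """DDPM-style sampling: start from Gaussian noise and iteratively
    denoise, conditioned on the observation embedding."""
    if rng is None:
        rng = np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.1, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = rng.normal(size=ACTION_DIM)                      # a_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        eps = eps_model(a, t, obs_embedding)
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                        # no noise at the final step
            a += np.sqrt(betas[t]) * rng.normal(size=ACTION_DIM)
    return a

action = sample_action(np.ones(ACTION_DIM) * 0.2)
assert action.shape == (ACTION_DIM,)
```

The per-action cost is one forward pass per denoising step, which is why a small (93M) backbone with few steps stays in the 30–100 ms range quoted above.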

π0 (Physical Intelligence)

π0 (Physical Intelligence, 2024) is the most capable reported robot foundation model for dexterous bimanual manipulation. It uses a flow matching action head (a simulation-free way to train continuous normalizing flows), which avoids some of the discretization artifacts of tokenized action outputs.

  • Architecture: VLM backbone (PaliGemma) + flow matching action head. Trained on proprietary data from Physical Intelligence's robot fleet.
  • Capabilities: Demonstrated on laundry folding, table busing, making sandwiches — high-dexterity bimanual tasks with deformable objects that other models have not matched.
  • Availability: Not open-weight. Available through Physical Intelligence's commercial API (limited access, enterprise pricing). Not self-hostable.
  • Flow matching advantage: Flow matching avoids the discretization bins required by tokenized action outputs, enabling smoother and more precise actions, particularly important for high-DOF hand and bimanual tasks.
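Concretely, a flow matching head learns a velocity field and generates an action by integrating the ODE da/dt = v(a, t) from Gaussian noise at t = 0 to an action at t = 1, with no binning anywhere. The sketch below uses an idealized, hand-written velocity field; in a real model the field is a trained network conditioned on observations:

```python
import numpy as np

ACTION_DIM = 14  # e.g. two 7-DOF arms for a bimanual task

def velocity_field(a, t, obs_embedding):
    """Stand-in for a learned velocity network v_theta(a, t | obs).
    This toy field flows straight toward a 'target' action that we
    pretend the conditioning encodes."""
    target = obs_embedding
    return (target - a) / max(1.0 - t, 1e-3)

def generate_action(obs_embedding, n_steps=32, rng=None):
    """Euler integration of da/dt = v(a, t) from t=0 (noise) to t=1 (action)."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.normal(size=ACTION_DIM)   # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        a = a + dt * velocity_field(a, i * dt, obs_embedding)
    return a

target = np.linspace(-0.5, 0.5, ACTION_DIM)
action = generate_action(target)
# With this idealized straight-line field, integration lands on the target,
# with no half-bin quantization error anywhere in the pipeline.
assert np.max(np.abs(action - target)) < 1e-6
```

The output is a continuous vector, which is the "smoother and more precise actions" advantage for high-DOF control noted above.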

Capability Comparison

Model              | Params | Weights    | Zero-Shot | Fine-Tune Cost | Inference Speed | Best For
RT-2               | 55B    | Closed     | Excellent | N/A            | 1–3 s           | Google internal
OpenVLA            | 7B     | Open       | Good      | $500–$800      | 100–500 ms      | Open research, fine-tuning
Octo               | 93M    | Open       | Moderate  | <$100          | 30–100 ms       | Edge deployment, fast iteration
π0                 | ~3B    | Closed API | Excellent | Custom         | Unknown         | Dexterous bimanual
Task-specific ACT  | ~300M  | N/A        | None      | $50–$200       | 15–30 ms        | Single well-scoped task
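One way to read the inference-speed column: a policy queried synchronously once per control step caps the control rate at 1/latency. A quick sanity check using worst-case latencies from the table above:

```python
# Worst-case per-action latency (seconds) from the comparison table.
worst_case_latency = {
    "RT-2": 3.0,
    "OpenVLA": 0.5,
    "Octo": 0.1,
    "Task-specific ACT": 0.03,
}

def max_control_rate_hz(latency_s: float) -> float:
    """Upper bound on closed-loop frequency when the policy is queried
    synchronously once per control step."""
    return 1.0 / latency_s

for model, lat in worst_case_latency.items():
    print(f"{model}: <= {max_control_rate_hz(lat):.1f} Hz")
# OpenVLA's 500 ms worst case caps synchronous control at 2 Hz, matching
# the ">2 Hz" caveat above; action chunking or asynchronous execution is
# needed for faster loops.
```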

Deployment Considerations

  • Quantization: OpenVLA INT8 quantization reduces GPU memory from 14 GB to ~7 GB with <5% performance degradation. INT4 (4-bit) reduces to ~3.5 GB with 8–12% degradation. Use bitsandbytes or GPTQ for quantization.
  • Edge inference: Octo (93M params) is the only current foundation model that runs comfortably on NVIDIA Jetson AGX Orin. OpenVLA and π0 require a server. For edge deployment of larger models, consider running inference on a co-located GPU server connected to the robot over Ethernet.
  • Fine-tuning data: Even foundation models benefit substantially from 200–1,000 demonstrations on the specific target task and robot. Cross-embodiment transfer is imperfect — a model trained on WidowX data will transfer, but not perfectly, to a Franka arm.
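The quantization figures in the first bullet follow directly from bytes per parameter. A quick estimator (weights only; real deployments add activation, KV-cache, and quantization-scale overhead on top):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate GPU memory for model weights alone (1 GB = 1e9 bytes).
    Ignores activations and KV cache, which add real-world overhead."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp16", "int8", "int4"):
    print(f"OpenVLA 7B @ {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
# fp16 -> 14.0 GB, int8 -> 7.0 GB, int4 -> 3.5 GB, matching the figures above.
```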

The SVRC training platform supports OpenVLA and Octo fine-tuning with a one-click interface. Bring your collected demonstrations and we handle compute, hyperparameter search, and evaluation against standard benchmarks.