Why Fine-Tune Rather Than Train from Scratch
Training a manipulation policy from scratch requires thousands of task-specific demonstrations. A well-trained foundation model changes this equation dramatically.
A robot foundation model pre-trained on diverse internet-scale visual data and thousands of robot demonstrations already understands: visual semantics ("red cup" vs. "blue mug"), object affordances (cups are graspable by the rim or body), language grounding ("pick up" vs. "push"), and basic action priors (smooth trajectories, approach angles).
Fine-tuning transfers these priors to your specific task, yielding roughly 3–10× better sample efficiency: where training from scratch might require 5,000 demonstrations, fine-tuning a foundation model often reaches comparable performance with 200–500. In practice, that is the difference between weeks of data collection and days.
Foundation Model Options
- OpenVLA (7B parameters, LLaMA backbone, fully open weights) — the most accessible foundation model for fine-tuning as of 2025. Strong community, well-documented fine-tuning recipes, active HuggingFace page. Language-conditioned: accepts free-text instructions. Best generalization across visual domains. Inference: 6 Hz on A100, 2 Hz on RTX 4090.
- Octo (90M transformer + diffusion action head, fully open weights) — faster inference (25 Hz on single A100) and lower fine-tuning cost than OpenVLA. Diffusion head generates smooth trajectories. Less visual generalization than OpenVLA but excellent for within-distribution tasks. Published at RSS 2024.
- π0 (Physical Intelligence, flow-matching architecture, restricted access) — state-of-the-art on dexterous manipulation benchmarks as of early 2025. Flow matching produces high-quality trajectories for contact-rich tasks. Access through Physical Intelligence partnership program; not fully open as of this writing. Best choice if dexterous manipulation (folding, assembly) is the target.
- RoboFlamingo / 3D-VLA — emerging alternatives with strong 3D spatial reasoning. Watch these for tasks requiring precise 3D placement.
Data Requirements
Minimum demonstration counts for fine-tuning vs. training from scratch:
- OpenVLA fine-tuning: 50–500 demonstrations depending on task complexity. Simple pick-and-place: 50 demos often sufficient. Complex multi-step assembly: 300–500. Collect with wrist-mounted camera + external view.
- Octo fine-tuning: 200–1,000 demonstrations. Slightly more data-hungry than OpenVLA for the same task, but faster to train. Particularly efficient when your task is similar to BridgeData (tabletop manipulation).
- Training from scratch (ACT, Diffusion Policy): 5,000–50,000 demonstrations depending on task breadth. The ALOHA paper reports success with ~50 demos on simple tasks and ~200 on complex bimanual tasks, but those policies are strong only in-distribution and do not generalize to novel objects or scenes.
- Data quality matters more than quantity. 100 expert demonstrations (smooth, consistent, no hesitation) outperform 500 mediocre demonstrations. Use a teleoperation system with haptic feedback for highest-quality data collection.
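One way to make the quality-over-quantity point operational is to screen demonstrations automatically before training. Below is a minimal sketch of a smoothness filter that scores each trajectory by mean squared jerk (third finite difference of end-effector positions) and drops jittery ones. The `jerk_threshold` value and the `dt` timestep are illustrative assumptions, not parameters from any published pipeline.

```python
import numpy as np

def mean_squared_jerk(positions: np.ndarray, dt: float = 0.1) -> float:
    """Mean squared jerk of a (T, D) trajectory: third finite difference / dt^3."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.mean(jerk**2))

def filter_demos(demos, dt=0.1, jerk_threshold=50.0):
    """Keep demonstrations whose mean squared jerk falls below the threshold."""
    return [d for d in demos if mean_squared_jerk(d, dt) < jerk_threshold]

# Smooth demo: straight-line reach. Jittery demo: the same reach plus
# high-frequency hand tremor, as you might get from poor teleoperation.
t = np.linspace(0, 1, 50)[:, None]
smooth = t * np.array([0.3, 0.0, 0.2])          # (50, 3) end-effector positions
rng = np.random.default_rng(0)
jittery = smooth + rng.normal(0, 0.01, smooth.shape)

kept = filter_demos([smooth, jittery])          # only the smooth demo survives
```

A filter like this catches hesitation and tremor; it does not catch semantic problems (wrong object grasped, inconsistent strategy), which still require manual review.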
Training Setup
Practical compute requirements for fine-tuning:
- OpenVLA full fine-tuning: 4× A100 80GB, approximately 8 hours for 500 demonstrations, 10,000 gradient steps. Total cloud cost: ~$200–400 on Lambda Cloud or RunPod.
- OpenVLA with LoRA: Rank-16 LoRA reduces memory to fit on a single A100 40GB with minimal performance loss (typically <5% success rate difference vs. full fine-tuning). Training time: 3–4 hours.
- Octo fine-tuning: 1× A100 40GB, 2–4 hours for 200–500 demonstrations. Most cost-effective option for moderate tasks.
- Use HuggingFace `transformers` + `peft` for OpenVLA LoRA. The `lerobot` library provides Octo fine-tuning scripts. Validate on a held-out 20% of your demonstrations before running hardware evaluation.
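To build intuition for why rank-16 LoRA fits on a single A100 40GB, here is a self-contained numpy sketch of the low-rank update W' = W + (α/r)·B·A: only the factors A and B are trained, so trainable parameters (and optimizer state) scale with r·(d_in + d_out) instead of d_in·d_out. The layer dimensions are illustrative, not OpenVLA's actual sizes.

```python
import numpy as np

d_in, d_out, r, alpha = 4096, 4096, 16, 32   # illustrative transformer-layer sizes

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable low-rank factor
B = np.zeros((d_out, r))                     # trainable; zero-init so W' == W at step 0

def lora_forward(x):
    """y = W x + (alpha/r) * B (A x): base output plus a low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size                         # what full fine-tuning would update
lora_params = A.size + B.size                # what LoRA updates
reduction = full_params / lora_params        # 128x fewer trainable params here
```

In practice you would express the same idea through `peft`'s `LoraConfig` rather than by hand; the sketch just shows where the memory savings come from.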
Expected Results
| Model | Demos Required | Expected Success Rate (Fine-Tune) | Training from Scratch (Same Demos) |
|---|---|---|---|
| OpenVLA | 100 | 55–70% | 15–30% |
| OpenVLA | 300 | 70–85% | 35–55% |
| Octo | 200 | 60–75% | 20–40% |
| Octo | 500 | 75–88% | 45–65% |
| π0 (if available) | 100 | 65–80% | N/A (requires much more) |
Numbers are approximate and task-dependent. Pick-and-place tasks with clear visual distinction are at the high end. Dexterous manipulation (cloth, deformable objects) is at the low end. Always validate on held-out objects.
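Success rates from small hardware evaluations are noisy, so before comparing your result against the bands above, put an interval on it. A hedged sketch using the Wilson score interval (a standard binomial confidence interval, not something specific to these models):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial success rate (z=1.96 -> 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

lo, hi = wilson_interval(14, 20)   # 70% observed over 20 trials
```

With 14/20 successes the interval spans roughly 48–86%, wide enough to overlap several rows of the table; running 50+ evaluation trials narrows it substantially.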
Common Failure Modes
- Catastrophic forgetting — fine-tuning overwrites pre-trained features, degrading generalization. Solution: use LoRA (freezes most weights) or add a small fraction of pre-training data to your fine-tuning mix.
- Mode collapse — policy converges to a single behavior (e.g., always approaches from the same angle) ignoring object variety. Solution: increase dataset diversity, add augmentation (random crops, color jitter), ensure training objects vary in pose and orientation.
- Overfitting — policy memorizes training demonstrations. Signs: training success rate well above 80% while novel-object success stays below 30%. Solution: validate on held-out objects during training, use early stopping, and add dropout.
- Action distribution mismatch — your robot's action space (joint angles, end-effector pose) does not match the foundation model's expected format. Carefully map your robot's action space to the model's expected interface before fine-tuning.
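For the action-mismatch point, a common convention is that the model emits actions normalized to [-1, 1] per dimension, which you must fit from your own dataset's statistics and invert at deployment. The sketch below uses 1st/99th-percentile bounds with clipping; the percentile choice and the 3-DoF delta-position action space are illustrative assumptions, so check your model's actual normalization scheme.

```python
import numpy as np

def fit_action_stats(actions: np.ndarray):
    """Per-dimension 1st/99th percentiles of the (N, D) training actions."""
    return np.percentile(actions, 1, axis=0), np.percentile(actions, 99, axis=0)

def normalize(a, lo, hi):
    """Map a raw robot action into [-1, 1]; clip outliers beyond the fitted range."""
    return np.clip(2 * (a - lo) / (hi - lo) - 1, -1.0, 1.0)

def denormalize(a, lo, hi):
    """Inverse map from the model's [-1, 1] output back to robot units."""
    return (a + 1) / 2 * (hi - lo) + lo

rng = np.random.default_rng(0)
raw = rng.uniform([-0.5, -0.5, 0.0], [0.5, 0.5, 0.3], size=(1000, 3))  # x, y, z deltas
lo, hi = fit_action_stats(raw)

mid = np.median(raw, axis=0)
roundtrip = denormalize(normalize(mid, lo, hi), lo, hi)  # recovers mid exactly
```

Fit the statistics once on your fine-tuning dataset and freeze them; refitting at deployment time silently shifts the action distribution the policy was trained on.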
Deployment
After fine-tuning, optimize for inference speed on your deployment hardware:
- ONNX export — export OpenVLA or Octo to ONNX for hardware-agnostic deployment. Reduces Python overhead, enables edge deployment.
- TensorRT for Jetson — convert ONNX to TensorRT engine for NVIDIA Jetson AGX Orin. OpenVLA LoRA model runs at 4–6 Hz on Jetson AGX Orin with INT8 quantization.
- Quantization — INT8 quantization reduces model size by 4× with <3% success rate drop for most manipulation tasks. Use `bitsandbytes` for quick quantization.
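The 4× size claim follows directly from storing int8 instead of float32. Here is a minimal numpy sketch of symmetric per-tensor INT8 quantization to show where the number comes from; it is illustrative of the idea, not the exact scheme `bitsandbytes` or TensorRT implements (those use finer-grained scaling and calibration).

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8: scale by max |w| so values land in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
size_ratio = w.nbytes / q.nbytes                    # 4.0: float32 -> int8
max_err = np.abs(dequantize(q, scale) - w).max()    # bounded by scale / 2
```

Per-tensor scaling like this is sensitive to outlier weights (one large value inflates `scale` and coarsens everything else), which is why production toolkits quantize per-channel or per-block.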
Access fine-tuning infrastructure and collected datasets via the SVRC platform, or explore our data collection services for high-quality demonstration collection.