Step-by-step guide to fine-tuning OpenVLA or pi0 on your teleoperation dataset — data prep, training configuration, evaluation, quantization, and deployment on real hardware.
You will take a pre-trained VLA model (OpenVLA-7B), fine-tune it on your robot's demonstration data, evaluate it with rollout experiments, quantize it for real-time inference, and deploy it as a ROS2 action server on your robot. Expect 60–80% success rate on in-distribution tasks on your first run, improving with iterative data collection.
VLA Fine-Tuning Pipeline
RLDS Dataset (300+ episodes) → OpenVLA / pi0 (pre-trained, 7B) → Fine-Tune (LoRA or full) → Evaluate (rollout tests) → Deploy (ROS2 + INT4)
1
Check Prerequisites and GPU Setup
Verify your GPU, CUDA version, and available VRAM before starting. Fine-tuning requires specific hardware depending on the method.
| Method | Min VRAM | GPU | Cost Estimate |
|---|---|---|---|
| LoRA fine-tune | 24 GB | RTX 4090, A5000 | $50–150 |
| Full fine-tune | 40 GB | A100 40GB | $150–300 |
| Full fine-tune (multi-GPU) | 2x 40 GB | 2x A100 | $250–400 |
# Verify CUDA and GPU
nvidia-smi
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB')"

# Verify CUDA version (need 12.1+)
nvcc --version
Budget tip: Use Vast.ai or Lambda Labs for cloud GPU rental. An A100 80GB costs $1.50–2.50/hr. A typical LoRA fine-tuning run takes 6–12 hours, so budget $10–30 for the GPU alone.
2
Install OpenVLA Dependencies
Clone OpenVLA and install all required packages. We recommend a clean virtual environment.
3
Convert Your Dataset to OpenVLA Format
OpenVLA expects data in a specific format. Convert your RLDS dataset or LeRobot dataset to the OpenVLA training format.
# Convert your RLDS dataset to OpenVLA format
python3 scripts/convert_rlds_to_openvla.py \
--input-dir=~/datasets/openarm_pick_place_rlds \
--output-dir=~/datasets/openarm_openvla \
--image-key="observation/image" \
--state-key="observation/state" \
--action-key="action" \
    --language-key="language_instruction"

# Verify the converted dataset
python3 -c "
import json, os
meta = json.load(open(os.path.expanduser('~/datasets/openarm_openvla/metadata.json')))
print(f'Episodes: {meta[\"num_episodes\"]}')
print(f'Total steps: {meta[\"total_steps\"]}')
print(f'Action dim: {meta[\"action_dim\"]}')
print(f'Task: {meta[\"task_description\"]}')
"
Data format check: OpenVLA expects images as 224x224 RGB (it resizes internally), actions as 7D continuous vectors (6 DOF + gripper), and a text language instruction per episode. Verify these dimensions before training — shape mismatches are the most common training failure.
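A quick shape check before launching training can catch these mismatches early. The sketch below assumes a hypothetical per-episode dict with `images`, `actions`, and `language_instruction` keys — adapt the key names to whatever your converter actually emits.

```python
import numpy as np

def check_episode(episode, image_size=224, action_dim=7):
    """Return a list of problems found in one episode dict.

    Assumes a hypothetical layout: 'images' as (T, H, W, 3) uint8,
    'actions' as (T, action_dim) float, plus a 'language_instruction'
    string. Adjust the keys to your converter's actual output.
    """
    problems = []
    imgs = np.asarray(episode["images"])
    acts = np.asarray(episode["actions"])
    if imgs.ndim != 4 or imgs.shape[1:] != (image_size, image_size, 3):
        problems.append(f"bad image shape {imgs.shape}")
    if acts.ndim != 2 or acts.shape[1] != action_dim:
        problems.append(f"bad action shape {acts.shape}, want (T, {action_dim})")
    if len(imgs) != len(acts):
        problems.append(f"length mismatch: {len(imgs)} images vs {len(acts)} actions")
    if np.isnan(acts).any():
        problems.append("NaN values in actions")
    if not episode.get("language_instruction"):
        problems.append("missing language instruction")
    return problems

# A well-formed 50-step episode passes with no problems:
ep = {
    "images": np.zeros((50, 224, 224, 3), dtype=np.uint8),
    "actions": np.zeros((50, 7), dtype=np.float32),
    "language_instruction": "Pick up the red block",
}
print(check_episode(ep))  # []
```

Run a loop like this over every episode before training; a single episode with a 6D action vector or a missing instruction is enough to crash or silently degrade a run.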
4
Configure Training
Create a training configuration YAML. These hyperparameters are tuned for a typical single-task fine-tuning run.
# Save as configs/finetune_openarm.yaml
cat > configs/finetune_openarm.yaml << 'EOF'
# OpenVLA Fine-Tuning Config — OpenArm Pick & Place
model:
  pretrained_checkpoint: checkpoints/openvla-7b
  use_lora: true            # Set false for full fine-tune
  lora_rank: 32
  lora_alpha: 64
  lora_dropout: 0.05

data:
  dataset_path: ~/datasets/openarm_openvla
  task_description: "Pick up the red block and place it on the green target zone"
  image_size: 224
  action_dim: 7             # 6 joints + 1 gripper
  train_split: 0.9
  shuffle: true
  num_workers: 4

training:
  learning_rate: 2.0e-5     # Standard for LoRA VLA fine-tuning
  weight_decay: 0.01
  warmup_steps: 100
  max_steps: 5000           # ~10 epochs over 500 episodes
  batch_size: 8             # Reduce to 4 if OOM on 24GB
  gradient_accumulation_steps: 4
  fp16: true
  save_steps: 500
  eval_steps: 250
  logging_steps: 10

wandb:
  project: "openvla-finetune"
  run_name: "openarm-pick-place-lora-r32"
EOF
Hyperparameter guidance: Learning rate 2e-5 is a safe default for LoRA. For full fine-tuning, use 1e-5. If training loss does not decrease after 500 steps, try 5e-5. Batch size 8 with gradient accumulation 4 gives an effective batch size of 32, which works well for 300–500 episode datasets.
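The batch-size arithmetic above is easy to sanity-check before committing GPU hours. This small helper computes the effective batch size and epoch count from the config values; the 100 transitions-per-episode figure is an illustrative placeholder, and the true epoch count depends strongly on your average episode length.

```python
def training_budget(num_episodes, avg_steps_per_episode, batch_size,
                    grad_accum, max_steps):
    """Derive effective batch size, optimizer steps per epoch, and
    total epochs from the training config values."""
    effective_batch = batch_size * grad_accum
    samples = num_episodes * avg_steps_per_episode
    steps_per_epoch = samples / effective_batch
    epochs = max_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs

# Config values from finetune_openarm.yaml; 100 steps/episode is assumed.
eb, spe, epochs = training_budget(
    num_episodes=500, avg_steps_per_episode=100,
    batch_size=8, grad_accum=4, max_steps=5000)
print(eb)      # 32
print(epochs)  # 3.2
```

With longer episodes, 5,000 steps covers fewer epochs, which is one reason to watch the validation loss rather than trusting an epoch estimate.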
5
Launch Fine-Tuning
Start the training run. For a single GPU use python3; for multi-GPU use torchrun.
6
Monitor Training
Watch the loss curve and gradient norms in Weights & Biases. Here is what to look for:
Loss should decrease steadily for the first 1,000–2,000 steps, then plateau
Gradient norm should stay between 0.1 and 10.0 — spikes above 50 indicate instability
Checkpoints are saved every 500 steps — you can resume from any checkpoint if training crashes
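The gradient-norm heuristics above can be automated with a few lines of plain Python. This monitor is a sketch using this guide's rule-of-thumb thresholds (healthy range roughly 0.1–10, spikes above 50), not anything built into OpenVLA or wandb.

```python
from collections import deque

class GradNormMonitor:
    """Classify logged gradient norms against rule-of-thumb thresholds.

    Thresholds follow this guide's heuristics: norms between `low` and
    `high` are healthy; anything above `spike` suggests instability.
    """
    def __init__(self, low=0.1, high=10.0, spike=50.0, window=100):
        self.low, self.high, self.spike = low, high, spike
        self.history = deque(maxlen=window)  # recent norms for inspection

    def update(self, grad_norm):
        self.history.append(grad_norm)
        if grad_norm > self.spike:
            return "spike"          # consider lowering LR or clipping grads
        if not (self.low <= grad_norm <= self.high):
            return "out_of_range"   # worth watching, not necessarily fatal
        return "ok"

mon = GradNormMonitor()
print(mon.update(1.3))   # ok
print(mon.update(72.0))  # spike
```

Feeding each logged norm through a check like this in your training callback makes instability visible immediately instead of hours later in a wandb plot.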
# Open wandb dashboard in browser
wandb open

# Or check training logs directly
tail -f runs/openarm-pick-place-v1/training.log

# List saved checkpoints
ls -la runs/openarm-pick-place-v1/checkpoints/
If loss plateaus above 1.0: Your dataset may have quality issues. Check for mislabeled episodes, inconsistent action scales, or corrupt images. Try increasing the learning rate to 5e-5 or reducing batch size to 4.
7
Evaluate Checkpoint
Run rollout evaluation on your robot (or in simulation) to measure the fine-tuned model's success rate.
# Evaluate the best checkpoint (lowest val loss)
python3 scripts/evaluate.py \
--checkpoint=runs/openarm-pick-place-v1/checkpoints/step-4500 \
--robot-type=openarm \
--robot-port=/dev/ttyUSB0 \
--num-rollouts=20 \
    --task="Pick up the red block and place it on the green target zone"

# Expected output:
# Rollout 1/20: SUCCESS (14.2s)
# Rollout 2/20: SUCCESS (12.8s)
# Rollout 3/20: FAILURE — dropped object at step 45
# ...
# Success rate: 14/20 (70.0%)
# Avg completion time: 13.5s
Success rate benchmarks: 60–80% on your first fine-tuning run is a strong result. Below 50% suggests data quality issues or too few episodes. Above 80% means the model is ready for deployment, with further gains coming from iterative data collection (step 10).
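Keep in mind that 20 rollouts leave wide error bars on any measured success rate. A Wilson score interval (standard statistics, not part of the evaluation script) quantifies this: 14/20 successes is consistent with a true success rate anywhere from roughly 48% to 85%.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion,
    e.g. a rollout success rate estimated from few trials."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(14, 20)
print(f"70% observed, 95% CI: {lo:.0%}-{hi:.0%}")  # 70% observed, 95% CI: 48%-85%
```

If two checkpoints score 13/20 and 15/20, their intervals overlap heavily; run more rollouts before concluding one is better.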
8
Quantize for Deployment
Full-precision VLA models run at 2–5 Hz, which is too slow for real-time control. INT4 quantization gets you to 10–25 Hz.
# Quantize to INT4 using bitsandbytes
python3 scripts/quantize.py \
--checkpoint=runs/openarm-pick-place-v1/checkpoints/step-4500 \
--quantization=int4 \
--output-dir=runs/openarm-pick-place-v1/quantized
# Benchmark inference speed
python3 scripts/benchmark_inference.py \
--checkpoint=runs/openarm-pick-place-v1/quantized \
--num-steps=100
# Expected output:
# Model size: 4.2 GB (down from 14.8 GB)
# Inference speed: 18.3 Hz (±1.2 Hz)
# Latency: 54.6 ms per step
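The latency and rate numbers above are two views of the same quantity, and it is worth checking that your measured latency actually leaves headroom at your target control rate. This small helper (illustrative, not part of the benchmark script) does the arithmetic.

```python
def control_budget(latency_ms, target_hz):
    """Convert per-step inference latency to an achievable rate and
    report the slack left in each control cycle at target_hz."""
    achievable_hz = 1000.0 / latency_ms
    period_ms = 1000.0 / target_hz
    headroom_ms = period_ms - latency_ms  # time left for I/O, preprocessing
    return achievable_hz, headroom_ms

# Benchmark numbers from above: 54.6 ms/step, targeting 15 Hz control
hz, headroom = control_budget(54.6, 15)
print(f"{hz:.1f} Hz achievable, {headroom:.1f} ms headroom per cycle")
```

A negative headroom means the control loop will miss cycles: either lower the control rate or quantize more aggressively.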
| Precision | Model Size | Inference Hz | Min GPU |
|---|---|---|---|
| FP16 (baseline) | 14.8 GB | 3–5 Hz | 24 GB VRAM |
| INT8 | 7.4 GB | 8–15 Hz | 12 GB VRAM |
| INT4 | 4.2 GB | 15–25 Hz | 8 GB VRAM |
9
Deploy on Robot via ROS2
Create a ROS2 action server that runs the quantized VLA model and sends commands to the robot at 10–20 Hz.
# Create a ROS2 package for the VLA inference node
cd ~/ros2_ws/src
ros2 pkg create --build-type ament_python vla_inference \
--dependencies rclpy sensor_msgs std_msgs
# Copy the inference script (simplified version below)
cat > ~/ros2_ws/src/vla_inference/vla_inference/inference_node.py << 'PYEOF'
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from std_msgs.msg import String
import numpy as np
import torch
from openvla import OpenVLAModel


class VLAInferenceNode(Node):
    def __init__(self):
        super().__init__("vla_inference")
        self.model = OpenVLAModel.from_pretrained(
            "runs/openarm-pick-place-v1/quantized",
            torch_dtype=torch.float16,
        )
        self.model.eval()
        self.image_sub = self.create_subscription(
            Image, "/wrist_camera/image_raw", self.image_cb, 10)
        self.action_pub = self.create_publisher(
            JointState, "/target_joint_positions", 10)
        self.task = "Pick up the red block and place it on the green target zone"
        self.latest_image = None
        self.timer = self.create_timer(0.066, self.inference_loop)  # ~15 Hz
        self.get_logger().info("VLA inference node started")

    def image_cb(self, msg):
        # Assumes the camera driver publishes rgb8 images
        self.latest_image = np.frombuffer(
            msg.data, dtype=np.uint8).reshape(msg.height, msg.width, 3)

    def inference_loop(self):
        if self.latest_image is None:
            return
        action = self.model.predict(self.latest_image, self.task)
        msg = JointState()
        msg.position = action[:6].tolist()
        msg.effort = [action[6]]  # gripper command
        self.action_pub.publish(msg)


def main():
    rclpy.init()
    node = VLAInferenceNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()
PYEOF
# Build and run the inference node
cd ~/ros2_ws
colcon build --packages-select vla_inference
source install/setup.bash
# Launch robot + cameras + VLA inference
ros2 launch openarm_bringup openarm.launch.py &
ros2 run vla_inference inference_node
10
Iterate: Collect More Data, Retrain
Your first model will have failure modes. The most effective way to improve is to collect targeted demonstrations on the failure cases and retrain.
# 1. Identify failure modes from evaluation rollouts
# Common failures: objects at edge positions, unusual orientations,
# lighting changes, objects the model hasn't seen

# 2. Collect 50-100 targeted demonstrations for failure cases
lerobot record \
--robot-type=openarm \
--task="pick_red_block_edge_positions" \
--num-episodes=50 \
--output-dir=~/datasets/openarm_pick_place_v2
# 3. Merge with original dataset
python3 scripts/merge_datasets.py \
--datasets ~/datasets/openarm_pick_place ~/datasets/openarm_pick_place_v2 \
--output ~/datasets/openarm_pick_place_merged
# 4. Fine-tune again (from the best previous checkpoint)
python3 scripts/finetune.py \
--config=configs/finetune_openarm.yaml \
--resume-from=runs/openarm-pick-place-v1/checkpoints/step-4500 \
--output-dir=runs/openarm-pick-place-v2 \
--data.dataset_path=~/datasets/openarm_pick_place_merged
Iteration loop
A DAgger-style (Dataset Aggregation) loop reliably improves policy performance with each cycle: Deploy → Identify failures → Collect targeted demos → Retrain → Evaluate. Most teams see a 5–15 percentage-point success-rate improvement per iteration, reaching 85–95% within 3–4 cycles.
Troubleshooting
CUDA Out of Memory (OOM) during training
Reduce batch_size to 4 or 2. Enable gradient checkpointing by adding gradient_checkpointing: true to your config. For LoRA, reduce lora_rank from 32 to 16. As a last resort, use CPU offloading with DeepSpeed ZeRO-3.
Training loss is NaN
This usually means a learning rate that is too high or corrupt data. Reduce learning rate to 1e-5. Check for NaN values in your dataset with np.isnan(actions).any(). Enable fp32 instead of fp16 to debug numerical issues.
Model produces constant/identical actions
The model may have collapsed to predicting the mean action. Verify your action space is normalized (zero mean, unit variance). Check that the task description matches what was used during data collection. Try increasing training steps or learning rate.
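Normalization is straightforward to verify and fix offline. The sketch below z-scores each action dimension and keeps the statistics so actions can be un-normalized at inference time; it is a generic recipe, not a specific OpenVLA utility.

```python
import numpy as np

def normalize_actions(actions, eps=1e-8):
    """Per-dimension z-score normalization of an (N, action_dim) array.

    Returns the normalized actions plus the mean/std, which must be
    saved: at inference you un-normalize with pred * std + mean.
    """
    mean = actions.mean(axis=0)
    std = actions.std(axis=0) + eps  # eps guards against constant dims
    return (actions - mean) / std, mean, std

# Synthetic example: 1000 actions with nonzero mean and scale 2.0
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.5] * 6 + [0.0], scale=2.0, size=(1000, 7))
norm, mean, std = normalize_actions(raw)
print(np.allclose(norm.mean(axis=0), 0, atol=1e-6))  # True
print(np.allclose(norm.std(axis=0), 1, atol=1e-3))   # True
```

If your stored dataset fails this check (means far from zero or wildly different scales per joint), normalize it and retrain; a policy regressing raw joint angles in radians next to a 0/1 gripper signal often collapses to the mean.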
Inference is too slow for real-time control
Apply INT4 quantization (step 8). Use torch.compile() for 20–30% speedup. Consider running inference on a separate machine and streaming actions over the network. Minimum target is 10 Hz for arm manipulation.
Quantized model has much lower success rate
Some quality loss is expected (1–5% drop). If the drop is more than 10%, try INT8 instead of INT4. Use calibration data during quantization: --calibration-data=~/datasets/openarm_openvla. AWQ quantization often preserves more quality than GPTQ for VLA models.
Frequently Asked Questions

How much does fine-tuning cost?
A typical fine-tuning run costs $150–400 in GPU compute on cloud providers like Lambda Labs or Vast.ai. This assumes 1–2 A100 GPUs for 12–48 hours. Using LoRA reduces this to $50–150 by requiring less memory and fewer training steps.

What are OpenVLA and pi0?
OpenVLA is an open-source 7B-parameter Vision-Language-Action model from Stanford that takes camera images and language instructions as input and outputs robot actions. pi0 (from Physical Intelligence) is a flow-matching based VLA that excels at dexterous manipulation. OpenVLA is easier to fine-tune; pi0 often achieves higher success rates on complex tasks.

How many demonstrations do I need?
For a single task, 300–500 high-quality demonstrations are a good starting point. More complex tasks or multi-task fine-tuning may need 800–1,200+ episodes. Quality matters more than quantity — 300 clean demonstrations outperform 1,000 noisy ones.

Can I fine-tune on a consumer GPU?
Yes, using LoRA (Low-Rank Adaptation) you can fine-tune OpenVLA on an RTX 4090 (24 GB VRAM) or even an RTX 3090. Full fine-tuning requires 40–80 GB VRAM (A100 or multi-GPU setup). LoRA typically achieves 85–95% of full fine-tuning performance.

What success rate should I expect?
For in-distribution tasks (same objects, similar positions as training data), expect a 60–80% success rate on your first fine-tuning run. With iterative data collection and retraining (DAgger-style), you can push this to 85–95%. Out-of-distribution generalization varies widely.