Step-by-step guide to fine-tuning OpenVLA or pi0 on your teleoperation dataset — data prep, training configuration, evaluation, quantization, and deployment on real hardware.
You will take a pre-trained VLA model (OpenVLA-7B), fine-tune it on your robot's demonstration data, evaluate it with rollout experiments, quantize it for real-time inference, and deploy it as a ROS2 action server on your robot. Expect 60–80% success rate on in-distribution tasks on your first run, improving with iterative data collection.
VLA Fine-Tuning Pipeline
RLDS Dataset (300+ episodes) → OpenVLA / pi0 (pre-trained, 7B) → Fine-Tune (LoRA or full) → Evaluate (rollout tests) → Deploy (ROS2 + INT4)
1
Check Prerequisites and GPU Setup
Verify your GPU, CUDA version, and available VRAM before starting. Fine-tuning requires specific hardware depending on the method.
| Method | Min VRAM | GPU | Cost Estimate |
|---|---|---|---|
| LoRA fine-tune | 24 GB | RTX 4090, A5000 | $50–150 |
| Full fine-tune | 40 GB | A100 40GB | $150–300 |
| Full fine-tune (multi-GPU) | 2x 40 GB | 2x A100 | $250–400 |
# Verify CUDA and GPU
nvidia-smi
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB')"

# Verify CUDA version (need 12.1+)
nvcc --version
Budget tip: Use Vast.ai or Lambda Labs for cloud GPU rental. An A100 80GB costs $1.50–2.50/hr. A typical LoRA fine-tuning run takes 6–12 hours, so budget $10–30 for the GPU alone.
2
Install OpenVLA Dependencies
Clone OpenVLA and install all required packages. We recommend a clean virtual environment.
3
Convert Your Dataset to OpenVLA Format
OpenVLA expects data in a specific format. Convert your RLDS dataset or LeRobot dataset to the OpenVLA training format.
# Convert your RLDS dataset to OpenVLA format
python3 scripts/convert_rlds_to_openvla.py \
--input-dir=~/datasets/openarm_pick_place_rlds \
--output-dir=~/datasets/openarm_openvla \
--image-key="observation/image" \
--state-key="observation/state" \
--action-key="action" \
    --language-key="language_instruction"

# Verify the converted dataset
python3 -c "
import json, os
meta = json.load(open(os.path.expanduser('~/datasets/openarm_openvla/metadata.json')))
print(f'Episodes: {meta[\"num_episodes\"]}')
print(f'Total steps: {meta[\"total_steps\"]}')
print(f'Action dim: {meta[\"action_dim\"]}')
print(f'Task: {meta[\"task_description\"]}')
"
Data format check: OpenVLA expects images as 224x224 RGB (it resizes internally), actions as 7D continuous vectors (6 DOF + gripper), and a text language instruction per episode. Verify these dimensions before training — shape mismatches are the most common training failure.
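A quick shape check before launching training can catch these mismatches early. The sketch below assumes a hypothetical per-episode dict with `images`, `actions`, and `language_instruction` keys — adapt the key names to whatever your converter actually emits.

```python
import numpy as np

def check_episode(episode, image_size=224, action_dim=7):
    """Return a list of problems found in one episode dict.

    Assumes a hypothetical layout: 'images' as (T, H, W, 3) uint8,
    'actions' as (T, action_dim) float, plus a 'language_instruction'
    string. Adjust the keys to your converter's actual output.
    """
    problems = []
    imgs = np.asarray(episode["images"])
    acts = np.asarray(episode["actions"])
    if imgs.ndim != 4 or imgs.shape[1:] != (image_size, image_size, 3):
        problems.append(f"bad image shape {imgs.shape}")
    if acts.ndim != 2 or acts.shape[1] != action_dim:
        problems.append(f"bad action shape {acts.shape}, want (T, {action_dim})")
    if len(imgs) != len(acts):
        problems.append(f"length mismatch: {len(imgs)} images vs {len(acts)} actions")
    if np.isnan(acts).any():
        problems.append("NaN values in actions")
    if not episode.get("language_instruction"):
        problems.append("missing language instruction")
    return problems

# A well-formed 50-step episode passes with no problems:
ep = {
    "images": np.zeros((50, 224, 224, 3), dtype=np.uint8),
    "actions": np.zeros((50, 7), dtype=np.float32),
    "language_instruction": "Pick up the red block",
}
print(check_episode(ep))  # []
```

Run a loop like this over every episode before training; a single episode with a 6D action vector or a missing instruction is enough to crash or silently degrade a run.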
4
Configure Training
Create a training configuration YAML. These hyperparameters are tuned for a typical single-task fine-tuning run.
# Save as configs/finetune_openarm.yaml
cat > configs/finetune_openarm.yaml << 'EOF'
# OpenVLA Fine-Tuning Config — OpenArm Pick & Place
model:
  pretrained_checkpoint: checkpoints/openvla-7b
  use_lora: true            # Set false for full fine-tune
  lora_rank: 32
  lora_alpha: 64
  lora_dropout: 0.05

data:
  dataset_path: ~/datasets/openarm_openvla
  task_description: "Pick up the red block and place it on the green target zone"
  image_size: 224
  action_dim: 7             # 6 joints + 1 gripper
  train_split: 0.9
  shuffle: true
  num_workers: 4

training:
  learning_rate: 2.0e-5     # Standard for LoRA VLA fine-tuning
  weight_decay: 0.01
  warmup_steps: 100
  max_steps: 5000           # ~10 epochs over 500 episodes
  batch_size: 8             # Reduce to 4 if OOM on 24GB
  gradient_accumulation_steps: 4
  fp16: true
  save_steps: 500
  eval_steps: 250
  logging_steps: 10

wandb:
  project: "openvla-finetune"
  run_name: "openarm-pick-place-lora-r32"
EOF
Hyperparameter guidance: Learning rate 2e-5 is a safe default for LoRA. For full fine-tuning, use 1e-5. If training loss does not decrease after 500 steps, try 5e-5. Batch size 8 with gradient accumulation 4 gives an effective batch size of 32, which works well for 300–500 episode datasets.
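The batch-size arithmetic above is easy to sanity-check before committing GPU hours. This small helper computes the effective batch size and epoch count from the config values; the 100 transitions-per-episode figure is an illustrative placeholder, and the true epoch count depends strongly on your average episode length.

```python
def training_budget(num_episodes, avg_steps_per_episode, batch_size,
                    grad_accum, max_steps):
    """Derive effective batch size, optimizer steps per epoch, and
    total epochs from the training config values."""
    effective_batch = batch_size * grad_accum
    samples = num_episodes * avg_steps_per_episode
    steps_per_epoch = samples / effective_batch
    epochs = max_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs

# Config values from finetune_openarm.yaml; 100 steps/episode is assumed.
eb, spe, epochs = training_budget(
    num_episodes=500, avg_steps_per_episode=100,
    batch_size=8, grad_accum=4, max_steps=5000)
print(eb)      # 32
print(epochs)  # 3.2
```

With longer episodes, 5,000 steps covers fewer epochs, which is one reason to watch the validation loss rather than trusting an epoch estimate.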
5
Launch Fine-Tuning
Start the training run. For a single GPU use python3; for multi-GPU use torchrun.
6
Monitor Training
Watch the loss curve and gradient norms in Weights & Biases. Here is what to look for:
Loss should decrease steadily for the first 1,000–2,000 steps, then plateau
Gradient norm should stay between 0.1 and 10.0 — spikes above 50 indicate instability
Checkpoints are saved every 500 steps — you can resume from any checkpoint if training crashes
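The gradient-norm heuristics above can be automated with a few lines of plain Python. This monitor is a sketch using this guide's rule-of-thumb thresholds (healthy range roughly 0.1–10, spikes above 50), not anything built into OpenVLA or wandb.

```python
from collections import deque

class GradNormMonitor:
    """Classify logged gradient norms against rule-of-thumb thresholds.

    Thresholds follow this guide's heuristics: norms between `low` and
    `high` are healthy; anything above `spike` suggests instability.
    """
    def __init__(self, low=0.1, high=10.0, spike=50.0, window=100):
        self.low, self.high, self.spike = low, high, spike
        self.history = deque(maxlen=window)  # recent norms for inspection

    def update(self, grad_norm):
        self.history.append(grad_norm)
        if grad_norm > self.spike:
            return "spike"          # consider lowering LR or clipping grads
        if not (self.low <= grad_norm <= self.high):
            return "out_of_range"   # worth watching, not necessarily fatal
        return "ok"

mon = GradNormMonitor()
print(mon.update(1.3))   # ok
print(mon.update(72.0))  # spike
```

Feeding each logged norm through a check like this in your training callback makes instability visible immediately instead of hours later in a wandb plot.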
# Open wandb dashboard in browser
wandb open

# Or check training logs directly
tail -f runs/openarm-pick-place-v1/training.log

# List saved checkpoints
ls -la runs/openarm-pick-place-v1/checkpoints/
If loss plateaus above 1.0: Your dataset may have quality issues. Check for mislabeled episodes, inconsistent action scales, or corrupt images. Try increasing the learning rate to 5e-5 or reducing batch size to 4.
7
Evaluate Checkpoint
Run rollout evaluation on your robot (or in simulation) to measure the fine-tuned model's success rate.
# Evaluate the best checkpoint (lowest val loss)
python3 scripts/evaluate.py \
--checkpoint=runs/openarm-pick-place-v1/checkpoints/step-4500 \
--robot-type=openarm \
--robot-port=/dev/ttyUSB0 \
--num-rollouts=20 \
    --task="Pick up the red block and place it on the green target zone"

# Expected output:
# Rollout 1/20: SUCCESS (14.2s)
# Rollout 2/20: SUCCESS (12.8s)
# Rollout 3/20: FAILURE — dropped object at step 45
# ...
# Success rate: 14/20 (70.0%)
# Avg completion time: 13.5s
Success rate benchmarks: 60–80% on your first fine-tuning run is a strong result. Below 50% suggests data quality issues or too few episodes. Above 80% means the model is ready for deployment, with further gains coming from iterative data collection (step 10).
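Keep in mind that 20 rollouts leave wide error bars on any measured success rate. A Wilson score interval (standard statistics, not part of the evaluation script) quantifies this: 14/20 successes is consistent with a true success rate anywhere from roughly 48% to 85%.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion,
    e.g. a rollout success rate estimated from few trials."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(14, 20)
print(f"70% observed, 95% CI: {lo:.0%}-{hi:.0%}")  # 70% observed, 95% CI: 48%-85%
```

If two checkpoints score 13/20 and 15/20, their intervals overlap heavily; run more rollouts before concluding one is better.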
8
Quantize for Deployment
Full-precision VLA models run at 2–5 Hz, which is too slow for real-time control. INT4 quantization gets you to 10–25 Hz.
# Quantize to INT4 using bitsandbytes
python3 scripts/quantize.py \
--checkpoint=runs/openarm-pick-place-v1/checkpoints/step-4500 \
--quantization=int4 \
--output-dir=runs/openarm-pick-place-v1/quantized
# Benchmark inference speed
python3 scripts/benchmark_inference.py \
--checkpoint=runs/openarm-pick-place-v1/quantized \
--num-steps=100
# Expected output:
# Model size: 4.2 GB (down from 14.8 GB)
# Inference speed: 18.3 Hz (±1.2 Hz)
# Latency: 54.6 ms per step
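The latency and rate numbers above are two views of the same quantity, and it is worth checking that your measured latency actually leaves headroom at your target control rate. This small helper (illustrative, not part of the benchmark script) does the arithmetic.

```python
def control_budget(latency_ms, target_hz):
    """Convert per-step inference latency to an achievable rate and
    report the slack left in each control cycle at target_hz."""
    achievable_hz = 1000.0 / latency_ms
    period_ms = 1000.0 / target_hz
    headroom_ms = period_ms - latency_ms  # time left for I/O, preprocessing
    return achievable_hz, headroom_ms

# Benchmark numbers from above: 54.6 ms/step, targeting 15 Hz control
hz, headroom = control_budget(54.6, 15)
print(f"{hz:.1f} Hz achievable, {headroom:.1f} ms headroom per cycle")
```

A negative headroom means the control loop will miss cycles: either lower the control rate or quantize more aggressively.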
| Precision | Model Size | Inference Hz | Min GPU |
|---|---|---|---|
| FP16 (baseline) | 14.8 GB | 3–5 Hz | 24 GB VRAM |
| INT8 | 7.4 GB | 8–15 Hz | 12 GB VRAM |
| INT4 | 4.2 GB | 15–25 Hz | 8 GB VRAM |
9
Deploy on Robot via ROS2
Create a ROS2 action server that runs the quantized VLA model and sends commands to the robot at 10–20 Hz.
# Create a ROS2 package for the VLA inference node
cd ~/ros2_ws/src
ros2 pkg create --build-type ament_python vla_inference \
--dependencies rclpy sensor_msgs std_msgs
# Copy the inference script (simplified version below)
cat > ~/ros2_ws/src/vla_inference/vla_inference/inference_node.py << 'PYEOF'
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, JointState
from std_msgs.msg import String
import numpy as np
import torch
from openvla import OpenVLAModel


class VLAInferenceNode(Node):
    def __init__(self):
        super().__init__("vla_inference")
        self.model = OpenVLAModel.from_pretrained(
            "runs/openarm-pick-place-v1/quantized",
            torch_dtype=torch.float16,
        )
        self.model.eval()
        self.image_sub = self.create_subscription(
            Image, "/wrist_camera/image_raw", self.image_cb, 10)
        self.action_pub = self.create_publisher(
            JointState, "/target_joint_positions", 10)
        self.task = "Pick up the red block and place it on the green target zone"
        self.latest_image = None
        self.timer = self.create_timer(0.066, self.inference_loop)  # ~15 Hz
        self.get_logger().info("VLA inference node started")

    def image_cb(self, msg):
        # Assumes the camera driver publishes rgb8 images
        self.latest_image = np.frombuffer(
            msg.data, dtype=np.uint8).reshape(msg.height, msg.width, 3)

    def inference_loop(self):
        if self.latest_image is None:
            return
        action = self.model.predict(self.latest_image, self.task)
        msg = JointState()
        msg.position = action[:6].tolist()
        msg.effort = [action[6]]  # gripper command
        self.action_pub.publish(msg)


def main():
    rclpy.init()
    node = VLAInferenceNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()
PYEOF
# Build and run the inference node
cd ~/ros2_ws
colcon build --packages-select vla_inference
source install/setup.bash
# Launch robot + cameras + VLA inference
ros2 launch openarm_bringup openarm.launch.py &
ros2 run vla_inference inference_node
10
Iterate: Collect More Data, Retrain
Your first model will have failure modes. The most effective way to improve is to collect targeted demonstrations on the failure cases and retrain.
# 1. Identify failure modes from evaluation rollouts
# Common failures: objects at edge positions, unusual orientations,
# lighting changes, objects the model hasn't seen

# 2. Collect 50-100 targeted demonstrations for failure cases
lerobot record \
--robot-type=openarm \
--task="pick_red_block_edge_positions" \
--num-episodes=50 \
--output-dir=~/datasets/openarm_pick_place_v2
# 3. Merge with original dataset
python3 scripts/merge_datasets.py \
--datasets ~/datasets/openarm_pick_place ~/datasets/openarm_pick_place_v2 \
--output ~/datasets/openarm_pick_place_merged
# 4. Fine-tune again (from the best previous checkpoint)
python3 scripts/finetune.py \
--config=configs/finetune_openarm.yaml \
--resume-from=runs/openarm-pick-place-v1/checkpoints/step-4500 \
--output-dir=runs/openarm-pick-place-v2 \
--data.dataset_path=~/datasets/openarm_pick_place_merged
Iteration loop
A DAgger-style (Dataset Aggregation) loop reliably improves policy performance with each cycle: Deploy → Identify failures → Collect targeted demos → Retrain → Evaluate. Most teams see a 5–15 percentage-point success-rate improvement per iteration, reaching 85–95% within 3–4 cycles.
Troubleshooting
CUDA Out of Memory (OOM) during training
Reduce batch_size to 4 or 2. Enable gradient checkpointing by adding gradient_checkpointing: true to your config. For LoRA, reduce lora_rank from 32 to 16. As a last resort, use CPU offloading with DeepSpeed ZeRO-3.
Training loss is NaN
This usually means a learning rate that is too high or corrupt data. Reduce learning rate to 1e-5. Check for NaN values in your dataset with np.isnan(actions).any(). Enable fp32 instead of fp16 to debug numerical issues.
Model produces constant/identical actions
The model may have collapsed to predicting the mean action. Verify your action space is normalized (zero mean, unit variance). Check that the task description matches what was used during data collection. Try increasing training steps or learning rate.
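Normalization is straightforward to verify and fix offline. The sketch below z-scores each action dimension and keeps the statistics so actions can be un-normalized at inference time; it is a generic recipe, not a specific OpenVLA utility.

```python
import numpy as np

def normalize_actions(actions, eps=1e-8):
    """Per-dimension z-score normalization of an (N, action_dim) array.

    Returns the normalized actions plus the mean/std, which must be
    saved: at inference you un-normalize with pred * std + mean.
    """
    mean = actions.mean(axis=0)
    std = actions.std(axis=0) + eps  # eps guards against constant dims
    return (actions - mean) / std, mean, std

# Synthetic example: 1000 actions with nonzero mean and scale 2.0
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.5] * 6 + [0.0], scale=2.0, size=(1000, 7))
norm, mean, std = normalize_actions(raw)
print(np.allclose(norm.mean(axis=0), 0, atol=1e-6))  # True
print(np.allclose(norm.std(axis=0), 1, atol=1e-3))   # True
```

If your stored dataset fails this check (means far from zero or wildly different scales per joint), normalize it and retrain; a policy regressing raw joint angles in radians next to a 0/1 gripper signal often collapses to the mean.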
Inference is too slow for real-time control
Apply INT4 quantization (step 8). Use torch.compile() for 20–30% speedup. Consider running inference on a separate machine and streaming actions over the network. Minimum target is 10 Hz for arm manipulation.
Quantized model has much lower success rate
Some quality loss is expected (1–5% drop). If the drop is more than 10%, try INT8 instead of INT4. Use calibration data during quantization: --calibration-data=~/datasets/openarm_openvla. AWQ quantization often preserves more quality than GPTQ for VLA models.
Frequently Asked Questions

How much does fine-tuning cost?
A typical fine-tuning run costs $150–400 in GPU compute on cloud providers like Lambda Labs or Vast.ai. This assumes 1–2 A100 GPUs for 12–48 hours. Using LoRA reduces this to $50–150 by requiring less memory and fewer training steps.

What are OpenVLA and pi0?
OpenVLA is an open-source 7B-parameter Vision-Language-Action model from Stanford that takes camera images and language instructions as input and outputs robot actions. pi0 (from Physical Intelligence) is a flow-matching based VLA that excels at dexterous manipulation. OpenVLA is easier to fine-tune; pi0 often achieves higher success rates on complex tasks.

How many demonstrations do I need?
For a single task, 300–500 high-quality demonstrations are a good starting point. More complex tasks or multi-task fine-tuning may need 800–1,200+ episodes. Quality matters more than quantity — 300 clean demonstrations outperform 1,000 noisy ones.

Can I fine-tune on a consumer GPU?
Yes, using LoRA (Low-Rank Adaptation) you can fine-tune OpenVLA on an RTX 4090 (24 GB VRAM) or even an RTX 3090. Full fine-tuning requires 40–80 GB VRAM (A100 or multi-GPU setup). LoRA typically achieves 85–95% of full fine-tuning performance.

What success rate should I expect?
For in-distribution tasks (same objects, similar positions as training data), expect a 60–80% success rate on your first fine-tuning run. With iterative data collection and retraining (DAgger-style), you can push this to 85–95%. Out-of-distribution generalization varies widely.