Getting Started with World Models for Robot Learning: A Practical Guide

By Jerry Huang · April 22, 2026

This is a hands-on guide to training your first world model for a robotics task. We walk through the full pipeline: choosing a model, preparing data, training, evaluating imagination quality, and using the trained model for planning or policy improvement. Code examples center on Dreamer v3, with notes on TD-MPC2 and IRIS alternatives where they differ.

Prerequisites

Hardware

  • GPU: NVIDIA RTX 3090, RTX 4090, or A100. Dreamer v3 and TD-MPC2 train on a single consumer GPU. VRAM requirement: 12GB minimum for state-based tasks, 24GB recommended for vision-based tasks with 64x64 images. For 256x256 images, you need 24GB+ VRAM.
  • CPU/RAM: 8+ cores, 32GB+ RAM. Data loading is CPU-bound for large datasets with image observations. NVMe SSD storage is strongly recommended -- HDD-based data loading will bottleneck training.
  • Robot (optional for initial experiments): You can start with simulation environments (DMControl, Meta-World, Robosuite) that require no physical hardware. When ready for real-robot experiments, any arm with position/velocity control and at least one camera will work.

Software

  • Python 3.10+
  • JAX with CUDA (for Dreamer v3) or PyTorch 2.0+ (for TD-MPC2, IRIS)
  • Gymnasium (formerly OpenAI Gym) for simulation environments
  • Weights & Biases (optional but recommended for experiment tracking)

Data

You need trajectory data: sequences of (observation, action, reward, done) tuples recorded from your robot or simulation environment. For initial experiments, use a pre-existing dataset or collect data from a simulation environment. For real-robot world models, you need 50-500 episodes depending on task complexity. See our world models overview for detailed data requirements.

Step 1: Choose Your World Model

Use this decision tree to pick the right starting point for your project:

Decision Tree: Which World Model?

Q1: Is your action space continuous?

  • Yes -> Go to Q2
  • No (discrete actions) -> Consider IRIS (native discrete support) or Dreamer v3 (also supports discrete)

Q2: Do you need to change the reward/objective at deployment time?

  • Yes -> Use TD-MPC2 (online planning with flexible reward)
  • No -> Go to Q3

Q3: Do you need the fastest possible training iteration?

  • Yes -> Use TD-MPC2 (2-12h training on single GPU)
  • No -> Go to Q4

Q4: Does your task require planning 15+ steps ahead?

  • Yes -> Use Dreamer v3 (long imagination horizon with RSSM)
  • No -> Either Dreamer v3 or TD-MPC2 will work. Dreamer v3 is the safer default.

For the rest of this guide, we use Dreamer v3 as the primary example because it is the most widely applicable. We note TD-MPC2 and IRIS alternatives where they differ.

Step 2: Prepare Your Data

Episode Format

Dreamer v3 expects data as a directory of NPZ files, one per episode. Each NPZ file contains NumPy arrays with the following keys:

import numpy as np

# Example: save one episode with 200 timesteps
episode = {
    # Visual observations: (T, H, W, C), uint8, 0-255
    'image': np.random.randint(0, 256, (200, 64, 64, 3), dtype=np.uint8),
    # Proprioceptive state: (T, state_dim), float32
    # e.g., 7 joint positions + 7 joint velocities + 1 gripper state = 15
    'state': np.random.randn(200, 15).astype(np.float32),
    # Actions: (T, action_dim), float32
    # e.g., 7 joint velocity commands + 1 gripper command = 8
    'action': np.random.randn(200, 8).astype(np.float32),
    # Reward: (T,), float32
    'reward': np.zeros(200, dtype=np.float32),
    # Episode boundary flags: (T,), bool or float32.
    # is_first marks the first step, is_last marks the final step, and
    # is_terminal distinguishes true termination from a time-limit cutoff.
    'is_first': np.zeros(200, dtype=np.float32),
    'is_last': np.zeros(200, dtype=np.float32),
    'is_terminal': np.zeros(200, dtype=np.float32),
}
episode['is_first'][0] = 1.0
episode['is_last'][-1] = 1.0
episode['is_terminal'][-1] = 1.0  # True if task completed
episode['reward'][-1] = 1.0       # Sparse success reward
np.savez_compressed('episodes/episode_000.npz', **episode)
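
Before training, it is worth sanity-checking every episode file; malformed shapes or dtypes tend to fail deep inside the data loader with unhelpful errors. Below is a minimal validation sketch, assuming the directory layout and keys from the example above (the action-range check assumes you have already normalized actions as described later in this step):

import glob
import numpy as np

# Required keys from the episode format above (a sketch; adjust for your robot).
REQUIRED_KEYS = ['image', 'state', 'action', 'reward',
                 'is_first', 'is_last', 'is_terminal']

for path in sorted(glob.glob('episodes/*.npz')):
    ep = np.load(path)
    missing = [k for k in REQUIRED_KEYS if k not in ep]
    assert not missing, f'{path}: missing keys {missing}'
    T = len(ep['reward'])
    # All arrays must share the same time dimension.
    for k in REQUIRED_KEYS:
        assert len(ep[k]) == T, f'{path}: {k} has length {len(ep[k])}, expected {T}'
    # Images should be uint8; actions normalized to [-1, 1] (see Action Normalization).
    assert ep['image'].dtype == np.uint8, f'{path}: image dtype {ep["image"].dtype}'
    assert np.abs(ep['action']).max() <= 1.0 + 1e-6, f'{path}: actions outside [-1, 1]'
    # Exactly one episode start, flagged on the first step.
    assert ep['is_first'][0] == 1.0 and ep['is_first'].sum() == 1.0, path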

Observation Structure

Key decisions for observation representation:

  • Image resolution: Start with 64x64. This is sufficient for most tabletop manipulation tasks and trains 4x faster than 128x128. Only increase resolution if you observe that the model cannot distinguish task-relevant visual details (e.g., small objects, fine orientations) at 64x64.
  • Number of cameras: Start with one (overhead or wrist-mounted). Multi-camera setups can improve performance, but image data volume and per-step training cost grow roughly linearly with the number of cameras. Add cameras only after validating that a single camera is insufficient.
  • Proprioception: Always include joint positions, joint velocities, and gripper state. These provide precise state information that complements noisy visual observations. Normalize each dimension to approximately zero mean and unit variance (see the sketch after this list).
  • History: Dreamer v3's RSSM handles history internally via the recurrent state. You do not need to stack frames or provide observation history -- just the current timestep's observation.
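
For the proprioception normalization mentioned above, compute statistics once over the entire training set rather than per episode, and reuse the same statistics at deployment. A minimal sketch, assuming the NPZ episode format from this step:

import glob
import numpy as np

# Compute dataset-wide mean/std for the 'state' key, then normalize.
# Stats must come from the whole training set, not individual episodes,
# and the saved values must be reused unchanged at deployment time.
states = np.concatenate(
    [np.load(p)['state'] for p in glob.glob('episodes/*.npz')], axis=0)
state_mean = states.mean(axis=0)
state_std = states.std(axis=0) + 1e-6  # avoid division by zero

def normalize_state(state):
    """Map each proprioceptive dimension to ~zero mean, unit variance."""
    return (state - state_mean) / state_std

np.savez('state_stats.npz', mean=state_mean, std=state_std)  # reuse at deployment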

Action Normalization

This is a common source of training failures. All three world models expect actions in a standardized range:

import numpy as np

def normalize_actions(actions, action_low, action_high):
    """Normalize actions to [-1, 1] range."""
    midpoint = (action_high + action_low) / 2.0
    half_range = (action_high - action_low) / 2.0
    return (actions - midpoint) / half_range

def denormalize_actions(normalized_actions, action_low, action_high):
    """Convert [-1, 1] actions back to original range."""
    midpoint = (action_high + action_low) / 2.0
    half_range = (action_high - action_low) / 2.0
    return normalized_actions * half_range + midpoint

# Example: 7-DOF arm with joint velocity limits +/- 1.0 rad/s
# and gripper command 0.0 (open) to 1.0 (closed)
action_low = np.array([-1.0] * 7 + [0.0])
action_high = np.array([1.0] * 7 + [1.0])

raw_actions = load_recorded_actions()  # shape: (T, 8)
normalized = normalize_actions(raw_actions, action_low, action_high)

Dreamer v3 clips actions to [-1, 1] internally. If your raw actions are already in this range, no normalization is needed. TD-MPC2 also expects [-1, 1] normalized actions. IRIS discretizes actions into bins, so normalization is handled by the binning scheme.

Converting from LeRobot Format

If your data is in HuggingFace LeRobot format, use this conversion script:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from PIL import Image
import numpy as np
import os

dataset = LeRobotDataset("your-username/your-dataset")
os.makedirs("dreamer_episodes", exist_ok=True)

# Group by episode
episode_indices = dataset.hf_dataset.unique("episode_index")
for ep_idx in episode_indices:
    ep_data = dataset.hf_dataset.filter(
        lambda x: x["episode_index"] == ep_idx)

    # Extract observations
    images = []
    for row in ep_data:
        img = row["observation.images.top"]  # PIL Image
        img = img.resize((64, 64))
        images.append(np.array(img))

    episode = {
        "image": np.stack(images).astype(np.uint8),
        "state": np.stack([row["observation.state"]
                           for row in ep_data]).astype(np.float32),
        "action": np.stack([row["action"]
                            for row in ep_data]).astype(np.float32),
        "reward": np.array([row.get("next.reward", 0.0)
                            for row in ep_data], dtype=np.float32),
        "is_first": np.zeros(len(ep_data), dtype=np.float32),
        "is_last": np.zeros(len(ep_data), dtype=np.float32),
        "is_terminal": np.zeros(len(ep_data), dtype=np.float32),
    }
    episode["is_first"][0] = 1.0
    episode["is_last"][-1] = 1.0
    np.savez_compressed(f"dreamer_episodes/episode_{ep_idx:04d}.npz",
                        **episode)

print(f"Converted {len(episode_indices)} episodes")

Step 3: Train

Installing Dreamer v3

# Install JAX with CUDA support
pip install --upgrade "jax[cuda12]" jaxlib

# Clone Dreamer v3
git clone https://github.com/danijar/dreamerv3.git
cd dreamerv3
pip install -e .

# Verify GPU is available
python -c "import jax; print(jax.devices())"
# Should show: [CudaDevice(id=0)]

Configuration

Create a config file for your custom robot task. Dreamer v3 uses YAML configs with sensible defaults that you override:

# config/robot_manipulation.yaml

# Data
data_dir: /path/to/dreamer_episodes/
replay_size: 1e6      # Replay buffer size in transitions
batch_size: 16        # Batch size for training
batch_length: 50      # Subsequence length for training

# World model
rssm:
  deter: 4096         # Deterministic state size (GRU hidden)
  stoch: 32           # Number of categorical variables
  classes: 32         # Classes per categorical variable
  units: 1024         # MLP hidden units
encoder:
  mlp_keys: 'state'   # Which observation keys use MLP encoder
  cnn_keys: 'image'   # Which observation keys use CNN encoder
  cnn_depth: 48       # CNN channel multiplier
decoder:
  mlp_keys: 'state'
  cnn_keys: 'image'
  cnn_depth: 48

# Actor-critic
actor:
  layers: 4           # MLP layers for actor
  units: 640
  dist: trunc_normal  # Action distribution (continuous)
critic:
  layers: 4
  units: 640

# Training
steps: 1e6            # Total training steps
log_every: 1e4        # Log metrics every N steps
eval_every: 1e5       # Run evaluation every N steps
train_ratio: 512      # Imagination steps per real data step
imag_horizon: 15      # Imagination rollout length

# Optimization
model_lr: 1e-4
actor_lr: 3e-5
critic_lr: 3e-5

Launch Training

# Train from offline data
python dreamerv3/main.py \
  --configs defaults \
  --config config/robot_manipulation.yaml \
  --logdir ./logdir/robot_v1 \
  --steps 1000000

# Monitor with TensorBoard
tensorboard --logdir ./logdir/robot_v1

TD-MPC2 alternative:

# Install
pip install tdmpc2

# Train on your data (HDF5 format)
python train.py \
  task=custom \
  data_dir=/path/to/hdf5_episodes/ \
  model_size=19 \
  steps=500000 \
  batch_size=256

What to Watch During Training

Monitor these metrics in TensorBoard or W&B; a scripted sanity check follows the list:

  • Image reconstruction loss (decoder_image): Should decrease steadily. If it plateaus high, the encoder/decoder capacity is insufficient -- increase cnn_depth.
  • KL divergence (kl_loss): Should converge to a moderate value (typically 1-10 nats). If it drops to near zero, the model is ignoring the stochastic latent -- increase the KL free bits or decrease the KL scale. If it is very high (>50), the dynamics model is struggling to predict the posterior -- the task may need more training steps.
  • Reward prediction loss (decoder_reward): Should decrease, especially for tasks with sparse rewards where the model needs to learn which states are terminal/successful.
  • Actor loss and critic loss: These should decrease over time. If the actor loss is erratic, the imagination rollouts may be unreliable -- check the world model losses first.
  • Imagined return: The average return of trajectories imagined by rolling out the actor in the world model. Should increase as the policy improves. If it increases rapidly but real evaluation performance does not improve, the policy may be exploiting world model inaccuracies.
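
If you want automated checks rather than dashboard eyeballing, the logged scalars can be read directly from the TensorBoard event files with TensorBoard's EventAccumulator. A sketch is below; the tag name 'kl_loss' mirrors the metric names above and is an assumption -- inspect acc.Tags() to find the names your build actually logs:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
import numpy as np

# Load scalar logs written during training. The 'kl_loss' tag is an
# assumption based on the metric names above; check acc.Tags() for yours.
acc = EventAccumulator('./logdir/robot_v1')
acc.Reload()

kl = np.array([e.value for e in acc.Scalars('kl_loss')])
recent = kl[-100:].mean()
if recent < 0.5:
    print(f'KL ~{recent:.2f} nats: model may be ignoring the stochastic latent')
elif recent > 50:
    print(f'KL ~{recent:.2f} nats: dynamics model is struggling; train longer')
else:
    print(f'KL ~{recent:.2f} nats: within the typical 1-10 nat range')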

Step 4: Evaluate

Imagination Rollout Quality

The most direct way to evaluate your world model is to visualize imagined rollouts and compare them to reality. Given a real episode, encode the first observation, then roll out the world model using the recorded actions and decode the predicted observations:

import numpy as np
from dreamerv3 import Agent

# Load trained agent
agent = Agent.load('./logdir/robot_v1/checkpoint.pkl')

# Load a held-out test episode
episode = np.load('test_episodes/episode_050.npz')
real_images = episode['image']    # (T, 64, 64, 3)
real_actions = episode['action']  # (T, action_dim)

# Encode first observation
obs = {'image': real_images[0:1], 'state': episode['state'][0:1]}
latent = agent.world_model.encode(obs)

# Roll out in imagination using recorded actions
imagined_images = [real_images[0]]  # Start from real first frame
for t in range(1, min(50, len(real_actions))):
    latent = agent.world_model.step(latent, real_actions[t-1:t])
    decoded = agent.world_model.decode(latent)
    imagined_images.append(decoded['image'][0])
imagined_images = np.stack(imagined_images)

# Compare: save side-by-side video
# real_images[:50] vs imagined_images
save_comparison_video(real_images[:50], imagined_images,
                      'rollout_comparison.mp4')
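
The save_comparison_video helper above is left undefined. One possible implementation using imageio (an assumption -- any video writer works; MP4 output requires the imageio-ffmpeg plugin):

import numpy as np
import imageio  # MP4 output requires the imageio-ffmpeg plugin

def save_comparison_video(real, imagined, path, fps=10):
    """Write real and imagined frames side by side for visual comparison."""
    with imageio.get_writer(path, fps=fps) as writer:
        for r, i in zip(real, imagined):
            # Decoded frames may be float; clip and cast before writing.
            i = np.clip(i, 0, 255).astype(np.uint8)
            writer.append_data(np.concatenate([r, i], axis=1))  # (H, 2W, C)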

Quantitative Metrics

Metric | What It Measures | Good Value | Horizon
------ | ---------------- | ---------- | -------
Reconstruction MSE | Pixel-level prediction accuracy | < 0.01 (normalized) | 1-step
SSIM | Structural visual similarity | > 0.85 at step 10, > 0.7 at step 50 | Multi-step
Reward prediction accuracy | Can the model predict task success? | > 90% binary classification | Episode-level
State prediction error | Latent state divergence from ground truth | Task-dependent | Multi-step
Policy success rate (in imagination) | Does the learned policy solve the task in the world model? | > 80% | Episode-level
Policy success rate (real robot) | Does the policy actually work? | Task-dependent (gap with imagined rate indicates model error) | Episode-level

The most important evaluation is the gap between imagined success rate and real success rate. If the policy succeeds 95% in imagination but only 40% on the real robot, the world model has significant inaccuracies that the policy is exploiting. Collect more data in the failure regions and retrain.
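
To produce the multi-step SSIM numbers from the table, compare real and predicted frames at increasing horizons. A sketch using scikit-image, reusing real_images and imagined_images from the rollout script above:

import numpy as np
from skimage.metrics import structural_similarity as ssim

# Multi-step SSIM between real and imagined frames. Cast both to float
# and pass data_range explicitly since decoded frames may not be uint8.
# (Older scikit-image versions use multichannel=True instead of channel_axis.)
for t in [1, 10, 49]:
    if t < len(imagined_images):
        score = ssim(real_images[t].astype(np.float32),
                     imagined_images[t].astype(np.float32),
                     channel_axis=-1, data_range=255)
        print(f'SSIM at step {t}: {score:.3f}')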

Step 5: Use for Planning or Policy Improvement

Option A: Deploy the Learned Policy (Dreamer)

Dreamer's actor network is a reactive policy: given the current latent state, it outputs an action in ~1ms. This is the simplest deployment path.

import time

# Deployment loop (simplified; robot I/O functions are placeholders)
agent = Agent.load('./logdir/robot_v1/checkpoint.pkl')
latent = None
done = False
while not done:
    obs = get_robot_observation()  # {'image': ..., 'state': ...}
    # Encode observation and get action from actor
    latent, action = agent.policy(obs, latent)
    # Denormalize and send to robot
    raw_action = denormalize_actions(action, action_low, action_high)
    send_to_robot(raw_action)
    time.sleep(1.0 / control_frequency)  # e.g., 10Hz

Option B: Model-Predictive Control with TD-MPC2

TD-MPC2 uses the world model directly for online planning via MPPI. This is more compute-intensive but allows changing the objective at runtime.

import time
from tdmpc2 import TDMPC2

# TD-MPC2 deployment loop (simplified; robot I/O functions are placeholders)
agent = TDMPC2.load('./checkpoints/tdmpc2_robot.pt')
done = False
while not done:
    obs = get_robot_observation()
    # MPPI planning: samples 512 action sequences, scores via world model
    action = agent.act(obs, eval_mode=True)
    send_to_robot(denormalize_actions(action, action_low, action_high))
    time.sleep(1.0 / control_frequency)
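
For intuition about what happens inside agent.act, the sketch below shows a generic MPPI-style planner: sample action sequences, score them by rolling out the learned model, and reweight the sampling distribution by exponentiated returns. This is a pedagogical approximation, not TD-MPC2's actual planner (which also mixes in policy-prior samples and bootstraps the tail with a learned value function); dynamics and reward_fn are hypothetical stand-ins for the model's latent dynamics and reward head:

import numpy as np

def mppi_plan(dynamics, reward_fn, state, horizon=10, samples=512,
              temperature=0.5, iterations=6, action_dim=8):
    """Pedagogical MPPI-style planner. dynamics(state, action) and
    reward_fn(state) are hypothetical stand-ins for the learned model."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences in the normalized [-1, 1] range.
        plans = np.clip(
            mean + std * np.random.randn(samples, horizon, action_dim),
            -1.0, 1.0)
        # Score each candidate by rolling it out in the learned model.
        returns = np.zeros(samples)
        for i in range(samples):
            s = state
            for t in range(horizon):
                s = dynamics(s, plans[i, t])
                returns[i] += reward_fn(s)
        # MPPI update: reweight samples by exponentiated return and
        # refit the sampling distribution.
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        mean = np.einsum('i,ihd->hd', weights, plans)
        std = np.sqrt(
            np.einsum('i,ihd->hd', weights, (plans - mean) ** 2)) + 1e-6
    return mean[0]  # execute only the first action, then replan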

Option C: World Model as Data Augmentation

Use the trained world model to generate synthetic episodes for training a downstream Diffusion Policy or VLA. This is the lowest-risk approach -- the world model is used offline, not in the control loop.

import numpy as np
from dreamerv3 import Agent

agent = Agent.load('./logdir/robot_v1/checkpoint.pkl')

def generate_synthetic_episode(agent, initial_obs, num_steps=100):
    """Generate a synthetic episode using the learned policy + world model."""
    latent = agent.world_model.encode(initial_obs)
    images, states, actions, rewards = [], [], [], []
    for t in range(num_steps):
        # Get action from learned policy
        action = agent.actor(latent)
        # Step world model
        latent = agent.world_model.step(latent, action)
        decoded = agent.world_model.decode(latent)
        reward = agent.world_model.reward(latent)
        images.append(decoded['image'][0])
        states.append(decoded['state'][0])
        actions.append(action[0])
        rewards.append(reward[0])
    return {
        'image': np.stack(images),
        'state': np.stack(states),
        'action': np.stack(actions),
        'reward': np.array(rewards),
    }

# Generate 1000 synthetic episodes from diverse initial states
real_episodes = load_real_episodes('dreamer_episodes/')
for i in range(1000):
    # Sample a random real episode for initial observation
    source_ep = real_episodes[np.random.randint(len(real_episodes))]
    start_t = np.random.randint(len(source_ep['image']))
    initial_obs = {
        'image': source_ep['image'][start_t:start_t+1],
        'state': source_ep['state'][start_t:start_t+1],
    }
    synthetic = generate_synthetic_episode(agent, initial_obs)
    np.savez_compressed(f'synthetic_episodes/synth_{i:04d}.npz', **synthetic)

print("Generated 1000 synthetic episodes for downstream policy training")

Common Mistakes

1. Too Little Data

The most common failure. Teams collect 20-50 episodes, train a world model, find it inaccurate, and conclude world models do not work for their task. In reality, 20 episodes is only enough for the simplest reaching tasks. Budget 100-200 episodes minimum for any non-trivial manipulation task. If you cannot collect that many, use TD-MPC2 (which is more data-efficient due to online planning) rather than Dreamer.

2. Wrong Observation Modality

Using only RGB images when proprioception is available. The world model has to learn robot kinematics from pixels alone, which requires far more data than providing joint states directly. Always include proprioceptive state alongside images. The converse mistake is using only proprioception when the task requires reasoning about object positions or orientations not captured in the robot state; in that case, you need images.

3. Insufficient Episode Diversity

A dataset where every episode starts with the object in the same position teaches the world model a narrow slice of the state space. When the object is in a slightly different position at test time, the model makes wild predictions. Deliberately randomize initial conditions during data collection: object positions, orientations, robot starting configuration, and environmental conditions (lighting, background).

4. Mismatched Control Frequency

Training data recorded at 30Hz but deploying the policy at 10Hz (or vice versa). The world model learns dynamics at the data's time resolution. If you train at 30Hz, each action produces a small state change. If you then deploy at 10Hz, each action should produce a 3x larger state change -- but the model has never seen that. Always deploy at the same frequency used during data collection. If you must change frequency, re-collect data at the target frequency.
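
A related safeguard at deployment time is to drive the control loop from a monotonic clock, so the effective frequency stays pinned to the data-collection rate instead of drifting with per-step inference time. A minimal sketch, reusing the placeholder robot I/O functions from Step 5:

import time

CONTROL_HZ = 10.0  # must match the frequency of your training data
period = 1.0 / CONTROL_HZ

next_tick = time.monotonic()
latent = None
done = False
while not done:
    obs = get_robot_observation()  # placeholder from Step 5
    latent, action = agent.policy(obs, latent)
    send_to_robot(denormalize_actions(action, action_low, action_high))
    # Sleep until the next tick on a monotonic clock; this keeps the
    # average rate at CONTROL_HZ even when inference time varies.
    next_tick += period
    time.sleep(max(0.0, next_tick - time.monotonic()))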

5. Ignoring the Reward Signal

For tasks with sparse rewards (success/failure at episode end), the reward predictor needs enough positive examples to learn. If only 10% of your episodes are successful, the model may never learn to predict rewards accurately. Either (a) collect more successful demonstrations, (b) add intermediate reward shaping, or (c) use the world model only for dynamics (data augmentation) rather than policy optimization.
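
A quick way to check whether your dataset has this problem is to measure the fraction of successful episodes, assuming the sparse-reward convention from Step 2 (reward 1.0 on the final step of a success):

import glob
import numpy as np

# Fraction of episodes that end in success, assuming the sparse-reward
# NPZ convention from Step 2 (positive reward on the final step).
paths = glob.glob('episodes/*.npz')
successes = sum(np.load(p)['reward'][-1] > 0 for p in paths)
print(f'{successes}/{len(paths)} episodes successful '
      f'({100 * successes / len(paths):.0f}%)')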

6. Training Too Long Without Evaluation

World model training losses can continue decreasing while actual policy performance stagnates or degrades (overfitting to the replay buffer). Evaluate on the real robot (or held-out simulation episodes) every 100K training steps. If performance degrades, stop training and use the last good checkpoint.

Where to Get Training Data

SVRC Data Services

SVRC's managed data collection service produces datasets optimized for world model training. Episodes are recorded at a fixed 50Hz with hardware-timestamped synchronization, multi-view RGB, calibrated proprioception, and structured variation across initial conditions. Datasets are delivered in Dreamer-compatible NPZ format, TD-MPC2 HDF5 format, or LeRobot Parquet format based on your preference.

For world model projects specifically, we offer a "dynamics diversity" collection protocol that maximizes state-space coverage: episodes include both successful and unsuccessful task completions, varied approach strategies, perturbation recovery, and edge-case scenarios. This produces world models with broader coverage than success-only datasets. Pilot campaigns start at $2,500 for 200 episodes; contact us for a custom quote.

Open Datasets

DatasetEpisodesRobotsObservationsWorld Model Suitability
DROID76KFranka PandaMulti-view RGB, proprio, languageExcellent -- large, diverse, multi-task
Bridge V260KWidowXSingle RGB, proprio, languageGood -- diverse tasks, single viewpoint
Open X-Embodiment1M+22 robot typesVaries by sub-datasetGood for pre-training foundation world models
RoboTurk2.1KSawyerRGB, proprioFair -- smaller scale, good for initial testing
SVRC Public DatasetsVariesUR5e, Franka, ALOHAMulti-view RGB, proprio, F/TGood -- pre-formatted for Dreamer/TD-MPC2

For pre-training a general-purpose world model, start with Open X-Embodiment or DROID, then fine-tune on data from your specific robot and task. For task-specific world models without pre-training, DROID or Bridge V2 filtered to manipulation tasks similar to yours provides a good starting point. SVRC's public datasets are already formatted for direct use with Dreamer v3 and TD-MPC2.

Frequently Asked Questions

How much data do I need to train a world model for my robot?

For a single manipulation task with Dreamer v3 or TD-MPC2, you need 50-200 episodes (10K-50K transitions) as a minimum. For robust deployment performance with object position variation, budget 300-500 episodes. IRIS requires 2-3x more due to VQ-VAE tokenizer training overhead. Quality and diversity matter more than raw volume -- 200 diverse episodes outperform 1000 near-identical ones.

Can I train a world model on a single consumer GPU?

Yes. Dreamer v3 and TD-MPC2 both train on a single RTX 3090 or RTX 4090. Training times range from 2-24 hours depending on dataset size and model configuration. IRIS is more compute-intensive (12-48 hours on a single GPU). You do not need A100s or multi-GPU setups unless you are training UniSim-scale pixel-space world models.

What is the difference between a world model and a simulator like MuJoCo or Isaac Sim?

A physics simulator implements hand-engineered equations of motion (rigid body dynamics, contact models, friction cones). A world model learns dynamics from data. The simulator is accurate for rigid bodies but struggles with deformable objects, cables, liquids, and sensor noise. The world model captures whatever dynamics are present in its training data, including phenomena that are hard to simulate analytically. The tradeoff is that the world model can only predict accurately within its training distribution.

Should I use a world model or a diffusion policy?

They are not mutually exclusive. A world model learns environment dynamics for planning and data augmentation. A diffusion policy is a control policy that generates actions from observations. You can use a world model to generate synthetic training data for a diffusion policy, or use a world model to evaluate and select among candidate trajectories proposed by a diffusion policy. If you must choose one, use a diffusion policy for direct imitation learning from demonstrations, and a world model when you need online RL, planning, or the ability to change objectives at deployment time.

How do I know if my world model is accurate enough for deployment?

Evaluate on held-out real-world trajectories. Encode the initial observation, roll out the world model using the recorded actions, and compare predicted observations against the actual recorded observations. Key metrics: reconstruction MSE, SSIM for visual quality, reward prediction accuracy, and state prediction error at horizon 10/20/50. If imagined rollouts diverge visibly from reality within your task's typical episode length, the model needs more data or architectural changes.

Can I fine-tune a pre-trained world model on my robot's data?

Yes, and this is increasingly the recommended approach. For Dreamer v3 and TD-MPC2, you can initialize from a checkpoint trained on simulation data (e.g., from MuJoCo or Isaac Sim) and fine-tune on real robot data. This typically requires 50-100 real episodes to adapt the dynamics model to real-world physics, compared to 200-500 episodes when training from scratch. Foundation world models like Genie are explicitly designed for this pre-train-then-fine-tune paradigm.
