Getting Started with World Models for Robot Learning: A Practical Guide
By Jerry Huang · April 22, 2026
This is a hands-on guide to training your first world model for a robotics task. We walk through the full pipeline: choosing a model, preparing data, training, evaluating imagination quality, and using the trained model for planning or policy improvement. Code examples use Dreamer v3 throughout, with notes on TD-MPC2 and IRIS alternatives where they differ.
Prerequisites
Hardware
- GPU: NVIDIA RTX 3090, RTX 4090, or A100. Dreamer v3 and TD-MPC2 train on a single consumer GPU. VRAM: 12GB minimum for state-based tasks, 24GB recommended for vision-based tasks at 64x64 resolution, and more than 24GB for higher resolutions such as 256x256.
- CPU/RAM: 8+ cores, 32GB+ RAM. Data loading is CPU-bound for large datasets with image observations. NVMe SSD storage is strongly recommended -- HDD-based data loading will bottleneck training.
- Robot (optional for initial experiments): You can start with simulation environments (DMControl, Meta-World, Robosuite) that require no physical hardware. When ready for real-robot experiments, any arm with position/velocity control and at least one camera will work.
Software
- Python 3.10+
- JAX with CUDA (for Dreamer v3) or PyTorch 2.0+ (for TD-MPC2, IRIS)
- Gymnasium (the maintained successor to OpenAI Gym) for simulation environments
- Weights & Biases (optional but recommended for experiment tracking)
Data
You need trajectory data: sequences of (observation, action, reward, done) tuples recorded from your robot or simulation environment. For initial experiments, use a pre-existing dataset or collect data from a simulation environment. For real-robot world models, you need 50-500 episodes depending on task complexity. See our world models overview for detailed data requirements.
Step 1: Choose Your World Model
Use this decision tree to pick the right starting point for your project:
Decision Tree: Which World Model?
Q1: Is your action space continuous?
- Yes -> Go to Q2
- No (discrete actions) -> Consider IRIS (native discrete support) or Dreamer v3 (also supports discrete)
Q2: Do you need to change the reward/objective at deployment time?
- Yes -> Use TD-MPC2 (online planning with flexible reward)
- No -> Go to Q3
Q3: Do you need the fastest possible training iteration?
- Yes -> Use TD-MPC2 (2-12h training on single GPU)
- No -> Go to Q4
Q4: Does your task require planning 15+ steps ahead?
- Yes -> Use Dreamer v3 (long imagination horizon with RSSM)
- No -> Either Dreamer v3 or TD-MPC2 will work. Dreamer v3 is the safer default.
For the rest of this guide, we use Dreamer v3 as the primary example because it is the most widely applicable. We note TD-MPC2 and IRIS alternatives where they differ.
Step 2: Prepare Your Data
Episode Format
Dreamer v3 expects data as a directory of NPZ files, one per episode. Each NPZ file contains NumPy arrays keyed by modality and episode flags -- typically image, action, reward, is_first, is_last, and is_terminal, plus any proprioceptive keys such as state.
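A minimal sketch of writing and sanity-checking one episode in this layout. The key names follow the reference danijar/dreamerv3 implementation; verify them against the version you install.

```python
import numpy as np

# One episode as a dict of arrays, all sharing the time dimension T.
# Key names follow the danijar/dreamerv3 reference code (verify for your fork).
T = 50
episode = {
    "image": np.zeros((T, 64, 64, 3), dtype=np.uint8),   # RGB observations
    "state": np.zeros((T, 14), dtype=np.float32),        # proprioception
    "action": np.zeros((T, 7), dtype=np.float32),        # normalized to [-1, 1]
    "reward": np.zeros((T,), dtype=np.float32),
    "is_first": np.zeros((T,), dtype=bool),
    "is_last": np.zeros((T,), dtype=bool),
    "is_terminal": np.zeros((T,), dtype=bool),
}
episode["is_first"][0] = True
episode["is_last"][-1] = True

np.savez_compressed("episode_000.npz", **episode)

# Reload and check shapes before pointing the trainer at the directory.
loaded = dict(np.load("episode_000.npz"))
assert loaded["image"].shape == (T, 64, 64, 3)
assert loaded["action"].shape[0] == loaded["reward"].shape[0]
```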
Observation Structure
Key decisions for observation representation:
- Image resolution: Start with 64x64. This is sufficient for most tabletop manipulation tasks and trains 4x faster than 128x128. Only increase resolution if you observe that the model cannot distinguish task-relevant visual details (e.g., small objects, fine orientations) at 64x64.
- Number of cameras: Start with one (overhead or wrist-mounted). Multi-camera setups improve performance but double the data size and training time per additional camera. Add cameras only after validating that a single camera is insufficient.
- Proprioception: Always include joint positions, joint velocities, and gripper state. These provide precise state information that complements noisy visual observations. Normalize each dimension to approximately zero mean and unit variance.
- History: Dreamer v3's RSSM handles history internally via the recurrent state. You do not need to stack frames or provide observation history -- just the current timestep's observation.
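The proprioception normalization mentioned above can be sketched with dataset-level statistics (the function names here are illustrative, not part of any library):

```python
import numpy as np

def fit_normalizer(proprio, eps=1e-6):
    """Per-dimension mean/std over the whole dataset, shape (N, D)."""
    mean = proprio.mean(axis=0)
    std = proprio.std(axis=0) + eps  # eps guards near-constant dimensions
    return mean, std

def normalize(proprio, mean, std):
    return (proprio - mean) / std

# Example: joint positions from three episodes stacked into one array.
rng = np.random.default_rng(0)
raw = np.concatenate([rng.uniform(-3.0, 3.0, (100, 7)) for _ in range(3)])
mean, std = fit_normalizer(raw)
norm = normalize(raw, mean, std)
assert abs(norm.mean()) < 1e-6 and abs(norm.std() - 1.0) < 1e-2
```

Store the fitted mean/std with the dataset so the same statistics are applied at deployment time.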
Action Normalization
This is a common source of training failures. All three world models expect actions in a standardized range:
Dreamer v3 clips actions to [-1, 1] internally. If your raw actions are already in this range, no normalization is needed. TD-MPC2 also expects [-1, 1] normalized actions. IRIS discretizes actions into bins, so normalization is handled by the binning scheme.
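A minimal normalization sketch using per-joint position limits. The limits below are illustrative placeholders, not a specific robot's; substitute your arm's actual bounds.

```python
import numpy as np

# Illustrative per-joint limits -- replace with your robot's actual bounds.
LOW = np.array([-2.9, -1.8, -2.9, -3.1, -2.9, -0.1, -2.9])
HIGH = np.array([2.9, 1.8, 2.9, 0.0, 2.9, 3.8, 2.9])

def to_unit(a):
    """Raw joint command -> [-1, 1], for storage in the training dataset."""
    return 2.0 * (a - LOW) / (HIGH - LOW) - 1.0

def from_unit(a):
    """[-1, 1] -> raw joint command, for execution on the robot."""
    return LOW + (np.clip(a, -1.0, 1.0) + 1.0) * (HIGH - LOW) / 2.0

raw = np.array([0.0, 0.5, -1.0, -1.5, 2.0, 1.0, 0.0])
assert np.allclose(from_unit(to_unit(raw)), raw)  # lossless within limits
```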
Converting from LeRobot Format
If your data is in HuggingFace LeRobot format, use this conversion script:
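The conversion boils down to grouping LeRobot's per-frame rows into per-episode arrays. A sketch, assuming LeRobot-style column names (observation.image, next.reward, episode_index); these vary between dataset versions, so check your dataset's features and adjust:

```python
import os
import numpy as np
from collections import defaultdict

def lerobot_rows_to_npz(rows, out_dir="episodes"):
    """Group per-frame rows (dicts) into per-episode NPZ files.

    Column names below are LeRobot-style but version-dependent --
    inspect your dataset's schema before running.
    """
    os.makedirs(out_dir, exist_ok=True)
    episodes = defaultdict(list)
    for row in rows:
        episodes[row["episode_index"]].append(row)
    paths = []
    for ep_idx, frames in episodes.items():
        T = len(frames)
        ep = {
            "image": np.stack([f["observation.image"] for f in frames]),
            "action": np.stack([f["action"] for f in frames]).astype(np.float32),
            "reward": np.array([f["next.reward"] for f in frames], np.float32),
            "is_first": np.arange(T) == 0,
            "is_last": np.arange(T) == T - 1,
            "is_terminal": np.zeros(T, dtype=bool),
        }
        path = os.path.join(out_dir, f"episode_{ep_idx:06d}.npz")
        np.savez_compressed(path, **ep)
        paths.append(path)
    return paths
```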
Step 3: Train
Installing Dreamer v3
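A typical setup from the reference repository looks like the following. The repo URL and requirements file are those of danijar/dreamerv3 at the time of writing; verify against the repo's README before running.

```shell
# Create an environment and install JAX with CUDA support first.
conda create -n dreamer python=3.10 -y && conda activate dreamer
pip install -U "jax[cuda12]"   # pick the wheel matching your CUDA version

# Install Dreamer v3 from the reference repository.
git clone https://github.com/danijar/dreamerv3.git
cd dreamerv3
pip install -r requirements.txt
```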
Configuration
Create a config file for your custom robot task. Dreamer v3 uses YAML configs with sensible defaults that you override:
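An illustrative override file follows. Option names mirror the danijar/dreamerv3 config style but change between versions, so diff against the defaults file in your checkout.

```yaml
# configs/my_robot.yaml -- illustrative overrides, not a verified config.
my_robot:
  logdir: ~/logdir/my_robot
  task: my_robot_pick_place        # your registered environment name
  run:
    steps: 1e6
    log_every: 300
  encoder: {mlp_keys: 'state', cnn_keys: 'image'}
  decoder: {mlp_keys: 'state', cnn_keys: 'image'}
  batch_size: 16
  batch_length: 64
```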
Launch Training
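A typical launch command follows. The entry point name varies between dreamerv3 versions (train.py vs. main.py), so check the repo README for your checkout.

```shell
python dreamerv3/main.py \
  --configs my_robot \
  --logdir ~/logdir/my_robot
```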
TD-MPC2 alternative:
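TD-MPC2 uses Hydra-style overrides. The flags below follow the nicklashansen/tdmpc2 repository's convention, and the task name is a placeholder for your registered custom task; verify exact argument names against that repo's README.

```shell
python train.py task=my_robot_pick_place steps=1000000
```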
What to Watch During Training
Monitor these metrics in TensorBoard or W&B:
- Image reconstruction loss (decoder_image): Should decrease steadily. If it plateaus high, the encoder/decoder capacity is insufficient -- increase cnn_depth.
- KL divergence (kl_loss): Should converge to a moderate value (typically 1-10 nats). If it drops to near zero, the model is ignoring the stochastic latent -- increase the KL free bits or decrease the KL scale. If it is very high (>50), the dynamics model is struggling to predict the posterior -- the task may need more training steps.
- Reward prediction loss (decoder_reward): Should decrease, especially for tasks with sparse rewards where the model needs to learn which states are terminal/successful.
- Actor loss and critic loss: These should decrease over time. If the actor loss is erratic, the imagination rollouts may be unreliable -- check the world model losses first.
- Imagined return: The average return of trajectories imagined by rolling out the actor in the world model. Should increase as the policy improves. If it increases rapidly but real evaluation performance does not improve, the policy may be exploiting world model inaccuracies.
Step 4: Evaluate
Imagination Rollout Quality
The most direct way to evaluate your world model is to visualize imagined rollouts and compare them to reality. Given a real episode, encode the first observation, then roll out the world model using the recorded actions and decode the predicted observations:
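The encode/rollout/decode calls depend on your framework's API, but the comparison itself is simple. A self-contained sketch, with the model-specific calls left as placeholder comments (the method names shown are not the actual dreamerv3 interface):

```python
import numpy as np

def rollout_error(real_frames, pred_frames):
    """Per-step pixel MSE between a real episode and the imagined rollout
    (both float arrays in [0, 1] with shape (T, H, W, C))."""
    assert real_frames.shape == pred_frames.shape
    return ((real_frames - pred_frames) ** 2).mean(axis=(1, 2, 3))

# The model-specific part looks roughly like this in a Dreamer-style API
# (placeholder method names):
#   latent = wm.encode(first_observation)
#   for t, action in enumerate(recorded_actions):
#       latent = wm.imagine_step(latent, action)
#       pred_frames[t] = wm.decode(latent)

# Toy check with synthetic frames: error should grow with imagination horizon.
rng = np.random.default_rng(0)
T = 20
real = rng.random((T, 64, 64, 3))
drift = np.linspace(0.0, 0.2, T).reshape(T, 1, 1, 1)  # simulated model drift
pred = np.clip(real + drift, 0.0, 1.0)
errs = rollout_error(real, pred)
assert errs[0] < errs[-1]  # divergence accumulates over the horizon
```

Plot errs against the horizon step; the step at which it crosses your tolerance tells you how far the model can be trusted to imagine.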
Quantitative Metrics
| Metric | What It Measures | Good Value | Horizon |
|---|---|---|---|
| Reconstruction MSE | Pixel-level prediction accuracy | < 0.01 (normalized) | 1-step |
| SSIM | Structural visual similarity | > 0.85 at step 10, > 0.7 at step 50 | Multi-step |
| Reward prediction accuracy | Can the model predict task success? | > 90% binary classification | Episode-level |
| State prediction error | Latent state divergence from ground truth | Task-dependent | Multi-step |
| Policy success rate (in imagination) | Does the learned policy solve the task in the world model? | > 80% | Episode-level |
| Policy success rate (real robot) | Does the policy actually work? | Task-dependent (gap with imagined rate indicates model error) | Episode-level |
The most important evaluation is the gap between imagined success rate and real success rate. If the policy succeeds 95% in imagination but only 40% on the real robot, the world model has significant inaccuracies that the policy is exploiting. Collect more data in the failure regions and retrain.
Step 5: Use for Planning or Policy Improvement
Option A: Deploy the Learned Policy (Dreamer)
Dreamer's actor network is a reactive policy: given the current latent state, it outputs an action in ~1ms. This is the simplest deployment path.
Option B: Model-Predictive Control with TD-MPC2
TD-MPC2 uses the world model directly for online planning via MPPI. This is more compute-intensive but allows changing the objective at runtime.
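To make the planning loop concrete, here is a self-contained MPPI sketch on a toy point-mass "world model". It illustrates the sample/rollout/reweight structure only, not TD-MPC2's actual implementation (which additionally bootstraps beyond the horizon with a learned value function and warm-starts the sampling distribution).

```python
import numpy as np

def mppi_plan(dynamics, cost, state, horizon=12, samples=256,
              action_dim=2, sigma=0.4, temperature=0.1, rng=None):
    """One receding-horizon MPPI step: sample action sequences, roll them
    through the (learned) dynamics, softmax-weight them by total cost,
    and return only the first action of the weighted plan."""
    rng = rng if rng is not None else np.random.default_rng(0)
    actions = np.clip(rng.normal(0.0, sigma, (samples, horizon, action_dim)),
                      -1.0, 1.0)
    s = np.repeat(state[None, :], samples, axis=0)  # batched rollouts
    costs = np.zeros(samples)
    for t in range(horizon):
        s = dynamics(s, actions[:, t])
        costs += cost(s)
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return (w[:, None, None] * actions).sum(axis=0)[0]

# Toy stand-in for a learned model: point mass with velocity commands.
goal = np.array([1.0, 1.0])
dynamics = lambda s, a: s + 0.1 * a                  # works batched or not
cost = lambda s: np.sum((s - goal) ** 2, axis=-1)    # distance-to-goal cost

rng = np.random.default_rng(0)
state = np.zeros(2)
for _ in range(80):                                  # closed-loop control
    state = dynamics(state, mppi_plan(dynamics, cost, state, rng=rng))
```

Changing the objective at runtime is just swapping the cost function passed to the planner; the world model is untouched.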
Option C: World Model as Data Augmentation
Use the trained world model to generate synthetic episodes for training a downstream Diffusion Policy or VLA. This is the lowest-risk approach -- the world model is used offline, not in the control loop.
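A sketch of the augmentation loop. model_step and policy below are toy stand-ins for your trained model's latent transition/reward heads and a behavior policy; keep rollouts short, since model error compounds with horizon.

```python
import numpy as np

def generate_synthetic_episodes(model_step, start_states, policy,
                                n_steps=50, noise_scale=0.1, rng=None):
    """Roll the learned world model forward from real start states, adding
    exploration noise to the policy's actions for broader state coverage.
    model_step(state, action) -> (next_state, reward) stands in for the
    trained model's latent-space transition and reward head."""
    rng = rng if rng is not None else np.random.default_rng(0)
    episodes = []
    for s0 in start_states:
        s = np.asarray(s0, dtype=np.float64).copy()
        states, actions, rewards = [], [], []
        for _ in range(n_steps):
            a = np.clip(policy(s) + rng.normal(0.0, noise_scale, s.shape),
                        -1.0, 1.0)
            s, r = model_step(s, a)
            states.append(s.copy()); actions.append(a); rewards.append(r)
        episodes.append({"state": np.array(states),
                         "action": np.array(actions),
                         "reward": np.array(rewards, np.float32)})
    return episodes

# Toy stand-ins: a linear model and a policy that drives the state to zero.
model_step = lambda s, a: (s + 0.1 * a, float(-np.sum(s ** 2)))
policy = lambda s: np.clip(-s, -1.0, 1.0)
eps = generate_synthetic_episodes(model_step, [np.ones(4)], policy, n_steps=10)
assert eps[0]["state"].shape == (10, 4)
```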
Common Mistakes
1. Too Little Data
The most common failure. Teams collect 20-50 episodes, train a world model, find it inaccurate, and conclude world models do not work for their task. In reality, 20 episodes is only enough for the simplest reaching tasks. Budget 100-200 episodes minimum for any non-trivial manipulation task. If you cannot collect that many, use TD-MPC2 (which is more data-efficient due to online planning) rather than Dreamer.
2. Wrong Observation Modality
Using only RGB images when proprioception is available. The world model has to learn robot kinematics from pixels alone, which requires far more data than providing joint states directly. Always include proprioceptive state alongside images. Conversely, using only proprioception when the task requires reasoning about object positions or orientations that are not captured in the robot state -- in this case, you need images.
3. Insufficient Episode Diversity
A dataset where every episode starts with the object in the same position teaches the world model a narrow slice of the state space. When the object is in a slightly different position at test time, the model makes wild predictions. Deliberately randomize initial conditions during data collection: object positions, orientations, robot starting configuration, and environmental conditions (lighting, background).
4. Mismatched Control Frequency
Training data recorded at 30Hz but deploying the policy at 10Hz (or vice versa). The world model learns dynamics at the data's time resolution. If you train at 30Hz, each action produces a small state change. If you then deploy at 10Hz, each action should produce a 3x larger state change -- but the model has never seen that. Always deploy at the same frequency used during data collection. If you must change frequency, re-collect data at the target frequency.
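The 3x mismatch is easy to see with a one-line integrator (a toy velocity-command model, not any specific robot):

```python
import numpy as np

# A model trained on 30 Hz data learns per-step dynamics for dt = 1/30 s.
def step(x, v_cmd, dt):
    return x + v_cmd * dt  # toy velocity-command integrator

x30 = x10 = 0.0
for _ in range(30):            # one second of 30 Hz control
    x30 = step(x30, 0.3, 1 / 30)
for _ in range(10):            # one second of 10 Hz control, same commands
    x10 = step(x10, 0.3, 1 / 10)

# Physically both reach 0.3 m after one second. But a model trained at 30 Hz
# predicts the 30 Hz per-step change (0.01 m), 3x smaller than what actually
# happens per step at 10 Hz (0.03 m), so its rollouts systematically lag.
assert np.isclose(x30, 0.3) and np.isclose(x10, 0.3)
```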
5. Ignoring the Reward Signal
For tasks with sparse rewards (success/failure at episode end), the reward predictor needs enough positive examples to learn. If only 10% of your episodes are successful, the model may never learn to predict rewards accurately. Either (a) collect more successful demonstrations, (b) add intermediate reward shaping, or (c) use the world model only for dynamics (data augmentation) rather than policy optimization.
6. Training Too Long Without Evaluation
World model training losses can continue decreasing while actual policy performance stagnates or degrades (overfitting to the replay buffer). Evaluate on the real robot (or held-out simulation episodes) every 100K training steps. If performance degrades, stop training and use the last good checkpoint.
Where to Get Training Data
SVRC Data Services
SVRC's managed data collection service produces datasets optimized for world model training. Episodes are recorded at a fixed 50Hz with hardware-timestamped synchronization, multi-view RGB, calibrated proprioception, and structured variation across initial conditions. Datasets are delivered in Dreamer-compatible NPZ format, TD-MPC2 HDF5 format, or LeRobot Parquet format based on your preference.
For world model projects specifically, we offer a "dynamics diversity" collection protocol that maximizes state-space coverage: episodes include both successful and unsuccessful task completions, varied approach strategies, perturbation recovery, and edge-case scenarios. This produces world models with broader coverage than success-only datasets. Pilot campaigns start at $2,500 for 200 episodes; contact us for a custom quote.
Open Datasets
| Dataset | Episodes | Robots | Observations | World Model Suitability |
|---|---|---|---|---|
| DROID | 76K | Franka Panda | Multi-view RGB, proprio, language | Excellent -- large, diverse, multi-task |
| Bridge V2 | 60K | WidowX | Single RGB, proprio, language | Good -- diverse tasks, single viewpoint |
| Open X-Embodiment | 1M+ | 22 robot types | Varies by sub-dataset | Good for pre-training foundation world models |
| RoboTurk | 2.1K | Sawyer | RGB, proprio | Fair -- smaller scale, good for initial testing |
| SVRC Public Datasets | Varies | UR5e, Franka, ALOHA | Multi-view RGB, proprio, F/T | Good -- pre-formatted for Dreamer/TD-MPC2 |
For pre-training a general-purpose world model, start with Open X-Embodiment or DROID, then fine-tune on data from your specific robot and task. For task-specific world models without pre-training, DROID or Bridge V2 filtered to manipulation tasks similar to yours provides a good starting point. SVRC's public datasets are already formatted for direct use with Dreamer v3 and TD-MPC2.
Frequently Asked Questions
How much data do I need to train a world model for my robot?
For a single manipulation task with Dreamer v3 or TD-MPC2, you need 50-200 episodes (10K-50K transitions) as a minimum. For robust deployment performance with object position variation, budget 300-500 episodes. IRIS requires 2-3x more due to VQ-VAE tokenizer training overhead. Quality and diversity matter more than raw volume -- 200 diverse episodes outperform 1000 near-identical ones.
Can I train a world model on a single consumer GPU?
Yes. Dreamer v3 and TD-MPC2 both train on a single RTX 3090 or RTX 4090. Training times range from 2-24 hours depending on dataset size and model configuration. IRIS is more compute-intensive (12-48 hours on a single GPU). You do not need A100s or multi-GPU setups unless you are training UniSim-scale pixel-space world models.
What is the difference between a world model and a simulator like MuJoCo or Isaac Sim?
A physics simulator implements hand-engineered equations of motion (rigid body dynamics, contact models, friction cones). A world model learns dynamics from data. The simulator is accurate for rigid bodies but struggles with deformable objects, cables, liquids, and sensor noise. The world model captures whatever dynamics are present in its training data, including phenomena that are hard to simulate analytically. The tradeoff is that the world model can only predict accurately within its training distribution.
Should I use a world model or a diffusion policy?
They are not mutually exclusive. A world model learns environment dynamics for planning and data augmentation. A diffusion policy is a control policy that generates actions from observations. You can use a world model to generate synthetic training data for a diffusion policy, or use a world model to evaluate and select among candidate trajectories proposed by a diffusion policy. If you must choose one, use a diffusion policy for direct imitation learning from demonstrations, and a world model when you need online RL, planning, or the ability to change objectives at deployment time.
How do I know if my world model is accurate enough for deployment?
Evaluate on held-out real-world trajectories. Encode the initial observation, roll out the world model using the recorded actions, and compare predicted observations against the actual recorded observations. Key metrics: reconstruction MSE, SSIM for visual quality, reward prediction accuracy, and state prediction error at horizon 10/20/50. If imagined rollouts diverge visibly from reality within your task's typical episode length, the model needs more data or architectural changes.
Can I fine-tune a pre-trained world model on my robot's data?
Yes, and this is increasingly the recommended approach. For Dreamer v3 and TD-MPC2, you can initialize from a checkpoint trained on simulation data (e.g., from MuJoCo or Isaac Sim) and fine-tune on real robot data. This typically requires 50-100 real episodes to adapt the dynamics model to real-world physics, compared to 200-500 episodes when training from scratch. Foundation world models like Genie are explicitly designed for this pre-train-then-fine-tune paradigm.