Dreamer vs IRIS vs TD-MPC2: Choosing a World Model for Your Robot

By Jerry Huang · April 22, 2026

Three world model architectures dominate robotics research in 2026: Dreamer v3, IRIS, and TD-MPC2. Each takes a fundamentally different approach to learning environment dynamics, and the right choice depends on your task, hardware, data, and deployment requirements. This article provides a practitioner-focused comparison to help you make that decision.

The World Model Paradigm: A Quick Recap

A world model learns to predict what happens next: given the current state and an action, it forecasts the future state and reward. Once trained, the model serves as a learned simulator. An agent can "imagine" thousands of trajectories inside the world model to learn a policy, plan actions, or evaluate safety -- all without touching the real robot. For a thorough introduction, see our companion article: World Models for Robotics: Why They Matter and How They Work.

The three models compared here represent three distinct design philosophies:

  • Dreamer v3: Recurrent state-space model with stochastic latent variables. Learns an actor-critic policy in imagination.
  • IRIS: Tokenized observations and actions modeled by an autoregressive Transformer. Learns by next-token prediction.
  • TD-MPC2: Simple MLP-based latent dynamics model. Uses online model-predictive control (MPPI) for action selection.

Architecture Deep Dive

Dreamer v3: Recurrent State-Space Model

Dreamer v3 uses a Recurrent State-Space Model (RSSM) that maintains two types of state at each timestep: a deterministic recurrent state h_t computed by a GRU, and a stochastic latent state z_t sampled from a categorical distribution. The deterministic state captures long-term dependencies (what has happened in the episode so far), while the stochastic state captures uncertainty about the current situation (what the model is not sure about).

The full model consists of five components trained jointly:

  • Sequence model: h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) -- GRU that integrates history
  • Encoder: z_t ~ q(z_t | h_t, o_t) -- posterior that incorporates the actual observation
  • Dynamics predictor: z_t ~ p(z_t | h_t) -- prior that predicts z_t without seeing o_t (used in imagination)
  • Decoder: o_t ~ p(o_t | h_t, z_t) -- reconstructs observations for training signal
  • Reward/continue predictors: r_t ~ p(r_t | h_t, z_t), c_t ~ p(c_t | h_t, z_t) -- predict reward and episode continuation

During imagination, the encoder is not used (there are no real observations). The dynamics predictor samples z_t from the prior, the sequence model advances h_t, and the actor selects actions. The critic evaluates imagined trajectories to compute advantages for actor updates using a lambda-return estimator.
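
To make the imagination loop concrete, the sketch below rolls the prior forward with a toy actor and critic in PyTorch. It is a minimal illustration, not the dreamerv3 implementation: the module names, layer sizes, and plain GRUCell/Linear stand-ins are assumptions, and the real model adds straight-through gradients, distributional heads, and other machinery.

import torch
import torch.nn as nn
import torch.nn.functional as F

H, Z, A = 256, 32 * 32, 6          # deterministic size, flattened latent size, action dim
seq_model = nn.GRUCell(Z + A, H)   # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
dyn_prior = nn.Linear(H, 32 * 32)  # logits for 32 categorical variables x 32 classes
actor     = nn.Linear(H + Z, A)    # toy stand-in for the actor network
critic    = nn.Linear(H + Z, 1)    # toy stand-in for the value head

def imagine(h, z, horizon=15):
    """Roll the world model forward using only the prior (no real observations)."""
    states, actions = [], []
    for _ in range(horizon):
        a = torch.tanh(actor(torch.cat([h, z], -1)))       # pick an action
        h = seq_model(torch.cat([z, a], -1), h)             # advance recurrent state
        logits = dyn_prior(h).view(-1, 32, 32)               # prior over z_t
        idx = torch.distributions.Categorical(logits=logits).sample()
        z = F.one_hot(idx, 32).float().view(-1, 32 * 32)     # sampled discrete latent
        states.append((h, z))
        actions.append(a)
    values = [critic(torch.cat([h_t, z_t], -1)) for h_t, z_t in states]
    return states, actions, values

h0, z0 = torch.zeros(1, H), torch.zeros(1, Z)
imagined = imagine(h0, z0)   # critic values feed the lambda-return advantage estimates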

Key design choice (introduced in Dreamer v2 and retained in v3): the stochastic state uses 32 categorical variables, each with 32 classes, giving a discrete latent space of 32^32 possible states. This discrete representation avoids the posterior collapse issues that plagued earlier Gaussian-latent versions and provides sharper, more informative representations.

IRIS: Tokenized Autoregressive World Model

IRIS rethinks the world model as a sequence modeling problem. It first trains a VQ-VAE (Vector Quantized Variational Autoencoder) to compress each observation frame into a fixed number of discrete tokens (typically 16-64 tokens per frame at a codebook size of 512-1024). Actions are also discretized if they are continuous. The world model is then an autoregressive Transformer that predicts the next token given all previous observation and action tokens.

The sequence structure for a single transition looks like:

[obs_tokens_t] [action_token_t] [obs_tokens_{t+1}] [action_token_{t+1}] ...

The Transformer processes this flat sequence with causal attention, predicting each token from all preceding tokens. This is architecturally identical to GPT-style language modeling -- the "language" is just observation and action tokens instead of word tokens.
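
The sketch below shows that flat token layout and a causal forward pass, assuming 16 VQ-VAE tokens per frame and a small discrete action set. The vocabulary sizes, the action-id offset trick, and the single nn.TransformerEncoderLayer with a causal mask are illustrative assumptions standing in for IRIS's full GPT-style stack.

import torch
import torch.nn as nn

TOKENS_PER_FRAME, N_ACTIONS, VOCAB = 16, 4, 512   # illustrative sizes

def build_sequence(obs_tokens, action_ids):
    """obs_tokens: (T, 16) VQ-VAE code indices; action_ids: (T,) discrete actions.
    Action ids are offset past the observation vocabulary so both share one embedding."""
    chunks = []
    for frame, act in zip(obs_tokens, action_ids):
        chunks.append(frame)                       # 16 observation tokens
        chunks.append((VOCAB + act).reshape(1))    # 1 action token
    return torch.cat(chunks)                        # flat sequence of length T * 17

seq = build_sequence(torch.randint(0, VOCAB, (8, TOKENS_PER_FRAME)),
                     torch.randint(0, N_ACTIONS, (8,)))

# GPT-style next-token prediction: each position attends only to earlier tokens.
embed = nn.Embedding(VOCAB + N_ACTIONS, 256)
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
causal = torch.triu(torch.full((len(seq), len(seq)), float('-inf')), diagonal=1)
hidden = layer(embed(seq)[None], src_mask=causal)   # (1, T*17, 256)
logits = nn.Linear(256, VOCAB + N_ACTIONS)(hidden)  # next-token logits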

IRIS's advantage is that it directly leverages the Transformer scaling properties that have been validated at enormous scale in NLP. The disadvantage is that the VQ-VAE tokenization is lossy: spatial details below the resolution of the codebook are discarded. For tasks that require distinguishing between object positions that differ by a few pixels, this can be a bottleneck. The tokenizer quality effectively sets a ceiling on world model accuracy.

TD-MPC2: Latent Dynamics with Online Planning

TD-MPC2 takes the simplest architectural approach of the three. It learns four MLP-based components:

  • Encoder: h(o_t) -> z_t -- maps observations to latent states
  • Dynamics model: d(z_t, a_t) -> z_{t+1} -- predicts next latent state
  • Reward predictor: R(z_t, a_t) -> r_t -- predicts immediate reward
  • Value function: Q(z_t, a_t) -> v -- estimates long-term value (TD learning)

There is no recurrence, no attention, no stochastic latent variables -- just standard feed-forward networks. The entire model is trained end-to-end with a joint loss combining latent dynamics consistency (the predicted next latent should match the encoded next observation), reward prediction, and temporal difference (TD) value learning.
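
As a rough illustration of that joint objective, here is a single-step PyTorch sketch. The loss coefficients, the plain MSE terms, and the SARSA-style bootstrap that reuses the logged next action are simplifying assumptions rather than TD-MPC2's exact losses.

import torch
import torch.nn.functional as F

def joint_loss(encoder, dynamics, reward_head, q_net, q_target, batch,
               gamma=0.99, c_dyn=1.0, c_rew=0.5, c_val=0.1):
    """batch: single-step transitions (obs, action, reward, next_obs, next_action)."""
    obs, action, reward, next_obs, next_action = batch
    z = encoder(obs)
    z_next_pred = dynamics(z, action)
    with torch.no_grad():
        z_next = encoder(next_obs)                    # target for latent consistency
        # Bootstrapped TD target; the next action is chosen differently in practice,
        # here we simply reuse the logged one for brevity.
        td_target = reward + gamma * q_target(z_next, next_action).squeeze(-1)
    consistency = F.mse_loss(z_next_pred, z_next)      # predicted latent matches encoded next obs
    rew_loss = F.mse_loss(reward_head(z, action).squeeze(-1), reward)
    val_loss = F.mse_loss(q_net(z, action).squeeze(-1), td_target)
    return c_dyn * consistency + c_rew * rew_loss + c_val * val_loss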

At inference time, TD-MPC2 does not use a learned policy. Instead, it performs Model Predictive Path Integral (MPPI) planning at every control step:

  1. Sample N candidate action sequences (typically N=512, horizon=5 steps)
  2. Roll each sequence out through the dynamics model to get predicted latent trajectories
  3. Score each trajectory: sum of predicted rewards + terminal value from Q-function
  4. Compute a weighted average of the action sequences, weighted by exponentiated scores
  5. Execute the first action from the weighted average sequence

This planning-at-inference approach means the "policy" is implicit -- it emerges from the combination of the world model and the planning algorithm. The major benefit is flexibility: you can change the reward function at deployment time without retraining anything. The cost is that planning takes ~10-50ms per control step, which limits the control frequency to ~20-100Hz depending on GPU and planning horizon.
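
The planning loop itself is compact. Below is a minimal single-iteration MPPI-style sketch following the five steps above; the temperature, the Gaussian sampling, and the model interfaces are assumptions (TD-MPC2's planner also refines its sampling distribution over several iterations).

import torch

def mppi_plan(encoder, dynamics, reward_fn, q_fn, obs,
              num_samples=512, horizon=5, action_dim=6, temperature=0.5):
    z0 = encoder(obs)                                      # (1, latent_dim)
    # 1. Sample candidate action sequences
    actions = torch.randn(num_samples, horizon, action_dim).clamp(-1, 1)
    # 2-3. Roll out each sequence in latent space and score it
    z = z0.expand(num_samples, -1)
    returns = torch.zeros(num_samples)
    for t in range(horizon):
        a = actions[:, t]
        returns = returns + reward_fn(z, a).squeeze(-1)    # accumulate predicted reward
        z = dynamics(z, a)                                  # advance latent state
    returns = returns + q_fn(z, actions[:, -1]).squeeze(-1)  # terminal value bootstrap
    # 4. Weight sequences by exponentiated score and average them
    weights = torch.softmax(returns / temperature, dim=0)
    plan = (weights[:, None, None] * actions).sum(dim=0)    # (horizon, action_dim)
    # 5. Execute only the first action of the averaged plan
    return plan[0]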

Head-to-Head Comparison

Dimension                    | Dreamer v3                           | IRIS                                    | TD-MPC2
Representation               | Discrete categorical latent (32x32)  | VQ-VAE discrete tokens                  | Continuous latent vector
Temporal model               | GRU recurrence                       | Causal Transformer attention            | Single-step MLP (no history)
Action space                 | Continuous or discrete               | Discrete (continuous requires binning)  | Continuous (native)
Policy learning              | Actor-critic in imagination          | Actor-critic in imagination             | Online MPPI planning (no explicit policy)
Training cost (200 episodes) | 6-24h on 1x RTX 3090                 | 12-48h on 1x RTX 3090                   | 2-12h on 1x RTX 3090
Inference latency            | ~1ms (policy forward pass)           | ~5ms (autoregressive decoding)          | ~10-50ms (MPPI planning loop)
Max control freq             | ~100-500Hz                           | ~50-200Hz                               | ~20-100Hz
Imagination horizon          | 15-50 steps (configurable)           | Limited by context window               | 3-10 steps (MPPI horizon)
Reward flexibility           | Fixed at training time               | Fixed at training time                  | Changeable at deployment time
Open-source repo             | danijar/dreamerv3 (JAX)              | eloialonso/iris (PyTorch)               | nicklashansen/tdmpc2 (PyTorch)
License                      | MIT                                  | MIT                                     | MIT

When to Use Each

Choose Dreamer v3 When:

  • You need continuous control with long planning horizons. Dreamer's RSSM can imagine 15-50 steps ahead without prohibitive compute cost, and the actor-critic learns to optimize over these long horizons. This makes it the best choice for tasks like dexterous manipulation, where the robot must plan a grasp approach, finger placement, and lift as a coordinated sequence.
  • You want a proven, well-documented framework. Dreamer v3 has been applied to 150+ tasks and comes with a single set of hyperparameters that works out of the box for most domains. The codebase is clean and well-maintained.
  • Your action space is continuous. Dreamer natively supports continuous actions through its Gaussian actor. No binning or discretization is needed.
  • You are doing online RL on a real robot. Dreamer's sample efficiency (100-1000x better than model-free methods) means you can learn from real-robot interactions in hours rather than weeks. The imagination-based policy updates are risk-free -- only the data collection interacts with the real world.
  • You are comfortable with JAX. The reference implementation is in JAX, which provides excellent GPU utilization but has a steeper learning curve than PyTorch. Community ports to PyTorch exist but may lag behind the official release.

Choose IRIS When:

  • Your actions are naturally discrete or you can discretize them. IRIS models everything as tokens, so it works best when the action space is already discrete (gripper open/close, navigation commands) or can be effectively binned into a manageable number of categories.
  • You want to leverage Transformer scaling. If you have access to large compute and expect to scale your world model to handle diverse tasks, IRIS's architecture is the most natural fit for scaling up. The same architecture can grow from 80M to multi-billion parameters using established Transformer scaling recipes.
  • You are interested in generative world models. Because IRIS generates observation tokens autoregressively, you can sample diverse possible futures for the same action sequence. This is useful for uncertainty estimation and for generating training data for downstream policies.
  • Your task domain resembles those where IRIS has been validated. IRIS has primarily been validated on Atari and simple control tasks. If your robotics application involves continuous, high-dimensional action spaces (7-DOF arm + gripper), prototype carefully and verify that action discretization does not bottleneck performance.

Choose TD-MPC2 When:

  • You need flexible reward functions at deployment. Because TD-MPC2 plans online using the world model and a learned value function, you can modify the reward function at test time without retraining. Specify a new target position, add a collision penalty, or change the task objective entirely -- the planner will adapt. This is uniquely valuable for applications where the goal is specified at runtime.
  • You want the fastest training. TD-MPC2's MLP-based architecture trains 2-5x faster than Dreamer and 5-10x faster than IRIS. For rapid prototyping and iteration, this is a significant advantage.
  • You need a single model across many tasks. TD-MPC2 demonstrated a single 317M-parameter model performing 80 tasks spanning multiple embodiments and action spaces. If your deployment involves many task variations with the same robot, a single large TD-MPC2 model may be more practical than training separate Dreamer models per task.
  • Your control frequency requirement is 10-50Hz. The MPPI planning loop adds latency compared to a learned policy forward pass, but for most manipulation tasks (10-30Hz control), the 10-50ms planning time is acceptable.
  • You prefer PyTorch. The reference implementation is clean PyTorch with minimal dependencies.

Integration with Existing Stacks

ROS2 Integration

None of the three frameworks ship with native ROS2 integration, but wrapping them in a ROS2 node is straightforward. The pattern is the same for all three:

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState, Image
from std_msgs.msg import Float64MultiArray


class WorldModelNode(Node):
    def __init__(self, world_model, planner):
        super().__init__('world_model_node')
        self.model = world_model
        self.planner = planner
        self.latest_image = None
        self.latest_joints = None
        # Subscribe to observations
        self.create_subscription(Image, '/camera/image_raw', self.image_cb, 10)
        self.create_subscription(JointState, '/joint_states', self.joint_cb, 10)
        # Publish actions
        self.action_pub = self.create_publisher(
            Float64MultiArray, '/joint_commands', 10)
        # Control loop timer (e.g., 20Hz)
        self.create_timer(0.05, self.control_step)

    def image_cb(self, msg):
        self.latest_image = msg

    def joint_cb(self, msg):
        self.latest_joints = msg

    def control_step(self):
        if self.latest_image is None or self.latest_joints is None:
            return  # wait until both observation streams have arrived
        obs = self.get_current_observation()  # combine image + joints
        action = self.planner.plan(self.model, obs)
        msg = Float64MultiArray(data=action.tolist())
        self.action_pub.publish(msg)

For TD-MPC2, the planner is the MPPI algorithm. For Dreamer, it is the learned actor network. For IRIS, it is either a learned actor or search over the tokenized action space. The key difference in ROS2 integration is latency: Dreamer's actor forward pass completes in ~1ms, while TD-MPC2's MPPI loop needs 10-50ms, which affects how you configure the control timer.

LeRobot Integration

HuggingFace's LeRobot framework is becoming the standard for robot learning experiments. As of early 2026, LeRobot natively supports behavioral cloning and Diffusion Policy but does not include world model training. However, the dataset format (Parquet episodes with synchronized video and state) is compatible with all three world models with a lightweight adapter:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import numpy as np

# Load a LeRobot dataset
dataset = LeRobotDataset("lerobot/pusht")

# Convert to Dreamer-compatible NPZ episodes.
# episode_data_index holds per-episode frame offsets, so iterate over episode ids.
num_episodes = len(dataset.episode_data_index["from"])
for episode_idx in range(num_episodes):
    episode = dataset.hf_dataset.filter(
        lambda x: x["episode_index"] == episode_idx)
    np.savez_compressed(
        f"episodes/episode_{episode_idx}.npz",
        image=np.stack(episode["observation.image"]),
        state=np.stack(episode["observation.state"]),
        action=np.stack(episode["action"]),
        reward=np.array(episode["next.reward"]),
        is_terminal=np.array(episode["next.done"]),
    )

For TD-MPC2, convert to its HDF5 format instead. For IRIS, you need to additionally train the VQ-VAE tokenizer on the observation images before training the Transformer world model.
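
For reference, a per-episode HDF5 conversion with h5py might look like the following; the field names and file layout here are illustrative assumptions, so check them against the tdmpc2 release you are using.

import h5py
import numpy as np

def save_episode_hdf5(path, image, state, action, reward, done):
    """Write one episode's arrays to an HDF5 file (field names are illustrative)."""
    with h5py.File(path, "w") as f:
        f.create_dataset("image", data=np.asarray(image), compression="gzip")
        f.create_dataset("state", data=np.asarray(state))
        f.create_dataset("action", data=np.asarray(action))
        f.create_dataset("reward", data=np.asarray(reward))
        f.create_dataset("done", data=np.asarray(done))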

Gymnasium / MuJoCo Integration

All three frameworks support Gymnasium (formerly OpenAI Gym) environments natively or with minimal configuration. This makes them easy to prototype with in simulation before moving to real hardware:

  • Dreamer v3: Built-in support for DMControl, Atari, Minecraft, and custom Gymnasium envs via a wrapper.
  • IRIS: Built-in Atari support. Gymnasium continuous control requires a custom wrapper for action discretization (see the wrapper sketch after this list).
  • TD-MPC2: Built-in support for DMControl, Meta-World, ManiSkill, and MyoSuite. Adding new Gymnasium envs requires implementing a task config file (10-20 lines).
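
For the IRIS case, a per-dimension binning wrapper is a minimal sketch of what that discretization can look like. The bin count is an illustrative assumption, and note that the joint action set grows exponentially with action dimensions (3 bins on a 7-DOF arm already gives 3^7 = 2187 discrete actions).

import itertools
import numpy as np
import gymnasium as gym

class DiscretizeActions(gym.ActionWrapper):
    """Map a continuous Box action space to a Discrete one by binning each dimension."""

    def __init__(self, env, bins_per_dim=3):
        super().__init__(env)
        low, high = env.action_space.low, env.action_space.high
        # Cartesian product of per-dimension bin centers
        grids = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
        self._actions = np.array(list(itertools.product(*grids)), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, act):
        return self._actions[act]          # discrete index -> continuous action vector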

Data Requirements Per Model

Type of Trajectories

All three models learn from trajectory data, but they are sensitive to different properties:

  • Dreamer v3 needs trajectories with clear temporal structure. It benefits from episodes that include both successful and failed attempts, as the reward predictor needs negative examples to calibrate. Random exploration data is useful for initial world model pre-training; task-specific demonstrations improve the policy.
  • IRIS needs diverse trajectories to train a good VQ-VAE codebook. If all your episodes look visually similar, the codebook will have poor coverage of novel situations. Prioritize visual diversity: different object colors, positions, lighting conditions, and camera angles.
  • TD-MPC2 is the most flexible about data quality because it uses online planning rather than a learned policy. Even with a somewhat inaccurate world model, MPPI planning can often find good actions by evaluating many candidates. However, the value function (Q-network) needs sufficient data coverage to provide useful terminal value estimates.

Observation Modality

Modality            | Dreamer v3                       | IRIS                             | TD-MPC2
RGB images          | Native (CNN encoder)             | Native (VQ-VAE)                  | Supported (CNN encoder)
Proprioception only | Supported (MLP encoder)          | Requires state tokenization      | Native (default modality)
Multi-camera        | Concatenate or multi-encoder     | Separate VQ-VAE per view         | Concatenate encodings
Point clouds        | Custom encoder needed            | Not well-supported               | Custom encoder needed
Force/torque        | Concatenate with proprioception  | Tokenize as additional modality  | Concatenate with proprioception

Episode Length

The three models handle episode length differently:

  • Dreamer v3: Trains on fixed-length subsequences (typically 50-64 steps) sampled from episodes of any length. Long episodes (1000+ steps) are fine -- the RSSM's recurrent state maintains context. This makes Dreamer well-suited to long manipulation tasks.
  • IRIS: Limited by the Transformer's context window. With a 1024-token context and 16 tokens per frame + 1 action token, you get roughly 60 frames of context. Longer episodes require truncation or chunking, which can lose long-range dependencies.
  • TD-MPC2: Trains on short subsequences (typically 5-10 steps) because the dynamics model is single-step and the Q-function handles long-term credit assignment via TD learning. Episode length does not affect training, but the MPPI planning horizon is typically only 3-10 steps, so the planner relies heavily on the Q-function for long-horizon tasks.

Code Quickstart

Dreamer v3

# Install (JAX + GPU)
pip install jax[cuda12] jaxlib dreamerv3

# Train on DMControl Reacher
python dreamerv3/main.py \
    --configs dmc_vision \
    --task dmc_reacher_easy \
    --logdir ./logdir/reacher \
    --steps 500000

# Train on custom robot data (NPZ format)
python dreamerv3/main.py \
    --configs dmc_vision \
    --task custom_robot \
    --logdir ./logdir/custom \
    --data_dir /path/to/npz_episodes/ \
    --steps 1000000

IRIS

# Install
git clone https://github.com/eloialonso/iris.git
cd iris && pip install -e .

# Train on Atari Breakout
python src/main.py env.train.id=BreakoutNoFrameskip-v4 \
    common.device=cuda:0 \
    wandb.mode=offline

# Key config overrides for custom domains (config/trainer.yaml):
#   tokenizer.vocab_size: 512
#   world_model.tokens_per_block: 17   # 16 obs tokens + 1 action token
#   world_model.max_blocks: 20         # context window in frames

TD-MPC2

# Install
pip install tdmpc2

# Train on the DMControl dog-run task
python train.py task=dog-run model_size=48 steps=10000000

# Train a single multi-task model on the 80-task benchmark (mt80)
python train.py task=mt80 model_size=317 steps=25000000 batch_size=1024

# Evaluate with MPPI planning
python evaluate.py task=dog-run checkpoint=/path/to/model.pt \
    num_samples=512 horizon=5

Common Failure Modes and How to Debug

World Model Predicts Blurry / Averaged Futures

Symptom: Decoded imagined observations look like an average of multiple plausible outcomes. Objects appear ghostly or duplicated.

Diagnosis: The model's latent space is not capturing multimodality. Common with Gaussian latent variables (Dreamer v1/v2) or undersized VQ-VAE codebooks (IRIS).

Fix: For Dreamer, ensure you are using v3's discrete categorical latents, not Gaussian. For IRIS, increase the VQ-VAE codebook size (try 1024 or 2048) and tokens per frame (32 or 64). For TD-MPC2, this is less of an issue because the latent space is task-conditioned and does not need to reconstruct observations.

Policy Exploits World Model Inaccuracies

Symptom: The policy achieves high imagined reward but fails on the real robot. It has found a "glitch" in the world model -- a state-action sequence that gets high predicted reward but does not correspond to real physics.

Diagnosis: The policy has moved into a region of state-action space where the world model is inaccurate, and it is exploiting those inaccuracies.

Fix: (1) Train an ensemble of 3-5 world models and penalize the policy for visiting states where ensemble predictions disagree. (2) Shorten the imagination horizon to reduce compounding error. (3) Collect more real data in the regions where the policy is operating and retrain the world model. For TD-MPC2, the MPPI planning horizon is already short (3-10 steps), making this less common.
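
Fix (1) can be implemented in a few lines; the ensemble interface and penalty weight below are illustrative assumptions.

import torch

def penalized_reward(ensemble, reward_fn, z, action, beta=1.0):
    """ensemble: list of dynamics models mapping (z, action) -> next latent."""
    preds = torch.stack([m(z, action) for m in ensemble])    # (K, B, latent_dim)
    disagreement = preds.std(dim=0).mean(dim=-1)              # high where models disagree
    return reward_fn(z, action).squeeze(-1) - beta * disagreement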

VQ-VAE Codebook Collapse (IRIS)

Symptom: Only a small fraction of codebook entries are used. Reconstructed observations lack detail.

Diagnosis: The VQ-VAE training collapsed to using a subset of codes. This is a well-known issue with VQ-VAEs.

Fix: (1) Use exponential moving average (EMA) codebook updates instead of gradient-based. (2) Increase codebook size. (3) Add codebook reset: periodically reinitialize unused codes from the encoder output distribution. (4) Use a commitment loss weight of 0.25 (the default in IRIS).
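
A minimal sketch of fix (3), the codebook reset; the usage threshold and bookkeeping are assumptions, not IRIS's exact implementation.

import torch

@torch.no_grad()
def reset_dead_codes(codebook, usage_counts, encoder_outputs, min_usage=1):
    """codebook: (K, D) embedding weights; usage_counts: (K,) uses since the last reset;
    encoder_outputs: (N, D) recent pre-quantization encoder vectors."""
    dead = usage_counts < min_usage
    if dead.any():
        # Reinitialize unused codes from the distribution of recent encoder outputs
        idx = torch.randint(0, encoder_outputs.shape[0], (int(dead.sum()),))
        codebook[dead] = encoder_outputs[idx]
    usage_counts.zero_()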

TD Learning Instability (TD-MPC2)

Symptom: Q-values diverge or oscillate during training. The MPPI planner produces erratic actions.

Diagnosis: The temporal difference learning for the value function is unstable, often due to a too-high learning rate or insufficient target network update frequency.

Fix: (1) Reduce the learning rate for the Q-function (try 3e-4 instead of 1e-3). (2) Reduce the target network soft-update coefficient tau (e.g., from 0.01 to 0.005) so target values change more smoothly. (3) Add layer normalization to the Q-network. TD-MPC2's default hyperparameters are generally stable, so if you are modifying them, revert to defaults first.
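
For reference, the soft (Polyak) target update in fix (2) looks like this; the module names and tau value are placeholders.

import torch

@torch.no_grad()
def soft_update(q_net, q_target, tau=0.005):
    # target <- (1 - tau) * target + tau * online; smaller tau means smoother targets
    for p, p_t in zip(q_net.parameters(), q_target.parameters()):
        p_t.lerp_(p, tau)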

Insufficient Data Diversity

Symptom: The world model makes accurate predictions on training-like states but fails catastrophically on slightly different initial configurations.

Diagnosis: The dataset lacks sufficient variation in initial states, object configurations, or environmental conditions.

Fix: This is a data problem, not a model problem. Collect more episodes with deliberately randomized initial conditions. SVRC's data collection service follows structured randomization protocols specifically designed to maximize state-space coverage for world model training. As a rule of thumb, your dataset should cover 3-5x the range of variation you expect at deployment.

Related Reading