Why Multi-Task Learning
Single-task robot policies have a fundamental limitation: they see data from only one task, so they cannot exploit the visual understanding shared across related tasks. A policy trained only to pick red cups never learns that "picking" is a general skill applicable to blue mugs, green bottles, and metal cans.
Multi-task learning addresses this by co-training on demonstrations from multiple related tasks simultaneously. The benefits compound:
- Shared visual features — tasks sharing a workspace share visual semantics. A representation trained on "pick cup," "pick bottle," and "pick bowl" learns "graspable object" as a concept. Single-task training learns only "cup."
- More data per training run — 10 tasks × 200 demos each = 2,000 training examples total, all contributing to a shared representation. Each individual task effectively has access to all 2,000 examples for feature learning.
- Better foundation for new tasks — a multi-task policy adapts to genuinely new tasks faster than a single-task policy, because the representation is already richer.
Task Conditioning Approaches
A multi-task policy must know which task to perform. Three primary conditioning approaches:
- Language instruction — most flexible. Policy receives text like "pick up the red cup and place it in the bin." Uses a language encoder (T5, CLIP text encoder) to produce a task embedding. Enables zero-shot generalization to novel instructions. Used by OpenVLA, RT-2, SayCan.
- One-hot task ID — simplest. Policy receives a task index (task 0, task 1, ..., task N). No language understanding needed. Works well when task set is fixed and small (<20 tasks). Used in many offline RL papers.
- Goal image — policy receives an image of the desired final state. No language needed; works for tasks hard to describe verbally. Limitations: collecting goal images is labor-intensive, and the policy must infer what action produces the goal, not just what the goal looks like.
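The simplest of these three, one-hot task conditioning, can be sketched in a few lines: build a one-hot task vector and concatenate it with the visual features before the policy trunk. This is an illustrative numpy sketch, not code from any of the systems named above; the 512-dimensional feature vector is an assumed stand-in for a ResNet/ViT embedding.

```python
import numpy as np

def one_hot_task_embedding(task_id: int, num_tasks: int) -> np.ndarray:
    """Simplest conditioning: a one-hot vector indexing the task."""
    emb = np.zeros(num_tasks, dtype=np.float32)
    emb[task_id] = 1.0
    return emb

def condition_policy_input(visual_features: np.ndarray,
                           task_emb: np.ndarray) -> np.ndarray:
    """Concatenate visual features with the task embedding; the
    combined vector is what the shared policy trunk consumes."""
    return np.concatenate([visual_features, task_emb])

# Assumed shapes: 512-d pooled visual features, 10 tasks.
feats = np.random.randn(512).astype(np.float32)
x = condition_policy_input(feats, one_hot_task_embedding(task_id=3, num_tasks=10))
# x has length 512 + 10
```

Language conditioning follows the same pattern, with the one-hot vector replaced by a frozen text-encoder embedding (e.g. from T5 or CLIP).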
Co-Training Benefits: Evidence from Published Work
The empirical evidence for multi-task co-training is strong:
- Open X-Embodiment — training on the full diverse dataset (22 embodiments, 527 tasks) consistently outperforms training on a single dataset, even for tasks where the single-task dataset is large. Diverse tasks → better generalization. Key finding: more diverse is better, even when many tasks seem unrelated.
- ALOHA bimanual tasks — co-training 25 bimanual tasks (folding, packing, assembly) showed that a shared encoder improves performance on all tasks compared to separate per-task models. The improvement is largest for rare tasks with few demonstrations: a task with 20 demos performs as well in multi-task training as a task with 80 demos in single-task training.
- Bridge V2 + other datasets — adding Bridge V2 data when fine-tuning OpenVLA consistently improves novel-object generalization by 10–20 percentage points versus fine-tuning on the target dataset alone, even when the Bridge V2 tasks differ from the target task.
Negative Transfer: When Task Sharing Hurts
Not all task combinations benefit from sharing. Negative transfer occurs when training on multiple tasks together produces a policy worse than any individual single-task policy.
- Conflicting action distributions — "wipe table" (broad sweeping motions) and "pick object" (precise positioning) have fundamentally different action distributions. A shared policy head tries to average these, producing trajectories good for neither.
- Conflicting visual representations — tasks requiring attention to fine-grained texture (sewing, electronics) conflict with tasks requiring global scene understanding (navigation, bin picking).
- Different hardware setups — co-training data from a parallel jaw gripper with data from a multi-fingered hand produces confused action sequences. Keep hardware-specific data separate.
- Rule of thumb: if two tasks share the same end-effector and operate on objects in the same workspace, they almost always benefit from co-training. If they require fundamentally different motions or sensors, evaluate carefully.
Architecture Choices
- Single shared encoder + task-specific heads — most common. Shared ResNet or ViT encoder extracts visual features; per-task MLP heads predict actions. Good when tasks share visual context but differ in action distribution.
- Mixture of experts — N expert networks, each specializing in a subset of tasks. A gating network routes input to the appropriate expert. Reduces negative transfer for conflicting tasks. Used in some multi-task RL work.
- Hypernetwork / task-conditioned weight generation — meta-learning approach where a hypernetwork generates task-specific policy weights given a task embedding. Most flexible but most complex to train.
- For most practical multi-task manipulation work, a shared ViT encoder with language task conditioning and a diffusion action head (as in Octo) is the current best practice.
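The shared-encoder-plus-heads pattern is compact enough to sketch directly. The following numpy toy uses linear layers and illustrative dimensions purely to show the dataflow (one shared trunk, a dictionary of per-task heads); a real implementation would use a ResNet/ViT encoder and a trained action head.

```python
import numpy as np

class SharedEncoderMultiHead:
    """Sketch: one shared encoder, one small action head per task.
    Shapes and init scales are illustrative, not from any paper."""

    def __init__(self, obs_dim, feat_dim, act_dim, task_names, seed=0):
        rng = np.random.default_rng(seed)
        # Shared trunk weights, reused by every task.
        self.W_enc = rng.normal(0.0, 0.02, (feat_dim, obs_dim))
        # One lightweight head per task.
        self.heads = {t: rng.normal(0.0, 0.02, (act_dim, feat_dim))
                      for t in task_names}

    def forward(self, obs, task):
        feat = np.tanh(self.W_enc @ obs)   # shared visual features
        return self.heads[task] @ feat     # task-specific action

net = SharedEncoderMultiHead(obs_dim=8, feat_dim=16, act_dim=7,
                             task_names=["pick_cup", "pick_bottle"])
action = net.forward(np.zeros(8), "pick_cup")
```

Gradients from every task flow into `W_enc`, which is exactly where the shared-representation benefit comes from; only the small heads stay task-specific.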
Data Mixing Strategies
- Task-proportional sampling — sample each task in proportion to its dataset size. Largest datasets dominate training. Risk: rare tasks with few demos are underrepresented and may not improve.
- Oversample rare tasks — uniform task sampling (equal probability per task regardless of size) over-weights rare tasks, preventing them from being drowned out. Often outperforms proportional sampling.
- Curriculum mixing — start with simpler tasks, gradually introduce harder ones. Mirrors human learning: foundation skills before complex skills. Particularly useful when task difficulty spans a wide range.
- Difficulty-weighted sampling — sample tasks in inverse proportion to current policy performance. Under-performing tasks get more data. Adaptive and can improve convergence, but requires online performance estimates.
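The first two strategies differ only in how per-task sampling probabilities are computed, which a short sketch makes concrete. This is an illustrative helper (the task names and sizes are made up), showing why uniform sampling oversamples rare tasks relative to proportional sampling.

```python
def task_sampling_probs(task_sizes: dict, strategy: str = "proportional") -> dict:
    """Probability of drawing the next minibatch from each task.

    'proportional' weights by dataset size, so large datasets dominate;
    'uniform' gives every task equal probability, oversampling rare tasks.
    """
    tasks = list(task_sizes)
    if strategy == "proportional":
        total = sum(task_sizes.values())
        return {t: task_sizes[t] / total for t in tasks}
    if strategy == "uniform":
        return {t: 1.0 / len(tasks) for t in tasks}
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical sizes: "wipe" has 4x the demos of "pick".
sizes = {"wipe": 800, "pick": 200}
prop = task_sampling_probs(sizes, "proportional")  # pick gets 0.2
unif = task_sampling_probs(sizes, "uniform")       # pick gets 0.5
```

Curriculum and difficulty-weighted mixing fit the same interface: they just replace these static probabilities with a schedule or with weights derived from current per-task success rates.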
Evaluating Multi-Task Policies
Held-out task generalization is the gold standard: train on N tasks, evaluate on M tasks not seen during training. This measures whether the policy learned transferable skills or merely memorized individual tasks.
Published results on held-out task generalization: policies trained with language conditioning on 25+ tasks typically achieve 40–70% success on genuinely novel tasks with similar affordances. Single-task policies achieve near-zero on these same novel tasks. The gap represents the value of multi-task training.
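The evaluation loop itself is simple: roll the trained policy out on tasks excluded from training and record per-task success rates. In this sketch, `run_episode` is a hypothetical callback (it would wrap your simulator or real-robot harness and return whether the episode succeeded); the task names are made up.

```python
def heldout_success(policy, heldout_tasks, rollouts_per_task, run_episode):
    """Success rate per held-out task.

    `run_episode(policy, task)` is an assumed user-supplied function
    returning True on a successful episode.
    """
    rates = {}
    for task in heldout_tasks:
        successes = sum(bool(run_episode(policy, task))
                        for _ in range(rollouts_per_task))
        rates[task] = successes / rollouts_per_task
    return rates

# Toy usage with a stubbed run_episode that always succeeds.
rates = heldout_success(policy=None,
                        heldout_tasks=["stack_blocks", "open_drawer"],
                        rollouts_per_task=4,
                        run_episode=lambda p, t: True)
```

Reporting the gap between training-task and held-out-task success rates is what distinguishes transferable skill learning from per-task memorization.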
Practical Guidance
- Always co-train if you have more than 3 related tasks and the same hardware/end-effector. The overhead is minimal and the benefit is consistent.
- Use language conditioning even if you never intend to use language at deployment — it forces better task representation and simplifies data collection (just describe each task verbally).
- Evaluate on held-out objects from the start. If per-task performance is much higher than cross-task generalization, you are memorizing rather than learning.
- Separate models are justified only when tasks require genuinely different hardware, sensors, or action modalities.