Why Curriculum Matters
Starting a robot learning program with complex tasks is the most common, and most expensive, mistake teams make. A policy trained from scratch on a task like bimanual cloth folding will fail to learn anything useful in its first 2,000 demonstrations: the task space is too large, the reward signal too sparse, and the network has no prior skills to build on.
A well-designed curriculum sequences tasks from primitive skills to complex compositions, where each stage builds on skills acquired in the previous stage. The result: policies that learn faster, generalize better, and require fewer total demonstrations. Teams using curriculum learning at SVRC have reduced total demonstration requirements by 40–60% for complex manipulation tasks compared to direct end-to-end training.
Curriculum Design Principles
- Start with primitive skills (reaching, grasping): A "reach to object" policy and a "grasp stable object" policy are the foundation for virtually every manipulation task. Train these to high competence (>95% success) before composing them into multi-step tasks.
- Compose into complex tasks: Once primitive skills are trained, complex tasks are learned much faster because the policy already has the relevant perceptual and motor sub-skills. A "place cup in tray" task trained after reaching and grasping policies exist needs only 300–500 demonstrations to learn the composition; training it from scratch requires 2,000–5,000.
- Reuse demonstrations across tasks: Demonstrations of primitive skills can be reused as pretraining data for all tasks that require those skills. This amortizes the cost of collecting high-quality primitive demonstrations.
- Measure primitive skill quality strictly: A primitive skill that only works 80% of the time will compound badly in a composed task. Target >95% success on each primitive before advancing to composition.
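The gating rule in the last bullet can be made concrete. The sketch below (names like `SkillStage` and `ready_to_compose` are invented for illustration, not from any library) tracks per-skill evaluation rollouts and only unlocks composition once every primitive clears the 95% target:

```python
# Hypothetical stage-gating check: advance to composed tasks only once each
# primitive skill clears the >95% success target on enough evaluation rollouts.
from dataclasses import dataclass, field

SUCCESS_GATE = 0.95  # target success rate from the text

@dataclass
class SkillStage:
    name: str
    trials: list = field(default_factory=list)  # True/False rollout outcomes

    def record(self, success: bool) -> None:
        self.trials.append(success)

    def success_rate(self) -> float:
        return sum(self.trials) / len(self.trials) if self.trials else 0.0

def ready_to_compose(stages, gate=SUCCESS_GATE, min_trials=50):
    """All primitives must clear the gate, measured on enough rollouts."""
    return all(len(s.trials) >= min_trials and s.success_rate() >= gate
               for s in stages)

reach = SkillStage("reach")
grasp = SkillStage("grasp")
for _ in range(60):
    reach.record(True)
    grasp.record(True)
grasp.record(False)  # grasp now at 60/61 ~ 0.984, still above the gate
print(ready_to_compose([reach, grasp]))  # True
```

The `min_trials` floor matters: a 95% estimate from ten rollouts is too noisy to gate on, which is why the check requires a minimum evaluation count before it will pass.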
Task Difficulty Levels L1–L5
| Level | Task Description | Examples | Demo Requirement | Key Challenge |
|---|---|---|---|---|
| L1 — Primitive single-step | Top grasp of flat, stationary object in fixed position | Pick flat block, open/close gripper on fixed peg | 50–200 | Precise approach trajectory |
| L2 — Varied grasp | Grasp objects with varied size, position, or orientation | Pick cylinder of varying diameter, grasp in ±30° orientation range | 500–1,000 | Visual generalization |
| L3 — Two-step manipulation | Sequential actions with state dependency | Pick and place, open box lid then insert item | 1,000–5,000 | State estimation between steps |
| L4 — Contact-rich assembly | Precise insertion, assembly under uncertainty | USB plug insertion, snap-fit assembly, nut threading | 5,000–20,000 | Reactive force control |
| L5 — Bimanual deformable | Two arms, deformable objects, long horizons | Fold towel, bag groceries, cut with knife and fork | 20,000+ | Bimanual coordination, deformable state |
These ranges assume modern transformer-based imitation learning architectures (ACT, Diffusion Policy). Older behavior cloning methods require 2–5× more demonstrations for the same task. If you are using a newer architecture like π0 or a large pre-trained vision-language-action model, the demonstration requirements at L1–L3 may be 3–10× lower due to pre-trained priors.
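For budgeting, the table ranges and the architecture multipliers above can be combined into a rough estimator. This is purely back-of-envelope; the level keys and architecture names are made up for this sketch:

```python
# Back-of-envelope demo-budget estimator using the table ranges plus the
# architecture multipliers from the text (2-5x more for older behavior
# cloning, 3-10x fewer at L1-L3 for pre-trained VLA models). Illustrative only.
DEMO_RANGES = {  # level -> (low, high) demos, transformer IL baseline
    "L1": (50, 200),
    "L2": (500, 1_000),
    "L3": (1_000, 5_000),
    "L4": (5_000, 20_000),
    "L5": (20_000, None),  # open-ended upper bound
}

def estimate_demos(level, architecture="transformer_il"):
    low, high = DEMO_RANGES[level]
    if architecture == "legacy_bc":
        f_lo, f_hi = 2, 5                       # 2-5x more demonstrations
    elif architecture == "pretrained_vla" and level in ("L1", "L2", "L3"):
        f_lo, f_hi = 1 / 10, 1 / 3              # 3-10x fewer demonstrations
    else:
        f_lo = f_hi = 1
    scale = lambda v, f: None if v is None else round(v * f)
    return (scale(low, f_lo), scale(high, f_hi))

print(estimate_demos("L3"))                    # (1000, 5000)
print(estimate_demos("L3", "pretrained_vla"))  # (100, 1667)
```

Treat the output as a planning range, not a commitment: actual requirements depend heavily on task variability and demonstration quality.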
Skill Decomposition Example
Consider a cup stacking task (L3): pick up a cup and stack it on a second cup. This task decomposes into three reusable skills:
- Reach skill: Move end-effector to within 2 cm of the cup handle area, with correct approach angle. Trained independently with 100–200 demos. Reusable for any cup-grasping task.
- Grasp skill: Close gripper on cup from reach pose, verify grasp success via gripper position feedback. Trained with 50–100 demos (initialized from reach skill weights). Reusable for all cup handling tasks.
- Place skill: Lower cup onto target cup, align centers, release gripper. The novel skill in cup stacking. Requires 300–500 demos given reach+grasp pretraining. Without pretraining: 1,500–2,500 demos.
Total with curriculum: 450–800 demonstrations. Total without curriculum: 2,000–5,000 demonstrations. The curriculum reduces data requirements by 4–6× for this task.
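The composition itself can be sketched as a sequential handoff between the three trained skills: each policy runs until its termination condition holds, then passes control to the next. The toy state dictionary and function names below are invented for illustration; real skills would be learned policies with their own observation and termination logic:

```python
# Illustrative chaining of the three skills above into the cup-stacking task.
# Each stand-in "skill" mutates a toy state dict the way a trained policy
# would drive the real environment.
def reach(state):
    state["ee_near_cup"] = True          # end-effector within 2 cm of handle
    return state

def grasp(state):
    if state.get("ee_near_cup"):
        state["holding_cup"] = True      # grasp verified via gripper feedback
    return state

def place(state):
    if state.get("holding_cup"):
        state["stacked"] = True          # cup lowered and released on target
    return state

def run_composed_task(state, skill_sequence=(reach, grasp, place)):
    for skill in skill_sequence:
        state = skill(state)
    return state

print(run_composed_task({})["stacked"])  # True
```

The ordering dependency is the point: `grasp` only succeeds from a completed `reach`, which is exactly why an unreliable primitive compounds badly downstream.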
Transfer Learning Between Tasks
Modern policy architectures share a visual encoder across the observation space. This encoder learns to extract task-relevant features from images. When you train the encoder on a variety of tasks through curriculum, it develops richer representations that transfer more effectively to novel tasks.
- Shared visual encoder: Train a single ResNet-18 or ViT-small encoder on all curriculum tasks jointly. This encoder learns features like "object edge," "gripper proximity," and "grasp contact" that are useful across many tasks.
- 3× sample efficiency with curriculum: Empirically, a curriculum-pretrained visual encoder reduces the demonstration requirement for a novel L3 task from 2,000 demos to approximately 600–700 demos — roughly a 3× improvement in sample efficiency.
- Fine-tuning strategy: For a new task, freeze the encoder weights from the curriculum (or use a low learning rate), and train only the action decoder on the new task's demonstrations. This prevents catastrophic forgetting of previous skills while learning the new task.
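The freeze-the-encoder strategy in the last bullet maps directly onto standard PyTorch mechanics. The tiny MLP "encoder" and linear "decoder" below are stand-ins, not any specific policy architecture from this document:

```python
# Minimal PyTorch sketch of the fine-tuning strategy above: freeze the
# curriculum-pretrained encoder and train only the new task's action decoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # curriculum-pretrained
decoder = nn.Linear(32, 7)                             # new task's action head

# Freeze the encoder so only the decoder adapts to the new task.
for p in encoder.parameters():
    p.requires_grad = False

# The low-learning-rate alternative: give the encoder its own param group.
optimizer = torch.optim.Adam([
    {"params": decoder.parameters(), "lr": 1e-3},
    # {"params": encoder.parameters(), "lr": 1e-5},  # low-LR variant
])

obs = torch.randn(8, 64)             # batch of flattened observations
with torch.no_grad():                # no gradients through the frozen encoder
    feats = encoder(obs)
loss = decoder(feats).pow(2).mean()  # placeholder loss for illustration
loss.backward()

print(all(p.grad is None for p in encoder.parameters()))  # True: encoder frozen
```

Freezing outright is the safer default when the new task is close to the curriculum distribution; the low-LR variant trades some forgetting risk for better adaptation when the new task's visuals differ substantially.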
When to Skip Curriculum
Curriculum design takes time and only pays off for complex tasks. Not every project warrants it:
- L1–L2 tasks: go direct. Simple single-step or varied-grasp tasks can be trained end-to-end in 100–1,000 demos. The overhead of designing a curriculum exceeds the savings.
- L3 tasks: situation-dependent. If you have a clear skill decomposition and already have primitive skill data from another project, use curriculum. If starting from scratch for a single L3 task, direct training is often faster.
- L4–L5 tasks: curriculum is essential. Attempting to train contact-rich assembly or bimanual tasks directly from scratch without curriculum is expensive and usually unsuccessful. At this level, the curriculum design effort is well-justified.
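The decision rule above condenses into a small helper. The function name and arguments are invented for this sketch; the thresholds come straight from the bullets:

```python
# The go/no-go curriculum decision from this section as a hypothetical helper.
def should_use_curriculum(level, have_primitive_data=False,
                          clear_decomposition=False):
    """Return True when curriculum design is worth the overhead."""
    if level in ("L1", "L2"):
        return False                 # go direct: overhead exceeds savings
    if level == "L3":
        # situation-dependent: need both a clear decomposition and
        # existing primitive data for curriculum to pay off
        return have_primitive_data and clear_decomposition
    return True                      # L4-L5: curriculum is essential

print(should_use_curriculum("L2"))              # False
print(should_use_curriculum("L3", True, True))  # True
print(should_use_curriculum("L5"))              # True
```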
Data Collection Order
Collect data in curriculum order — do not collect all task data before beginning training. This allows you to use earlier task policies as a starting point for collecting harder task demonstrations:
- Collect L1 data first: Train L1 policies to >95% success. These policies can now autonomously collect approach trajectories for L2/L3 data collection (operator corrects near contact, not full approach).
- Use trained policies for data augmentation: A trained reaching policy can autonomously execute the approach phase while a human operator teleoperates from the grasp phase onward. This reduces operator cognitive load and produces more consistent demonstrations.
- Prioritize consistency over speed in early stages: L1 and L2 demonstrations should be executed slowly and deliberately. The policy will learn the approximate speed from the data — slow, clear demonstrations are more informative than fast ones that are harder to decompose.
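The policy-assisted collection loop described above can be sketched as a two-phase recorder: the trained reach policy drives the approach, and control hands off to the operator near contact. Everything here (`ToyEnv`, `collect_demo`, the 2 cm handoff distance reused from the reach-skill spec) is illustrative, not a real teleoperation API:

```python
# Hypothetical policy-assisted demo collection: autonomous approach phase,
# human teleoperation from near-contact onward, every step labeled by source.
class ToyEnv:
    """Stand-in environment: a 1-D distance-to-object that shrinks each step."""
    def __init__(self):
        self.dist = 0.10
        self.steps_after_contact = 0

    def reset(self):
        self.__init__()
        return self.dist

    def step(self, action):
        if self.dist > 0:
            self.dist = max(0.0, self.dist - 0.03)  # toy dynamics, ignores action
        else:
            self.steps_after_contact += 1
        return self.dist

    def distance_to_object(self, obs):
        return self.dist

    def episode_done(self, obs):
        return self.steps_after_contact >= 3

def collect_demo(reach_policy, operator, env, handoff_dist=0.02):
    """Record one demonstration: autonomous approach + human grasp phase."""
    traj = []
    obs = env.reset()
    while env.distance_to_object(obs) > handoff_dist:  # autonomous approach
        action = reach_policy(obs)
        obs = env.step(action)
        traj.append((obs, action, "policy"))
    while not env.episode_done(obs):                   # operator takes over
        action = operator(obs)
        obs = env.step(action)
        traj.append((obs, action, "human"))
    return traj

demo = collect_demo(lambda obs: -0.03, lambda obs: 0.0, ToyEnv())
print(sum(1 for *_, who in demo if who == "policy"),
      sum(1 for *_, who in demo if who == "human"))
```

Labeling each step with its source ("policy" vs "human") is worth the bookkeeping: it lets you later filter or reweight the autonomous approach segments separately from the human-provided contact phase.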