The Quick Decision Rule
Before the detailed analysis: if you can easily demonstrate the task with a teleoperation system, use imitation learning. If you cannot demonstrate it but can define a clear scalar reward signal, use RL. If neither applies, the task is too complex — start with a simpler sub-task.
This rule is correct for 80% of practical manipulation scenarios. The remaining 20% require hybrid approaches or careful analysis of the tradeoffs below.
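The quick rule can be written down as a small decision function. The `TaskSpec` type and its fields are illustrative, not from any library; the two booleans stand for engineering judgment calls, not measurable quantities.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    # Hypothetical task description. Both fields are judgment calls
    # made by the engineer, not quantities you can measure directly.
    demonstrable: bool    # can a teleoperator reliably show the task?
    scalar_reward: bool   # can success be scored with one number?

def choose_approach(task: TaskSpec) -> str:
    """Apply the quick decision rule from the text, in order."""
    if task.demonstrable:
        return "imitation learning"
    if task.scalar_reward:
        return "reinforcement learning"
    return "decompose into simpler sub-tasks"
```

For example, `choose_approach(TaskSpec(demonstrable=True, scalar_reward=False))` returns `"imitation learning"`: demonstrability is checked first, so a task that is both demonstrable and rewardable defaults to IL.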
Imitation Learning: Strengths and Limits
IL advantages that are often undervalued:
- Fast development cycle: 100 demonstrations → trained policy in 2-4 hours. Full iteration cycle (collect, train, evaluate, collect more) can complete in a day.
- Safe by construction: The policy learns to stay near demonstrated behavior. It will not discover dangerous edge cases the way an RL explorer might.
- No reward engineering: Defining a reward function for precision assembly is surprisingly difficult. IL bypasses this entirely.
- Predictable cost: The cost per performance improvement is linear and measurable — collect 100 more demos, expect N% improvement.
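The fast iteration cycle follows from how simple the core of behavioral cloning is: supervised regression on (observation, action) pairs. A minimal NumPy sketch, using a linear policy fit by least squares as a stand-in for the ACT- or diffusion-style networks real pipelines use; the dimensions and synthetic demo data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demo data: 100 (observation, action) pairs, as if
# collected from teleoperation. true_W stands in for the operator.
obs_dim, act_dim, n_demos = 8, 4, 100
observations = rng.normal(size=(n_demos, obs_dim))
true_W = rng.normal(size=(obs_dim, act_dim))
actions = observations @ true_W + 0.01 * rng.normal(size=(n_demos, act_dim))

# Behavioral cloning reduces to supervised regression on demo pairs.
# Here that is a least-squares fit of a linear policy; real systems
# swap in a neural network and gradient descent, same data flow.
W_hat, *_ = np.linalg.lstsq(observations, actions, rcond=None)

def policy(obs: np.ndarray) -> np.ndarray:
    return obs @ W_hat

# The cloned policy closely reproduces the demonstrated actions.
mse = float(np.mean((policy(observations) - actions) ** 2))
```

Note what is absent: no reward function, no environment loop, no exploration. That absence is where the speed and the safety-by-construction properties come from.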
IL limits:
- Bounded by demonstrator quality: The policy cannot exceed the skill of the operators who provided demonstrations.
- Compounding error: Behavioral cloning drifts from the training distribution over long time horizons.
- Cannot discover non-obvious solutions: If there is a strategy that is hard to demonstrate but easy to execute (e.g., using inertial dynamics to fling an object into a bin), IL will never find it.
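The compounding-error limit can be made concrete with a back-of-envelope model: if the cloned policy makes an unrecoverable mistake with probability ε at each step, independently, then success over a horizon of T steps decays as (1 − ε)^T. This is a simplification (real drift is correlated, which usually makes things worse), but it shows why a per-step error that is invisible in short clips dominates long-horizon behavior.

```python
def success_rate(per_step_error: float, horizon: int) -> float:
    """Probability of completing a horizon-length task if each step
    independently fails with probability per_step_error. Simplified
    model: real BC drift is correlated across steps, typically worse."""
    return (1.0 - per_step_error) ** horizon

short = success_rate(0.01, 10)    # ~0.90: looks fine on short tasks
long = success_rate(0.01, 200)    # ~0.13: collapses on long horizons
```

The same arithmetic motivates the hybrid fixes later in this section: DAgger attacks ε directly by collecting corrections in the states the policy actually visits.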
Reinforcement Learning: Strengths and Limits
RL advantages:
- Can exceed human performance: RL discovered superhuman strategies in games, and has found non-intuitive but superior grasping strategies for specific objects.
- Handles non-demonstrable behaviors: Tasks where the human cannot easily control the robot (high-speed dynamics, complex contact sequences) are natural RL targets.
- Long-horizon planning: With well-shaped rewards, RL can handle 20-50 step tasks that are prone to compounding error in IL.
RL limits that are often underestimated:
- Sample efficiency: Modern manipulation RL requires 500K-5M environment steps for simple tasks. At a 10 Hz control rate, that is 14-140 hours of simulated experience, orders of magnitude more data than IL needs.
- Reward engineering difficulty: Sparse rewards (success/failure only) rarely work without careful shaping. Dense reward functions for manipulation are hard to specify correctly. Incorrectly shaped rewards produce policies that exploit the reward rather than solve the task.
- Sim-to-real dependency: Real-world RL on physical robots is possible but slow and expensive. RL almost always requires simulation, which requires sim-to-real transfer.
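To make the reward-engineering point concrete, here is a sketch of a shaped dense reward for a reach-and-grasp task. Every term and coefficient is an illustrative assumption that would need tuning per task, and mis-weighting any of them is exactly how reward exploitation arises.

```python
import numpy as np

def shaped_reward(ee_pos, obj_pos, goal_pos, grasped: bool) -> float:
    """Illustrative dense reward: approach term, transport term, bonuses.
    All coefficients are assumptions, not validated values."""
    reach = -np.linalg.norm(ee_pos - obj_pos)        # pull gripper to object
    transport = -np.linalg.norm(obj_pos - goal_pos)  # pull object to goal
    grasp_bonus = 0.5 if grasped else 0.0
    at_goal = np.linalg.norm(obj_pos - goal_pos) < 0.02
    success_bonus = 2.0 if (grasped and at_goal) else 0.0
    return reach + transport + grasp_bonus + success_bonus
```

A typical exploit: if `grasp_bonus` is large relative to the transport term, the learned policy may grasp the object and freeze, collecting the bonus forever without ever moving toward the goal. Detecting and patching these exploits is most of the engineering cost the bullet above refers to.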
Hybrid Approaches: The Best of Both
| Approach | Description | Benefit | When to Use |
|---|---|---|---|
| IL warm-start + RL fine-tuning | Train IL policy first, use as RL initialization | 10× more efficient RL | When IL policy is 60-70% and you need 90%+ |
| GAIL | RL with reward from discriminator trained on demos | No reward engineering | When reward is hard to specify |
| DAgger | IL with online data collection from expert | Fixes compounding error | When BC diverges at test time |
| Residual RL | Base IL policy + RL residual corrector | Safe exploration near IL behavior | Precision refinement of IL policies |
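Residual RL from the table above is mechanically simple: the executed action is the frozen IL policy's output plus a small learned correction, and clipping the correction bounds how far exploration can stray from demonstrated behavior. A sketch, where both policies are placeholder functions standing in for trained networks:

```python
import numpy as np

def residual_action(obs, base_policy, residual_policy, clip=0.05):
    """Compose a frozen IL base policy with a learned RL residual.
    Clipping the residual is what gives the 'safe exploration near
    IL behavior' property: the executed action can never deviate
    from the base by more than +/- clip per dimension."""
    base = base_policy(obs)
    correction = np.clip(residual_policy(obs), -clip, clip)
    return base + correction

# Placeholders standing in for trained networks (assumptions):
base = lambda obs: np.tanh(obs)     # pretrained IL policy
residual = lambda obs: 0.2 * obs    # RL corrector, pre-clip
```

Only the residual network is trained by RL, and its output range is small, so the RL problem is far easier than learning the task from scratch while the robot stays within a tube around known-good behavior.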
Task-Type Recommendations
| Task Type | Recommended Approach | Reason |
|---|---|---|
| Pick-and-place (varied poses) | IL (ACT or Diffusion) | Easily demonstrated; minimal contact |
| Precision peg insertion | IL or IL + Residual RL | Demonstrable; RL refines final alignment |
| Dexterous in-hand manipulation | RL (in sim) + IL for initialization | Hard to demonstrate; RL finds solutions |
| Long-horizon assembly (10+ steps) | Hierarchical: task planner + IL per step | Neither alone handles long horizon well |
| Locomotion gait optimization | RL | Non-demonstrable, clear energy reward |
| Tool use (novel tools) | IL for known tools, RL for generalization | Few-shot IL + sim RL exploration |
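The hierarchical recommendation for long-horizon assembly amounts to a symbolic task planner that sequences short-horizon IL skills, so no single policy ever faces the full horizon. A minimal sketch; the skill names, the dict-based registry, and the string "actions" are all hypothetical stand-ins for trained per-step policies.

```python
from typing import Callable, Dict, List

# Each skill is a short-horizon IL policy trained on its own demos;
# the planner only sequences them. Names here are hypothetical.
SkillPolicy = Callable[[dict], str]

skills: Dict[str, SkillPolicy] = {
    "pick_bracket":  lambda state: "grasp at bracket pose",
    "align_bracket": lambda state: "servo to alignment target",
    "insert_screw":  lambda state: "spiral-search insertion",
}

def execute_plan(plan: List[str], state: dict) -> List[str]:
    """Run a fixed symbolic plan, delegating each step to an IL skill.
    A real executive would detect failures and re-plan; this sketch
    assumes every skill succeeds."""
    return [skills[step](state) for step in plan]
```

The design point: compounding error is contained because each IL policy runs for only a few seconds, and the planner, not a learned policy, carries the long-horizon structure.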
SVRC's RL environment service provides simulation setups for RL training, and our data services handle IL demonstration collection — often used in combination for the hybrid approaches described above.