The Quick Decision Rule

Before the detailed analysis: if you can easily demonstrate the task with a teleoperation system, use imitation learning. If you cannot demonstrate it but can define a clear scalar reward signal, use RL. If neither applies, the task is too complex — start with a simpler sub-task.

This rule resolves roughly 80% of practical manipulation scenarios. The remaining 20% call for hybrid approaches or a careful analysis of the tradeoffs below.
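
The quick decision rule above can be sketched as a small helper function. This is purely illustrative; the function name and the boolean inputs are hypothetical, not part of any library.

```python
def choose_approach(demonstrable: bool, scalar_reward: bool) -> str:
    """Map the quick decision rule to a recommendation.

    demonstrable:  can you easily demonstrate the task via teleoperation?
    scalar_reward: can you define a clear scalar reward signal?
    """
    if demonstrable:
        return "imitation learning"
    if scalar_reward:
        return "reinforcement learning"
    # Neither applies: the task is too complex as stated.
    return "decompose into simpler sub-tasks"
```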

Imitation Learning: Strengths and Limits

IL advantages that are often undervalued:

  • Fast development cycle: 100 demonstrations → trained policy in 2-4 hours. Full iteration cycle (collect, train, evaluate, collect more) can complete in a day.
  • Safe by construction: The policy learns to stay near demonstrated behavior. It will not discover dangerous edge cases the way an RL explorer might.
  • No reward engineering: Defining a reward function for precision assembly is surprisingly difficult. IL bypasses this entirely.
  • Predictable cost: Performance scales roughly linearly with data, so the cost of each improvement is measurable in advance: collect 100 more demos, benchmark, and extrapolate.

IL limits:

  • Bounded by demonstrator quality: The policy cannot exceed the skill of the operators who provided demonstrations.
  • Compounding error: Behavioral cloning drifts from the training distribution over long horizons: small per-step mistakes push the robot into states absent from the demonstrations, where its errors grow further.
  • Cannot discover non-obvious solutions: If there is a strategy that is hard to demonstrate but easy to execute (e.g., using inertial dynamics to fling an object into a bin), IL will never find it.
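The compounding-error limit can be made quantitative with a back-of-the-envelope model: if the policy makes an unrecoverable deviation with probability ε at each step, and deviations are independent, the chance of completing a horizon of T steps is (1 − ε)^T. This is a simplification (real errors are neither independent nor always unrecoverable), but it shows why the same policy can look reliable on short tasks and fail on long ones.

```python
def success_rate(per_step_error: float, horizon: int) -> float:
    # Probability the policy stays on-distribution for all `horizon` steps,
    # assuming independent, unrecoverable per-step deviations.
    return (1.0 - per_step_error) ** horizon

# A 1% per-step error rate is benign at short horizons but not long ones:
# success_rate(0.01, 10)  ≈ 0.90
# success_rate(0.01, 200) ≈ 0.13
```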

Reinforcement Learning: Strengths and Limits

RL advantages:

  • Can exceed human performance: RL discovered superhuman strategies in games, and has found non-intuitive but superior grasping strategies for specific objects.
  • Handles non-demonstrable behaviors: Tasks where the human cannot easily control the robot (high-speed dynamics, complex contact sequences) are natural RL targets.
  • Long-horizon planning: With well-shaped rewards, RL can handle 20-50 step tasks that are prone to compounding error in IL.

RL limits that are often underestimated:

  • Sample efficiency: Modern manipulation RL requires 500K-5M environment steps for simple tasks. At 10Hz simulation, that is 14-140 hours of simulated time — and orders of magnitude more data than IL.
  • Reward engineering difficulty: Sparse rewards (success/failure only) rarely work without careful shaping. Dense reward functions for manipulation are hard to specify correctly. Incorrectly shaped rewards produce policies that exploit the reward rather than solve the task.
  • Sim-to-real dependency: Real-world RL on physical robots is possible but slow and expensive. RL almost always requires simulation, which requires sim-to-real transfer.
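To make the reward-engineering point concrete, here is a toy dense reward for peg insertion. Everything here is illustrative, not a recipe: the function, its arguments, and the weight on the bonus are assumptions. Note that a policy can exploit the shaping term, e.g. by hovering near the hole to accumulate low-penalty steps without ever inserting, which is exactly the failure mode described above.

```python
import math

def insertion_reward(peg_pos, hole_pos, inserted):
    """Toy dense reward for peg insertion (illustrative only)."""
    dist = math.dist(peg_pos, hole_pos)   # Euclidean distance (Python 3.8+)
    shaping = -dist                       # dense term: pulls peg toward hole
    bonus = 10.0 if inserted else 0.0     # sparse term: the actual objective
    return shaping + bonus
```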

Hybrid Approaches: The Best of Both

| Approach | Description | Benefit | When to Use |
|---|---|---|---|
| IL warm-start + RL fine-tuning | Train IL policy first, use as RL initialization | 10× more efficient RL | When IL policy is 60-70% and you need 90%+ |
| GAIL | RL with reward from a discriminator trained on demos | No reward engineering | When reward is hard to specify |
| DAgger | IL with online data collection from expert | Fixes compounding error | When BC diverges at test time |
| Residual RL | Base IL policy + RL residual corrector | Safe exploration near IL behavior | Precision refinement of IL policies |
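The residual RL pattern is simple enough to sketch directly. In this hypothetical snippet (function and parameter names are assumptions), the frozen IL policy produces the base action and the RL corrector contributes a small, bounded delta, so exploration stays near demonstrated behavior.

```python
def residual_action(base_policy, residual_policy, obs, scale=0.1, limit=1.0):
    """Residual RL sketch: combine a frozen IL base action with a
    scaled RL correction, then clip to the actuator limits."""
    combined = [b + scale * r
                for b, r in zip(base_policy(obs), residual_policy(obs))]
    # Clipping keeps exploration within a bounded envelope around IL behavior.
    return [max(-limit, min(limit, a)) for a in combined]

# Example with stand-in policies: base outputs 0.5, residual corrects by 1.0,
# giving approximately [0.6] after scaling.
```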

Task-Type Recommendations

| Task Type | Recommended Approach | Reason |
|---|---|---|
| Pick-and-place (varied poses) | IL (ACT or Diffusion) | Easily demonstrated, minimal contact |
| Precision peg insertion | IL or IL + Residual RL | Demonstrable; RL refines final alignment |
| Dexterous in-hand manipulation | RL (in sim) + IL for initialization | Hard to demonstrate; RL finds solutions |
| Long-horizon assembly (10+ steps) | Hierarchical: task planner + IL per step | Neither alone handles long horizons well |
| Locomotion gait optimization | RL | Non-demonstrable, clear energy reward |
| Tool use (novel tools) | IL for known tools, RL for generalization | Few-shot IL + sim RL exploration |

SVRC's RL environment service provides simulation setups for RL training, and our data services handle IL demonstration collection — often used in combination for the hybrid approaches described above.