Why a Difficulty Scale Matters

The single most common failure mode in robot learning projects is misaligned expectations: a team allocates 2 months and 500 demonstrations for a task that requires 6 months and 5,000 demonstrations. These failures are largely preventable with better upfront task characterization.

The difficulty scale below is designed to answer three questions before any data collection begins: How much data does this task require? Which algorithm should we use? What is a realistic timeline? It is based on empirical results from the manipulation learning literature and SVRC's own deployment experience.

The 5-Level Difficulty Framework

Level 1 — Unconstrained Top Grasp

The simplest class of manipulation tasks. The robot must grasp a single rigid object from a fixed or near-fixed position using a top-down approach. There are no contact constraints during approach, the grasp tolerance is large (>5 mm), only one object is present, and the background is stable.

  • Example tasks: Picking a soup can from a conveyor, transferring a block from bin to tray, picking a mug from a table surface.
  • Typical demonstrations needed: 200–500
  • Recommended algorithm: Behavior Cloning (BC) or ACT with an action-chunk horizon of H=50
  • Expected success rate after training: 85–95%
  • Key failure modes: Lighting changes, novel background, object off-center by >8 cm

Level 2 — Varied Grasps with Occlusion

Multiple valid grasp strategies, partial occlusion, varied object orientation, or modest background clutter. The robot must select from several viable approaches rather than executing a single canonical motion.

  • Example tasks: Picking items from a cluttered bin, grasping a bottle from any orientation, pouring from a pitcher.
  • Typical demonstrations needed: 500–2,000
  • Recommended algorithm: ACT or Diffusion Policy
  • Expected success rate: 75–88%
  • Key challenge: Policy must handle multi-modality — multiple valid grasps must be preserved, not averaged.

Level 3 — Multi-Step Tasks and Tool Use

Tasks requiring two or more sequential manipulation steps, tool use (scissors, screwdriver), or reorientation of an object between grasp phases. Error compounding across steps is the dominant challenge.

  • Example tasks: Open a drawer then retrieve an item, use a spatula to flip food, screw a cap onto a bottle.
  • Typical demonstrations needed: 2,000–10,000
  • Recommended algorithm: ACT with temporal ensemble, or hierarchical policy
  • Expected success rate: 60–80%
  • Key challenge: Per-step success rates compound: two steps at 85% each yield about 72% full-task success (0.85 × 0.85 ≈ 0.72).
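
The compounding arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes each step succeeds or fails independently of the others, which real multi-step tasks only approximate; the function names are hypothetical and not part of any SVRC tooling.

```python
import math


def full_task_success(step_success_rates):
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(step_success_rates)


def required_per_step_rate(target_success, num_steps):
    """Per-step rate needed for num_steps equally reliable steps to reach target_success."""
    if not 0.0 < target_success <= 1.0 or num_steps < 1:
        raise ValueError("target_success must be in (0, 1] and num_steps >= 1")
    return target_success ** (1.0 / num_steps)


# The two-step case from the text: 0.85 * 0.85 = 0.7225, i.e. about 72%.
two_step = full_task_success([0.85, 0.85])
print(f"Two steps at 85% each: {two_step:.4f}")

# Adding a third step at the same per-step rate lowers the total further.
three_step = full_task_success([0.85, 0.85, 0.85])
print(f"Three steps at 85% each: {three_step:.4f}")
```

Running the inverse function for a target full-task rate shows how quickly per-step requirements tighten as the number of steps grows, which is why Level 3 tasks need disproportionately more data than their individual steps suggest.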

Level 4 — Contact-Rich Assembly

Tasks requiring precision contact and force control: peg insertion, snap-fit assembly, connector mating, drawer/door manipulation. Position precision requirements <3 mm and precise force control are characteristic of this level.

  • Example tasks: USB connector insertion, PCB component placement, peg-in-hole (±1 mm tolerance), assembly with snap fits.
  • Typical demonstrations needed: 10,000–50,000
  • Recommended algorithm: Diffusion Policy (Transformer), ACT with wrist camera
  • Expected success rate: 50–75%
  • Key challenge: The spatial resolution of a fixed camera (1 pixel ≈ 0.5–1 mm at typical distances) is comparable to the required precision, so pixel-level localization error alone can use up most of the tolerance. A wrist camera is near-mandatory.
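
To see where the 0.5–1 mm per pixel figure comes from, the following sketch estimates the scene width covered by one pixel. It assumes an idealized pinhole camera looking straight at the work surface; the resolution, field of view, and distances are hypothetical example values, not measurements from a specific setup.

```python
import math


def mm_per_pixel(distance_mm, hfov_deg, image_width_px):
    """Approximate scene width covered by one pixel (pinhole model, no lens distortion)."""
    view_width_mm = 2.0 * distance_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return view_width_mm / image_width_px


# Hypothetical fixed camera: 640 px wide, 60 degree horizontal FOV, 500 mm from the part.
fixed_cam = mm_per_pixel(distance_mm=500, hfov_deg=60, image_width_px=640)

# Hypothetical wrist camera with the same sensor, 100 mm from the part.
wrist_cam = mm_per_pixel(distance_mm=100, hfov_deg=60, image_width_px=640)

print(f"Fixed camera: {fixed_cam:.2f} mm/pixel")  # roughly 0.9 mm per pixel
print(f"Wrist camera: {wrist_cam:.2f} mm/pixel")  # roughly 0.2 mm per pixel
```

Under these assumptions the fixed camera resolves about as finely as a ±1 mm tolerance allows, while moving the same sensor to the wrist improves resolution in proportion to the reduced distance.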

Level 5 — Deformable Objects and Bimanual Coordination

The hardest class of tasks: deformable materials (cloth, dough, cables), bimanual coordination requiring synchronized dual-arm trajectories, or tasks with very long horizons (>60 seconds) combining multiple L3/L4 subtasks.

  • Example tasks: Cloth folding, cable routing, surgical suturing, sandwich assembly with spreading.
  • Typical demonstrations needed: 50,000+, or foundation model fine-tuning
  • Recommended algorithm: π0 (Physical Intelligence) for bimanual; fine-tuned OpenVLA for long-horizon; custom diffusion for deformables
  • Expected success rate: 40–65%
  • Key challenge: Deformable state space is infinite-dimensional. Current robot learning approaches approximate rather than solve these tasks.

Task Taxonomy Reference Table

| Task | Difficulty Level | Typical Demos | Algorithm | Notes |
|---|---|---|---|---|
| Top-down bin pick (uniform) | L1 | 200–500 | BC | Fixed SKU, structured environment |
| Mixed bin picking | L2 | 500–2,000 | ACT | Variable object pose, clutter |
| Bottle cap screwing | L3 | 2,000–8,000 | ACT | Force + rotation coordination |
| Drawer open + retrieve | L3 | 3,000–10,000 | ACT / Hierarchical | Two sub-tasks chained |
| USB insertion | L4 | 10,000–30,000 | Diffusion-T | Precision ±2 mm |
| PCB component place | L4 | 20,000–50,000 | Diffusion-T + wrist cam | ±0.5 mm precision |
| T-shirt folding | L5 | 50,000+ or FM | π0 / Diffusion | Deformable, bimanual |
| Cable routing through clips | L5 | 30,000–80,000 | Custom | High variance, contact-rich |
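
For scoping, the per-level figures from the framework above can be kept as a small lookup and checked against a planned data budget. The sketch below is one possible encoding: the numbers are copied from the level descriptions, while the `LevelSpec` class and `demo_shortfall` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LevelSpec:
    name: str
    min_demos: int
    max_demos: Optional[int]  # None means open-ended ("50,000+")
    algorithm: str
    success_pct: Tuple[int, int]  # expected success rate range, in percent


LEVELS = {
    1: LevelSpec("Unconstrained Top Grasp", 200, 500,
                 "BC or ACT", (85, 95)),
    2: LevelSpec("Varied Grasps with Occlusion", 500, 2_000,
                 "ACT or Diffusion Policy", (75, 88)),
    3: LevelSpec("Multi-Step Tasks and Tool Use", 2_000, 10_000,
                 "ACT with temporal ensemble, or hierarchical policy", (60, 80)),
    4: LevelSpec("Contact-Rich Assembly", 10_000, 50_000,
                 "Diffusion Policy (Transformer), ACT with wrist camera", (50, 75)),
    5: LevelSpec("Deformable Objects and Bimanual Coordination", 50_000, None,
                 "Foundation model fine-tuning or custom diffusion", (40, 65)),
}


def demo_shortfall(level, planned_demos):
    """Demos missing relative to the low end of the level's typical range (0 if none)."""
    return max(0, LEVELS[level].min_demos - planned_demos)


# Example: a 500-demonstration budget for a Level 3 task.
gap = demo_shortfall(level=3, planned_demos=500)
print(f"Level 3 with 500 demos: short by at least {gap} demonstrations")
```

A check like this only compares against the low end of each range, so a zero shortfall means the budget is plausible, not that it is sufficient.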

Factors That Raise Difficulty

  • Precision requirement: Each order-of-magnitude tightening of the required precision (10 mm → 1 mm → 0.1 mm) roughly doubles the data requirement and may require algorithm changes.
  • Occlusion: When the robot arm or gripper occludes the camera view during critical action phases, data requirements increase significantly. Wrist cameras partially mitigate this.
  • Two-hand coordination: Any task requiring synchronized dual-arm motion is automatically L4/L5, because the joint state space of two arms is far larger than that of one arm and the arms' motions must be timed against each other.
  • State variability: If the environment changes between demonstrations (different drawer position, different object count in bin), the data requirement grows with the dimensionality of the variability.
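
The precision rule of thumb above (roughly 2× data per order of magnitude) can be written as a simple multiplier. This is a rough sketch of that heuristic only, not a validated scaling law; the function names and the baseline numbers in the example are hypothetical, and treating looser tolerances as a multiplier of 1 is an added assumption.

```python
import math


def precision_multiplier(base_tolerance_mm, target_tolerance_mm):
    """Data multiplier: about 2x per order of magnitude of tighter tolerance."""
    if base_tolerance_mm <= 0 or target_tolerance_mm <= 0:
        raise ValueError("tolerances must be positive")
    orders_of_magnitude = math.log10(base_tolerance_mm / target_tolerance_mm)
    # Assumption: a looser target than the baseline does not reduce the estimate.
    return max(1.0, 2.0 ** orders_of_magnitude)


def scaled_demos(base_demos, base_tolerance_mm, target_tolerance_mm):
    """Scale a baseline demonstration count by the precision multiplier."""
    return round(base_demos * precision_multiplier(base_tolerance_mm, target_tolerance_mm))


# Hypothetical baseline: 1,000 demos sufficed at 10 mm tolerance.
for target_mm in (10.0, 1.0, 0.1):
    print(f"{target_mm} mm tolerance: about {scaled_demos(1_000, 10.0, target_mm)} demos")
```

The other factors in this list (occlusion, bimanual coordination, state variability) are not captured here and would raise the estimate further.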

Use this scale when scoping projects with the SVRC data services team. For tasks at L3–L5, we recommend a paid scoping consultation before committing to a large data collection program.