A Difficulty Scale for Dexterous Manipulation Tasks

Why a Difficulty Scale Matters

The single most common failure mode in robot learning projects is misaligned expectations: a team allocates 2 months and 500 demonstrations for a task that requires 6 months and 5,000 demonstrations. These failures are largely preventable with better upfront task characterization.

The difficulty scale below is designed to answer three questions before any data collection begins: How much data does this task require? Which algorithm should we use? What is a realistic timeline? It is based on empirical results from the manipulation learning literature and SVRC's own deployment experience.

The 5-Level Difficulty Framework

Level 1 — Unconstrained Top Grasp

The simplest class of manipulation tasks. The robot must grasp a single rigid object from a fixed or near-fixed position using a top-down approach. There are no contact constraints during approach, the grasp tolerance is large (>5 mm), and the scene contains a single object against a stable background.

Example tasks: Picking a soup can from a conveyor, transferring a block from bin to tray, picking a mug from a table surface.
Typical demonstrations needed: 200–500
Recommended algorithm: Behavior Cloning (BC) or ACT with H=50
Expected success rate after training: 85–95%
Key failure modes: Lighting changes, novel background, object off-center by >8 cm

Level 2 — Varied Grasps with Occlusion

Multiple valid grasp strategies, partial occlusion, varied object orientation, or modest background clutter. The robot must select from several viable approaches rather than executing a single canonical motion.

Example tasks: Picking items from a cluttered bin, grasping a bottle from any orientation, pouring from a pitcher.
Typical demonstrations needed: 500–2,000
Recommended algorithm: ACT or Diffusion Policy
Expected success rate: 75–88%
Key challenge: The policy must handle multi-modality. When several grasps are valid, it has to commit to one of them; averaging across them produces an intermediate motion that matches none.
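To make the averaging problem concrete, here is a minimal sketch using a hypothetical one-dimensional case: for the same observation, demonstrators approach a bottle from either about -45° or about +45°. A policy that fits a single action by minimizing mean squared error converges on the arithmetic mean, which lies between the two modes. The angles and function names are illustrative assumptions, not data from the deployments described here.

```python
from statistics import mean

# Hypothetical demonstrated approach angles (degrees) for one fixed observation:
# half the demonstrations grasp from the left, half from the right.
demo_angles = [-45.0, -44.0, -46.0, 45.0, 46.0, 44.0]

def mse_optimal_action(actions):
    """The single action that minimizes mean squared error is the arithmetic mean."""
    return mean(actions)

def distance_to_nearest_demo(action, actions):
    """How far a predicted action is from the closest demonstrated action."""
    return min(abs(action - a) for a in actions)

averaged = mse_optimal_action(demo_angles)
gap = distance_to_nearest_demo(averaged, demo_angles)
print(f"averaged action: {averaged:.1f} deg, nearest demo is {gap:.1f} deg away")
```

The averaged action sits tens of degrees away from every demonstrated grasp, which is why this level recommends ACT or Diffusion Policy over a plain regression-style policy.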

Level 3 — Multi-Step Tasks and Tool Use

Tasks requiring two or more sequential manipulation steps, tool use (scissors, screwdriver), or reorientation of an object between grasp phases. Error compounding across steps is the dominant challenge.

Example tasks: Open a drawer then retrieve an item, use a spatula to flip food, screw a cap onto a bottle.
Typical demonstrations needed: 2,000–10,000
Recommended algorithm: ACT with temporal ensemble, or hierarchical policy
Expected success rate: 60–80%
Key challenge: Per-step success rates compound multiplicatively. Two steps at 85% each yield roughly 72% full-task success (0.85 × 0.85 ≈ 0.72).
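The compounding arithmetic can be sketched in a few lines. This assumes step outcomes are independent, which is a simplification (in practice a sloppy first step often makes the second one harder), and the function names are illustrative.

```python
import math

def full_task_success(per_step_rates):
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(per_step_rates)

def required_per_step_rate(target_task_rate, n_steps):
    """Uniform per-step success rate needed to hit a target full-task rate."""
    return target_task_rate ** (1.0 / n_steps)

two_step = full_task_success([0.85, 0.85])
needed = required_per_step_rate(0.80, 3)
print(f"two steps at 85% each: {two_step:.2%}")
print(f"per-step rate needed for 80% over 3 steps: {needed:.2%}")
```

Running the inverse calculation before scoping is useful: a three-step task with an 80% target needs each step to succeed about 93% of the time, which is already at the upper end of what Level 1 tasks achieve.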

Level 4 — Contact-Rich Assembly

Tasks requiring precision contact and force control: peg insertion, snap-fit assembly, connector mating, drawer/door manipulation. Positional tolerances below 3 mm and a need for precise force control characterize this level.

Example tasks: USB connector insertion, PCB component placement, peg-in-hole (±1 mm tolerance), assembly with snap fits.
Typical demonstrations needed: 10,000–50,000
Recommended algorithm: Diffusion Policy (Transformer), ACT with wrist camera
Expected success rate: 50–75%
Key challenge: The footprint of a single camera pixel (1 pixel ≈ 0.5–1 mm at typical distances) is comparable to the required precision, so image-based position estimates are barely fine enough. A wrist camera is near-mandatory.
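A rough way to check the pixel-footprint figure for a given setup is the pinhole-camera approximation below. The camera distances, field of view, and image width are hypothetical values chosen for illustration; substitute the numbers for the actual rig.

```python
import math

def mm_per_pixel(distance_mm, hfov_deg, image_width_px):
    """Approximate width of one pixel's footprint on a plane facing the camera.

    Uses a pinhole model and ignores lens distortion and oblique viewing angles.
    """
    scene_width_mm = 2.0 * distance_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return scene_width_mm / image_width_px

# Hypothetical setups, not measurements from any specific deployment.
overhead = mm_per_pixel(distance_mm=600, hfov_deg=69, image_width_px=1280)
wrist = mm_per_pixel(distance_mm=150, hfov_deg=69, image_width_px=1280)
print(f"overhead camera: {overhead:.2f} mm/pixel")
print(f"wrist camera:    {wrist:.2f} mm/pixel")
```

Under this model the footprint scales linearly with distance, so moving the same camera four times closer shrinks each pixel's footprint by a factor of four. That is the main reason a wrist camera helps at this level.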

Level 5 — Deformable Objects and Bimanual Coordination

The hardest class of tasks: deformable materials (cloth, dough, cables), bimanual coordination requiring synchronized dual-arm trajectories, or tasks with very long horizons (>60 seconds) combining multiple L3/L4 subtasks.

Example tasks: Cloth folding, cable routing, surgical suturing, sandwich assembly with spreading.
Typical demonstrations needed: 50,000+, or foundation model fine-tuning
Recommended algorithm: π0 (Physical Intelligence) for bimanual; fine-tuned OpenVLA for long-horizon; custom diffusion for deformables
Expected success rate: 40–65%
Key challenge: Deformable state space is infinite-dimensional. Current robot learning approaches approximate rather than solve these tasks.
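The five level definitions can be encoded as a simple first-pass classifier. The sketch below is one possible reading of the criteria above, not an official SVRC tool: the field names and the rule ordering (check the hardest level first) are assumptions, and the demonstration ranges are copied from the level descriptions.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    # Hypothetical field names that paraphrase the level descriptions.
    deformable: bool = False
    bimanual: bool = False
    horizon_s: float = 10.0
    precision_mm: float = 10.0
    needs_force_control: bool = False
    n_steps: int = 1
    tool_use: bool = False
    multiple_grasps_or_occlusion: bool = False

# (min demos, max demos) per level; None means no stated upper bound.
DEMO_RANGE = {
    1: (200, 500),
    2: (500, 2_000),
    3: (2_000, 10_000),
    4: (10_000, 50_000),
    5: (50_000, None),
}

def classify(task):
    """Return the highest difficulty level whose criteria the task meets."""
    if task.deformable or task.bimanual or task.horizon_s > 60:
        return 5
    if task.precision_mm < 3 or task.needs_force_control:
        return 4
    if task.n_steps >= 2 or task.tool_use:
        return 3
    if task.multiple_grasps_or_occlusion:
        return 2
    return 1

usb_insertion = TaskProfile(precision_mm=2.0, needs_force_control=True)
level = classify(usb_insertion)
print(f"USB insertion: L{level}, demos {DEMO_RANGE[level]}")
```

Checking the hardest criteria first means a task that matches several levels is assigned the highest one, which matches how the framework treats a bimanual task as hard regardless of how simple each arm's motion is.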

Task Taxonomy Reference Table

Task	Difficulty Level	Typical Demos	Algorithm	Notes
Top-down bin pick (uniform)	L1	200–500	BC	Fixed SKU, structured environment
Mixed bin picking	L2	500–2,000	ACT	Variable object pose, clutter
Bottle cap screwing	L3	2,000–8,000	ACT	Force + rotation coordination
Drawer open + retrieve	L3	3,000–10,000	ACT / Hierarchical	Two sub-tasks chained
USB insertion	L4	10,000–30,000	Diffusion-T	Precision ±2 mm
PCB component place	L4	20,000–50,000	Diffusion-T + wrist cam	±0.5 mm precision
T-shirt folding	L5	50,000+ or foundation model fine-tuning	π0 / Diffusion	Deformable, bimanual
Cable routing through clips	L5	30,000–80,000	Custom	High variance, contact-rich

Factors That Raise Difficulty

Precision requirement: Each tenfold tightening of the required precision (10 mm → 1 mm → 0.1 mm) roughly doubles the data requirement and may also force a change of algorithm.
Occlusion: When the robot arm or gripper occludes the camera view during critical action phases, data requirements increase significantly. Wrist cameras partially mitigate this.
Two-hand coordination: Any task requiring synchronized dual-arm motion is automatically L4/L5, because the combined state and action space of two arms is far larger than that of a single arm and the two motions must be timed against each other.
State variability: If the environment changes between demonstrations (different drawer position, different object count in bin), the data requirement grows with the dimensionality of the variability.
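The precision rule of thumb above (data roughly doubles per order of magnitude) can be turned into a quick estimator. This is a sketch of that heuristic only: the function name is illustrative, and interpolating smoothly between whole orders of magnitude is an added assumption.

```python
import math

def precision_data_multiplier(baseline_mm, target_mm):
    """Data multiplier when tightening precision from baseline_mm to target_mm.

    Applies 'roughly 2x per tenfold tightening'; loosening precision returns 1.0.
    """
    if baseline_mm <= 0 or target_mm <= 0:
        raise ValueError("precision values must be positive")
    decades = math.log10(baseline_mm / target_mm)
    return 2.0 ** max(decades, 0.0)

# Going from a 10 mm tolerance to 1 mm, and from 10 mm to 0.1 mm.
one_decade = precision_data_multiplier(10.0, 1.0)
two_decades = precision_data_multiplier(10.0, 0.1)
print(f"10 mm -> 1 mm:   x{one_decade:.1f}")
print(f"10 mm -> 0.1 mm: x{two_decades:.1f}")
```

The multiplier covers only the precision factor. Occlusion, bimanual coordination, and state variability each add their own cost, so treat the result as a lower bound on the extra data needed.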

Use this scale when scoping projects with the SVRC data services team. For tasks at L3–L5, we recommend a paid scoping consultation before committing to a large data collection program.