Why a Difficulty Scale Matters
The single most common failure mode in robot learning projects is misaligned expectations: a team allocates 2 months and 500 demonstrations for a task that requires 6 months and 5,000 demonstrations. These failures are largely preventable with better upfront task characterization.
The difficulty scale below is designed to answer three questions before any data collection begins: How much data does this task require? Which algorithm should we use? What is a realistic timeline? It is based on empirical results from the manipulation learning literature and SVRC's own deployment experience.
The 5-Level Difficulty Framework
Level 1 — Unconstrained Top Grasp
This is the simplest class of manipulation tasks. The robot must grasp a single rigid object from a fixed or near-fixed position using a top-down approach. Conditions are forgiving: no contact constraints during approach, a large grasp tolerance (>5 mm), a single object, and a stable background.
- Example tasks: Picking a soup can from a conveyor, transferring a block from bin to tray, picking a mug from a table surface.
- Typical demonstrations needed: 200–500
- Recommended algorithm: Behavior Cloning (BC) or ACT (Action Chunking with Transformers) with H=50
- Expected success rate after training: 85–95%
- Key failure modes: Lighting changes, novel background, object off-center by >8 cm
Level 2 — Varied Grasps with Occlusion
Multiple valid grasp strategies, partial occlusion, varied object orientation, or modest background clutter. The robot must select from several viable approaches rather than executing a single canonical motion.
- Example tasks: Picking items from a cluttered bin, grasping a bottle from any orientation, pouring from a pitcher.
- Typical demonstrations needed: 500–2,000
- Recommended algorithm: ACT or Diffusion Policy
- Expected success rate: 75–88%
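- Key challenge: The policy must handle multi-modality. When several grasps are valid, it has to commit to one of them rather than average across them, because the average of two valid grasps is often not a valid grasp.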
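The averaging problem can be seen in a toy one-dimensional case. The sketch below is purely illustrative and assumes hypothetical data: half the demonstrations grasp from the left at offset -1.0 and half from the right at +1.0. A policy that outputs a single action and is trained with mean squared error is pulled toward the mean of its targets, which here is 0.0, a grasp no demonstrator ever used.

```python
from statistics import mean

# Hypothetical 1-D grasp offsets from demonstrations: two equally valid modes.
left_grasps = [-1.0] * 50
right_grasps = [1.0] * 50
demos = left_grasps + right_grasps


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return mean((prediction - t) ** 2 for t in targets)


# The constant prediction that minimizes MSE is the mean of the targets.
averaged_action = mean(demos)

# Distance from the averaged action to the nearest demonstrated grasp.
gap_to_nearest_mode = min(abs(averaged_action - d) for d in demos)

print(f"averaged action:            {averaged_action}")
print(f"MSE at averaged action:     {mse(averaged_action, demos)}")
print(f"MSE at the left mode:       {mse(-1.0, demos)}")
print(f"gap to nearest valid grasp: {gap_to_nearest_mode}")
```

The averaged action has the lowest regression loss yet sits a full unit away from every demonstrated grasp. That is why this level calls for policies that can represent more than one mode.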
Level 3 — Multi-Step Tasks and Tool Use
Tasks requiring two or more sequential manipulation steps, tool use (scissors, screwdriver), or reorientation of an object between grasp phases. Error compounding across steps is the dominant challenge.
- Example tasks: Open a drawer then retrieve an item, use a spatula to flip food, screw a cap onto a bottle.
- Typical demonstrations needed: 2,000–10,000
- Recommended algorithm: ACT with temporal ensemble, or hierarchical policy
- Expected success rate: 60–80%
- Key challenge: Per-step success rates compound. Two steps at 85% each give roughly 72% full-task success (0.85 × 0.85 ≈ 0.72).
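The compounding arithmetic is easy to check for longer chains. The sketch below assumes each step succeeds independently of the others, which is a simplification (real failures are often correlated); the function names are illustrative only.

```python
import math


def chained_success(per_step_rates):
    """Full-task success rate, assuming every step succeeds independently."""
    return math.prod(per_step_rates)


def required_per_step(target, n_steps):
    """Per-step rate needed for n equal, independent steps to reach target."""
    return target ** (1.0 / n_steps)


two_step = chained_success([0.85, 0.85])
print(f"two steps at 85% each: {two_step:.2%}")
print(f"per-step rate for 80% over 3 steps: {required_per_step(0.80, 3):.2%}")
```

Running the inverse calculation during scoping shows how quickly the per-step target rises as steps are added, which helps when deciding whether a task should be split into separately trained sub-policies.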
Level 4 — Contact-Rich Assembly
Tasks requiring precision contact and force control: peg insertion, snap-fit assembly, connector mating, drawer/door manipulation. Position precision requirements <3 mm and precise force control are characteristic of this level.
- Example tasks: USB connector insertion, PCB component placement, peg-in-hole (±1 mm tolerance), assembly with snap fits.
- Typical demonstrations needed: 10,000–50,000
- Recommended algorithm: Diffusion Policy (Transformer), ACT with wrist camera
- Expected success rate: 50–75%
- Key challenge: Camera resolution becomes the limiting factor. One pixel spans roughly 0.5–1 mm at typical working distances, which is comparable to the required precision. A wrist camera is close to mandatory.
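The pixel footprint can be estimated with a simple pinhole-camera approximation. The sketch below is a rough estimate under stated assumptions: the camera parameters (1280-pixel image width, 70 degree horizontal field of view, 500 mm and 100 mm working distances) are hypothetical values chosen for illustration, not measurements of any particular setup.

```python
import math


def mm_per_pixel(distance_mm, hfov_deg, width_px):
    """Approximate width of one pixel's footprint on a fronto-parallel plane.

    Pinhole model: scene width = 2 * d * tan(HFOV / 2), divided by width_px.
    """
    scene_width_mm = 2.0 * distance_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return scene_width_mm / width_px


# Hypothetical cameras with the same sensor and lens,
# differing only in distance to the workpiece.
overhead = mm_per_pixel(distance_mm=500.0, hfov_deg=70.0, width_px=1280)
wrist = mm_per_pixel(distance_mm=100.0, hfov_deg=70.0, width_px=1280)

print(f"overhead camera: {overhead:.3f} mm per pixel")
print(f"wrist camera:    {wrist:.3f} mm per pixel")
```

Because the footprint scales linearly with distance, moving the camera five times closer shrinks the per-pixel footprint by the same factor. This is the main reason a wrist-mounted camera helps at this level.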
Level 5 — Deformable Objects and Bimanual Coordination
The hardest class of tasks: deformable materials (cloth, dough, cables), bimanual coordination requiring synchronized dual-arm trajectories, or tasks with very long horizons (>60 seconds) combining multiple L3/L4 subtasks.
- Example tasks: Cloth folding, cable routing, surgical suturing, sandwich assembly with spreading.
- Typical demonstrations needed: 50,000+, or foundation model fine-tuning
- Recommended algorithm: π0 (Physical Intelligence) for bimanual; fine-tuned OpenVLA for long-horizon; custom diffusion for deformables
- Expected success rate: 40–65%
- Key challenge: Deformable state space is infinite-dimensional. Current robot learning approaches approximate rather than solve these tasks.
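For quick sanity checks during scoping, the five levels can be encoded as a small lookup table. The sketch below is one possible encoding, not an SVRC tool: the `LevelSpec` class and helper functions are hypothetical names, while the demonstration ranges and success rates are copied from the level descriptions above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LevelSpec:
    name: str
    demos: Tuple[int, Optional[int]]  # (low, high); high=None means open-ended
    success: Tuple[float, float]      # expected success-rate range


LEVELS = {
    1: LevelSpec("Unconstrained Top Grasp", (200, 500), (0.85, 0.95)),
    2: LevelSpec("Varied Grasps with Occlusion", (500, 2_000), (0.75, 0.88)),
    3: LevelSpec("Multi-Step Tasks and Tool Use", (2_000, 10_000), (0.60, 0.80)),
    4: LevelSpec("Contact-Rich Assembly", (10_000, 50_000), (0.50, 0.75)),
    5: LevelSpec("Deformable Objects and Bimanual Coordination",
                 (50_000, None), (0.40, 0.65)),
}


def budget_covers_level(level, planned_demos):
    """True if the planned demo count reaches the low end of the level's range."""
    return planned_demos >= LEVELS[level].demos[0]


def highest_level_for(planned_demos):
    """Highest level whose minimum demo count fits the budget, or None."""
    feasible = [lvl for lvl in LEVELS if budget_covers_level(lvl, planned_demos)]
    return max(feasible) if feasible else None


print(highest_level_for(500))
print(budget_covers_level(4, 5_000))
```

A budget that only reaches the low end of a range leaves no margin, so a check like this is best treated as a first filter rather than a plan.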
Task Taxonomy Reference Table
| Task | Difficulty Level | Typical Demos | Algorithm | Notes |
|---|---|---|---|---|
| Top-down bin pick (uniform) | L1 | 200–500 | BC | Fixed SKU, structured environment |
| Mixed bin picking | L2 | 500–2,000 | ACT | Variable object pose, clutter |
| Bottle cap screwing | L3 | 2,000–8,000 | ACT | Force + rotation coordination |
| Drawer open + retrieve | L3 | 3,000–10,000 | ACT / Hierarchical | Two sub-tasks chained |
| USB insertion | L4 | 10,000–30,000 | Diffusion Policy (Transformer) | ±2 mm precision |
| PCB component place | L4 | 20,000–50,000 | Diffusion Policy (Transformer) + wrist cam | ±0.5 mm precision |
| T-shirt folding | L5 | 50,000+ or foundation model fine-tuning | π0 / Diffusion | Deformable, bimanual |
| Cable routing through clips | L5 | 30,000–80,000 | Custom | High variance, contact-rich |
Factors That Raise Difficulty
- Precision requirement: Each order-of-magnitude tightening of the required tolerance (10 mm → 1 mm → 0.1 mm) roughly doubles the data requirement and may force a change of algorithm.
- Occlusion: When the robot arm or gripper occludes the camera view during critical action phases, data requirements increase significantly. Wrist cameras partially mitigate this.
- Two-hand coordination: Any task requiring synchronized dual-arm motion lands at L4 or L5, because the joint state and action space of two arms is far larger than that of one arm and the two trajectories must stay tightly timed.
- State variability: If the environment changes between demonstrations (different drawer position, different object count in bin), the data requirement grows with the dimensionality of the variability.
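The precision rule of thumb above can be turned into a rough multiplier. The sketch below simply restates "doubles per order of magnitude" as a formula, multiplier = 2^log10(baseline / target). The function name and the `factor_per_decade` parameter are illustrative assumptions, and the result is a planning heuristic, not a fitted model.

```python
import math


def precision_multiplier(baseline_tol_mm, target_tol_mm, factor_per_decade=2.0):
    """Rough data multiplier when tightening tolerance from baseline to target.

    Applies factor_per_decade once per order of magnitude of tightening.
    """
    if baseline_tol_mm <= 0 or target_tol_mm <= 0:
        raise ValueError("tolerances must be positive")
    decades = math.log10(baseline_tol_mm / target_tol_mm)
    return factor_per_decade ** decades


for target in (10.0, 1.0, 0.1):
    print(f"10 mm -> {target} mm: x{precision_multiplier(10.0, target):.1f}")
```

The other factors in the list (occlusion, bimanual coordination, state variability) are not captured here and would stack on top of this estimate.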
Use this scale when scoping projects with the SVRC data services team. For tasks at L3–L5, we recommend a paid scoping consultation before committing to a large data collection program.