The Sample Efficiency Problem

How many demonstrations does it take to train a robot policy? The honest answer is: it depends — and the range spans four orders of magnitude. A simple single-object grasp can be learned from 50 demonstrations. Training a vision-language-action model from scratch requires 50,000 or more. Misjudging this before starting a data collection program leads to one of two failure modes: under-collection (policy never converges) or over-collection (budget wasted on redundant data).

This analysis synthesizes empirical results from published robot learning papers and SVRC's internal collection programs to give practical guidance on demo budgeting.

Empirical Data: Demos Required by Algorithm and Task

The following ranges come from published results with standardized evaluation protocols. "Success rate" refers to the primary task metric reported in each work.

| Algorithm | Task Type | Demos for ~70% SR | Demos for ~85% SR | Notes |
| --- | --- | --- | --- | --- |
| Behavioral Cloning (ResNet) | Simple single-object grasp | 50–100 | 150–200 | Plateaus early; limited generalization |
| ACT (Action Chunking) | Bimanual pick-and-place | 50–150 | 200–500 | Strong for precise, short-horizon tasks |
| Diffusion Policy | Precision assembly | 200–500 | 800–2000 | Best for multi-modal behavior |
| Foundation model fine-tune | Novel grasp variants | 20–100 | 200–500 | Requires pretrained visual encoder |
| OpenVLA fine-tune | Language-conditioned manipulation | 50–200 | 300–800 | Generalizes across objects with diverse data |
| From-scratch VLA | Broad manipulation suite | 20,000+ | 50,000+ | Not practical per-task; amortized across tasks |

Key takeaway: foundation model fine-tuning (starting from a pretrained VLA or visual encoder) is 3–10× more sample-efficient than training from scratch on the same task. If your task has any visual similarity to tasks covered by Open X-Embodiment or OXE-derived models, fine-tuning is almost always the right starting point.
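The ranges in the table above can be kept as a small lookup when sketching budgets. This is a minimal sketch; the dictionary keys and function name are illustrative, not an SVRC API, and the numbers are simply the table's L1 baseline ranges:

```python
# Demo ranges from the table above, keyed by (assumed) algorithm short names.
# Values: {target success rate (%): (low, high) demo count at L1 difficulty}.
DEMO_BUDGETS = {
    "bc_resnet":        {70: (50, 100),   85: (150, 200)},
    "act":              {70: (50, 150),   85: (200, 500)},
    "diffusion_policy": {70: (200, 500),  85: (800, 2000)},
    "foundation_ft":    {70: (20, 100),   85: (200, 500)},
    "openvla_ft":       {70: (50, 200),   85: (300, 800)},
}

def demo_range(algorithm: str, target_sr: int) -> tuple[int, int]:
    """Return the (low, high) L1 demo estimate for an algorithm/target pair."""
    return DEMO_BUDGETS[algorithm][target_sr]
```

For example, `demo_range("act", 85)` returns `(200, 500)`, the ACT row's 85% column.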

Task Difficulty Multipliers

The figures above assume Level 1 task difficulty. Real tasks are almost never Level 1. Apply these multipliers to estimate demo requirements for your specific task:

| Level | Task Description | Demo Multiplier | Example |
| --- | --- | --- | --- |
| L1 | Simple, fixed-position, single object | 1× (baseline) | Pick block from fixed location |
| L2 | Variable position/orientation, single object | 2–5× | Pick block from random position in 10cm radius |
| L3 | Multi-step, 2–3 subtasks | 5–20× | Pick block, place on target, press confirm button |
| L4 | Contact-rich, precision required | 20–100× | Peg insertion, drawer opening, plug connection |
| L5 | Deformable objects or bimanual dexterous | 100×+ | Fold cloth, bimanual tool use |

A task that looks like "simple grasping" but requires working across 50 different object types with varied geometry is an L2–L3 task, not L1. Most industrial manipulation tasks are L2–L3. Dexterous hand tasks are almost always L4–L5.
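Combining a baseline range with a difficulty multiplier is a one-line calculation, but it is easy to get wrong under pressure. A minimal sketch, assuming the multiplier ranges from the table above (the open-ended L5 upper bound of 500× is my assumption, since the table only says "100×+"):

```python
# (low, high) demo multipliers per difficulty level, from the table above.
DIFFICULTY_MULTIPLIER = {
    "L1": (1, 1),
    "L2": (2, 5),
    "L3": (5, 20),
    "L4": (20, 100),
    "L5": (100, 500),  # upper bound is an assumption; the table says "100x+"
}

def scaled_demo_estimate(base_low: int, base_high: int, level: str) -> tuple[int, int]:
    """Scale an L1 baseline demo range to a harder difficulty level."""
    m_low, m_high = DIFFICULTY_MULTIPLIER[level]
    return base_low * m_low, base_high * m_high
```

For instance, ACT's 50–150 demo baseline for ~70% SR becomes 250–3,000 demos if the task is actually L3, which is why misclassifying an L3 task as L1 wrecks a budget.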

Algorithm Efficiency Comparison

| Algorithm | Relative Sample Efficiency | Inference Speed | Best For |
| --- | --- | --- | --- |
| BC-RNN | Low (1×) | Fast (< 5ms) | Simple tasks, constrained compute |
| ACT | High (5–10×) | Medium (10–20ms) | Precise bimanual, short-horizon |
| Diffusion Policy | Medium (3–7×) | Slow (50–200ms) | Multi-modal, contact-rich |
| π0 / foundation model fine-tune | Very high (10–20×) | Slow (100–500ms) | Broad generalization, novel objects |
| From-scratch CNN-MLP | Very low (0.5×) | Very fast (< 2ms) | Constrained robots, simple tasks only |

Practical Budgeting Guide

A systematic approach to demo budgeting before committing to a full collection program:

  • Step 1 — Pilot collection: Collect exactly 200 demonstrations. This is the minimum viable dataset for any non-trivial task with a modern algorithm.
  • Step 2 — Train and measure: Train your target algorithm on the 200 demos. Measure success rate on a held-out evaluation set of at least 50 trials.
  • Step 3 — Learning curve extrapolation: If 200 demos yields S% success rate and your target is T%, fit a logarithmic curve to estimate the total demos needed. A rough rule: from the 200-demo baseline, each doubling of the dataset closes roughly 60–70% of the remaining gap to the task's empirical maximum.
  • Step 4 — Algorithm reassessment: If your 200-demo success rate is below 40%, consider switching algorithms or adding a pretrained visual encoder before investing in more data. A weak signal at 200 demos usually indicates an algorithm-task mismatch, not a data quantity problem.
  • Step 5 — Budget for the flywheel: Plan initial collection at 50–60% of estimated total requirement. Reserve the remaining budget for failure-targeted collection after initial deployment. Failure-targeted data is 2–5× more efficient than random collection for closing the final performance gap.
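The extrapolation in Step 3 can be sketched directly. This is a rough model, not a fitted curve: the assumed empirical maximum (`max_sr=0.95`) and the fraction of the gap closed per doubling (`gap_closed=0.65`, the midpoint of the 60–70% rule) are assumptions you should replace with your own pilot measurements:

```python
def estimate_total_demos(pilot_n: int, pilot_sr: float, target_sr: float,
                         max_sr: float = 0.95, gap_closed: float = 0.65) -> int:
    """Estimate total demos needed via the Step 3 doubling rule:
    each dataset doubling closes ~60-70% of the remaining gap to max_sr.
    max_sr and gap_closed defaults are assumptions, not measured values.
    """
    if target_sr >= max_sr:
        raise ValueError("target must be below the assumed empirical maximum")
    if pilot_sr >= target_sr:
        return pilot_n
    n, sr = pilot_n, pilot_sr
    while sr < target_sr:
        n *= 2                           # double the dataset
        sr += gap_closed * (max_sr - sr) # close ~65% of the remaining gap
    return n
```

Under these assumptions, a 200-demo pilot at 40% success with an 85% target extrapolates to roughly 800 total demos; per Step 4, though, a 40% pilot result should first trigger an algorithm reassessment.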

Diminishing Returns: The Knee of the Learning Curve

Every learning curve has a "knee" — the point beyond which marginal returns drop sharply. In our empirical data across 30+ collection programs, the knee consistently appears at 70–80% of the task's empirical maximum success rate.

Below the knee, each new batch of 100 demonstrations produces a 3–8% improvement in success rate. Above the knee, each 5% improvement requires 3–10× as many demonstrations as the previous 5% improvement.

This has a direct implication for program management: if your success rate is above 85% and your target is 90%, you are operating above the knee. Budget accordingly — expect to need 500–1,500 targeted demonstrations to close that 5-point gap, not 100.
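The above-the-knee budgeting rule can be sketched as a back-of-the-envelope calculator. The parameter defaults are assumed midpoints of the ranges in this section (100 demos per 3–8 points below the knee ≈ 20 demos/point; a 3–10× cost escalation above the knee ≈ 5×), not measured constants:

```python
def targeted_demo_budget(current_sr: float, target_sr: float,
                         knee_sr: float = 0.80,
                         demos_per_point: float = 20,
                         above_knee_factor: float = 5) -> int:
    """Rough demo budget to move success rate from current_sr to target_sr.
    Defaults are assumed midpoints of this section's rules of thumb.
    """
    points = (target_sr - current_sr) * 100
    # per-point cost escalates once you are operating above the knee
    cost_per_point = demos_per_point * (above_knee_factor if current_sr >= knee_sr else 1)
    return round(points * cost_per_point)
```

With these midpoint assumptions, closing 85% → 90% costs about 500 demos; pushing the escalation factor toward its 10× upper end lands in the 1,000–1,500 range, consistent with the 500–1,500 estimate above.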

For tasks where 90%+ is required (safety-critical manipulation, medical device assembly), the cost of the final 5–10% of performance is frequently higher than the cost of reaching 85%. Plan for this from the start.

SVRC's data collection services include learning curve estimation as part of every program kickoff. Our Fearless Platform tracks your learning curve in real time, automatically flagging when you cross the knee and activating failure-targeted collection mode.