The Sample Efficiency Problem
How many demonstrations does it take to train a robot policy? The honest answer is: it depends — and the range spans three orders of magnitude. A simple single-object grasp can be learned from 50 demonstrations. Training a vision-language-action model from scratch requires 50,000 or more. Misjudging this before starting a data collection program leads to one of two failure modes: under-collection (the policy never converges) or over-collection (budget wasted on redundant data).
This analysis synthesizes empirical results from published robot learning papers and SVRC's internal collection programs to give practical guidance on demo budgeting.
Empirical Data: Demos Required by Algorithm and Task
The following ranges come from published results with standardized evaluation protocols. "Success rate" refers to the primary task metric reported in each work.
| Algorithm | Task Type | Demos for ~70% SR | Demos for ~85% SR | Notes |
|---|---|---|---|---|
| Behavioral Cloning (ResNet) | Simple single-object grasp | 50–100 | 150–200 | Plateaus early; limited generalization |
| ACT (Action Chunking) | Bimanual pick-and-place | 50–150 | 200–500 | Strong for precise, short-horizon tasks |
| Diffusion Policy | Precision assembly | 200–500 | 800–2000 | Best for multi-modal behavior |
| Foundation model fine-tune | Novel grasp variants | 20–100 | 200–500 | Requires pretrained visual encoder |
| OpenVLA fine-tune | Language-conditioned manipulation | 50–200 | 300–800 | Generalizes across objects with diverse data |
| From-scratch VLA | Broad manipulation suite | 20,000+ | 50,000+ | Not practical per-task; amortized across tasks |
Key takeaway: foundation model fine-tuning (starting from a pretrained VLA or visual encoder) is 3–10× more sample efficient than training from scratch on the same task. If your task has any visual similarity to tasks covered by Open X-Embodiment or OXE-derived models, fine-tuning is almost always the right starting point.
Task Difficulty Multipliers
The figures above assume Level 1 task difficulty. Real tasks are almost never Level 1. Apply these multipliers to estimate demo requirements for your specific task:
| Level | Task Description | Demo Multiplier | Example |
|---|---|---|---|
| L1 | Simple, fixed-position, single object | 1× | Pick block from fixed location |
| L2 | Variable position/orientation, single object | 2–5× | Pick block from random position in 10cm radius |
| L3 | Multi-step, 2–3 subtasks | 5–20× | Pick block, place on target, press confirm button |
| L4 | Contact-rich, precision required | 20–100× | Peg insertion, drawer opening, plug connection |
| L5 | Deformable objects or bimanual dexterous | 100×+ | Fold cloth, bimanual tool use |
A task that looks like "simple grasping" but requires working across 50 different object types with varied geometry is an L2–L3 task, not L1. Most industrial manipulation tasks are L2–L3. Dexterous hand tasks are almost always L4–L5.
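As a rough sketch, the Level 1 baselines from the algorithm table can be combined with these multipliers to bound a demo budget. The ranges in the example are illustrative values read off the tables above, and `estimate_demos` is a hypothetical helper, not a calibrated model:

```python
def estimate_demos(baseline, multiplier):
    """Bound the likely demo requirement for a task.

    baseline   -- (lo, hi) demos for the target SR on an L1 version
                  of the task (from the algorithm table)
    multiplier -- (lo, hi) difficulty multiplier (from the level table)
    """
    return (baseline[0] * multiplier[0], baseline[1] * multiplier[1])

# Example: foundation model fine-tune (20-100 demos at L1 for ~70% SR)
# on an L2 task (2-5x multiplier).
lo, hi = estimate_demos((20, 100), (2, 5))
print(f"Plan for roughly {lo}-{hi} demos")  # roughly 40-500
```

The wide output range is the point: it tells you whether you are planning a one-afternoon pilot or a multi-week program before any hardware is booked.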
Algorithm Efficiency Comparison
| Algorithm | Relative Sample Efficiency | Inference Speed | Best For |
|---|---|---|---|
| BC-RNN | Low (1×) | Fast (< 5ms) | Simple tasks, constrained compute |
| ACT | High (5–10×) | Medium (10–20ms) | Precise bimanual, short-horizon |
| Diffusion Policy | Medium (3–7×) | Slow (50–200ms) | Multi-modal, contact-rich |
| π0 / foundation model fine-tune | Very high (10–20×) | Slow (100–500ms) | Broad generalization, novel objects |
| From-scratch CNN-MLP | Very low (0.5×) | Very fast (< 2ms) | Constrained robots, simple tasks only |
Practical Budgeting Guide
A systematic approach to demo budgeting before committing to a full collection program:
- Step 1 — Pilot collection: Collect exactly 200 demonstrations. This is the minimum viable dataset for any non-trivial task with a modern algorithm.
- Step 2 — Train and measure: Train your target algorithm on the 200 demos. Measure success rate on a held-out evaluation set of at least 50 trials.
- Step 3 — Learning curve extrapolation: If 200 demos yield S% success rate and your target is T%, fit a saturating learning curve to estimate the total demos needed. A rough rule: from the 200-demo baseline, each doubling of the dataset closes roughly 60–70% of the remaining gap to the empirical maximum.
- Step 4 — Algorithm reassessment: If your 200-demo success rate is below 40%, consider switching algorithms or adding a pretrained visual encoder before investing in more data. A weak signal at 200 demos usually indicates an algorithm-task mismatch, not a data quantity problem.
- Step 5 — Budget for the flywheel: Plan initial collection at 50–60% of estimated total requirement. Reserve the remaining budget for failure-targeted collection after initial deployment. Failure-targeted data is 2–5× more efficient than random collection for closing the final performance gap.
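The doubling rule in Step 3 can be turned into a back-of-envelope extrapolation. This is a sketch under the stated assumptions: `r` is the fraction of the remaining gap closed per dataset doubling (0.6–0.7 per Step 3), and the empirical maximum `m` must itself be estimated:

```python
import math

def demos_for_target(s, t, m, pilot=200, r=0.65):
    """Estimate total demos to reach target success rate t (%),
    given pilot success rate s (%) measured on `pilot` demos.

    Assumes each doubling of the dataset closes fraction r of the
    remaining gap to the empirical maximum m (the Step 3 rule).
    """
    if s >= t:
        return pilot
    if t >= m:
        raise ValueError("target exceeds assumed empirical maximum")
    # Remaining gap after d doublings: (m - s) * (1 - r)**d.
    # Solve (m - s) * (1 - r)**d <= (m - t) for the smallest integer d.
    d = math.ceil(math.log((m - t) / (m - s)) / math.log(1 - r))
    return pilot * 2 ** d

# Example: the pilot hits 55% SR, the target is 85%,
# and the empirical maximum is assumed to be 95%.
print(demos_for_target(55, 85, 95))  # 800
```

Treat the output as an order-of-magnitude planning number, not a commitment; re-fit the curve as real checkpoints come in.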
Diminishing Returns: The Knee of the Learning Curve
Every learning curve has a "knee" — the inflection point beyond which marginal returns drop sharply. In our empirical data across 30+ collection programs, the knee consistently appears at 70–80% of the task's empirical maximum success rate.
Below the knee, each new batch of 100 demonstrations produces a 3–8% improvement in success rate. Above the knee, each 5% improvement requires 3–10× as many demonstrations as the previous 5% improvement.
This has a direct implication for program management: if your success rate is above 85% and your target is 90%, you are operating above the knee. Budget accordingly — expect to need 500–1,500 targeted demonstrations to close that 5-point gap, not 100.
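The 3–10× rule above the knee gives a quick bound on that targeted-collection budget. A minimal sketch, assuming you know what the previous 5-point improvement cost:

```python
def above_knee_budget(last_gain_demos, factor_lo=3, factor_hi=10):
    """Bound the demos needed for the next 5-point SR gain above the
    knee, given the demos spent on the previous 5-point gain."""
    return (last_gain_demos * factor_lo, last_gain_demos * factor_hi)

# Example: the 80% -> 85% push took 150 targeted demos,
# so budget this range for 85% -> 90%:
print(above_knee_budget(150))  # (450, 1500)
```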
For tasks where 90%+ is required (safety-critical manipulation, medical device assembly), the cost of the final 5–10% of performance is frequently higher than the cost of reaching 85%. Plan for this from the start.
SVRC's data collection services include learning curve estimation as part of every program kickoff. Our Fearless Platform tracks your learning curve in real time, automatically flagging when you cross the knee and activating failure-targeted collection mode.