The Sample Efficiency Problem
How many demonstrations does it take to train a robot policy? The honest answer is: it depends — and the range spans three orders of magnitude. A simple single-object grasp can be learned from 50 demonstrations. Training a vision-language-action model from scratch requires 50,000 or more. Misjudging this before starting a data collection program leads to one of two failure modes: under-collection (the policy never converges) or over-collection (budget wasted on redundant data).
This analysis synthesizes empirical results from published robot learning papers and SVRC's internal collection programs to give practical guidance on demo budgeting.
Empirical Data: Demos Required by Algorithm and Task
The following ranges come from published results with standardized evaluation protocols. "Success rate" refers to the primary task metric reported in each work.
| Algorithm | Task Type | Demos for ~70% SR | Demos for ~85% SR | Notes |
|---|---|---|---|---|
| Behavioral Cloning (ResNet) | Simple single-object grasp | 50–100 | 150–200 | Plateaus early; limited generalization |
| ACT (Action Chunking) | Bimanual pick-and-place | 50–150 | 200–500 | Strong for precise, short-horizon tasks |
| Diffusion Policy | Precision assembly | 200–500 | 800–2000 | Best for multi-modal behavior |
| Foundation model fine-tune | Novel grasp variants | 20–100 | 200–500 | Requires pretrained visual encoder |
| OpenVLA fine-tune | Language-conditioned manipulation | 50–200 | 300–800 | Generalizes across objects with diverse data |
| From-scratch VLA | Broad manipulation suite | 20,000+ | 50,000+ | Not practical per-task; amortized across tasks |
Key takeaway: foundation model fine-tuning (starting from a pretrained VLA or visual encoder) is 3–10× more sample efficient than training from scratch on the same task. If your task has any visual similarity to tasks covered by Open X-Embodiment or OXE-derived models, fine-tuning is almost always the right starting point.
Task Difficulty Multipliers
The figures above assume Level 1 task difficulty. Real tasks are almost never Level 1. Apply these multipliers to estimate demo requirements for your specific task:
| Level | Task Description | Demo Multiplier | Example |
|---|---|---|---|
| L1 | Simple, fixed-position, single object | 1× | Pick block from fixed location |
| L2 | Variable position/orientation, single object | 2–5× | Pick block from random position in 10cm radius |
| L3 | Multi-step, 2–3 subtasks | 5–20× | Pick block, place on target, press confirm button |
| L4 | Contact-rich, precision required | 20–100× | Peg insertion, drawer opening, plug connection |
| L5 | Deformable objects or bimanual dexterous | 100×+ | Fold cloth, bimanual tool use |
A task that looks like "simple grasping" but requires working across 50 different object types with varied geometry is an L2–L3 task, not L1. Most industrial manipulation tasks are L2–L3. Dexterous hand tasks are almost always L4–L5.
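As a rough sketch, the Level 1 baselines from the algorithm table can be combined with these multipliers to bound a demo budget. The ranges in the example are illustrative values read off the tables above, and `estimate_demos` is a hypothetical helper, not a calibrated model:

```python
def estimate_demos(baseline, multiplier):
    """Bound the likely demo requirement for a task.

    baseline   -- (lo, hi) demos for the target SR on an L1 version
                  of the task (from the algorithm table)
    multiplier -- (lo, hi) difficulty multiplier (from the level table)
    """
    return (baseline[0] * multiplier[0], baseline[1] * multiplier[1])

# Example: foundation model fine-tune (20-100 demos at L1 for ~70% SR)
# on an L2 task (2-5x multiplier).
lo, hi = estimate_demos((20, 100), (2, 5))
print(f"Plan for roughly {lo}-{hi} demos")  # roughly 40-500
```

The wide output range is the point: it tells you whether you are planning a one-afternoon pilot or a multi-week program before any hardware is booked.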
Algorithm Efficiency Comparison
| Algorithm | Relative Sample Efficiency | Inference Speed | Best For |
|---|---|---|---|
| BC-RNN | Low (1×) | Fast (< 5ms) | Simple tasks, constrained compute |
| ACT | High (5–10×) | Medium (10–20ms) | Precise bimanual, short-horizon |
| Diffusion Policy | Medium (3–7×) | Slow (50–200ms) | Multi-modal, contact-rich |
| π0 / foundation model fine-tune | Very high (10–20×) | Slow (100–500ms) | Broad generalization, novel objects |
| From-scratch CNN-MLP | Very low (0.5×) | Very fast (< 2ms) | Constrained robots, simple tasks only |
Practical Budgeting Guide
A systematic approach to demo budgeting before committing to a full collection program:
- Step 1 — Pilot collection: Collect exactly 200 demonstrations. This is the minimum viable dataset for any non-trivial task with a modern algorithm.
- Step 2 — Train and measure: Train your target algorithm on the 200 demos. Measure success rate on a held-out evaluation set of at least 50 trials.
- Step 3 — Learning curve extrapolation: If 200 demos yield S% success rate and your target is T%, fit a saturating learning curve to estimate the total demos needed. A rough rule: from the 200-demo baseline, each doubling of the dataset closes roughly 60–70% of the remaining gap to the empirical maximum.
- Step 4 — Algorithm reassessment: If your 200-demo success rate is below 40%, consider switching algorithms or adding a pretrained visual encoder before investing in more data. A weak signal at 200 demos usually indicates an algorithm-task mismatch, not a data quantity problem.
- Step 5 — Budget for the flywheel: Plan initial collection at 50–60% of estimated total requirement. Reserve the remaining budget for failure-targeted collection after initial deployment. Failure-targeted data is 2–5× more efficient than random collection for closing the final performance gap.
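The doubling rule in Step 3 can be turned into a back-of-envelope extrapolation. This is a sketch under the stated assumptions: `r` is the fraction of the remaining gap closed per dataset doubling (0.6–0.7 per Step 3), and the empirical maximum `m` must itself be estimated:

```python
import math

def demos_for_target(s, t, m, pilot=200, r=0.65):
    """Estimate total demos to reach target success rate t (%),
    given pilot success rate s (%) measured on `pilot` demos.

    Assumes each doubling of the dataset closes fraction r of the
    remaining gap to the empirical maximum m (the Step 3 rule).
    """
    if s >= t:
        return pilot
    if t >= m:
        raise ValueError("target exceeds assumed empirical maximum")
    # Remaining gap after d doublings: (m - s) * (1 - r)**d.
    # Solve (m - s) * (1 - r)**d <= (m - t) for the smallest integer d.
    d = math.ceil(math.log((m - t) / (m - s)) / math.log(1 - r))
    return pilot * 2 ** d

# Example: the pilot hits 55% SR, the target is 85%,
# and the empirical maximum is assumed to be 95%.
print(demos_for_target(55, 85, 95))  # 800
```

Treat the output as an order-of-magnitude planning number, not a commitment; re-fit the curve as real checkpoints come in.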
Diminishing Returns: The Knee of the Learning Curve
Every learning curve has a "knee" — the inflection point beyond which marginal returns drop sharply. In our empirical data across 30+ collection programs, the knee consistently appears at 70–80% of the task's empirical maximum success rate.
Below the knee, each new batch of 100 demonstrations produces a 3–8% improvement in success rate. Above the knee, each 5% improvement requires 3–10× as many demonstrations as the previous 5% improvement.
This has a direct implication for program management: if your success rate is above 85% and your target is 90%, you are operating above the knee. Budget accordingly — expect to need 500–1,500 targeted demonstrations to close that 5-point gap, not 100.
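The 3–10× rule above the knee gives a quick bound on that targeted-collection budget. A minimal sketch, assuming you know what the previous 5-point improvement cost:

```python
def above_knee_budget(last_gain_demos, factor_lo=3, factor_hi=10):
    """Bound the demos needed for the next 5-point SR gain above the
    knee, given the demos spent on the previous 5-point gain."""
    return (last_gain_demos * factor_lo, last_gain_demos * factor_hi)

# Example: the 80% -> 85% push took 150 targeted demos,
# so budget this range for 85% -> 90%:
print(above_knee_budget(150))  # (450, 1500)
```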
For tasks where 90%+ is required (safety-critical manipulation, medical device assembly), the cost of the final 5–10% of performance is frequently higher than the cost of reaching 85%. Plan for this from the start.
SVRC's data collection services include learning curve estimation as part of every program kickoff. Our Fearless Platform tracks your learning curve in real time, automatically flagging when you cross the knee and activating failure-targeted collection mode.