Definitions

Zero-shot means executing a new task or handling a new object with no new demonstrations: the robot relies entirely on what it learned during pre-training. Few-shot means adapting to a new task with a small number of new demonstrations, typically tens to around a hundred. These are meaningfully different capabilities, and conflating them leads to unrealistic planning.

The Current Reality for Zero-Shot

Foundation models for robot manipulation — OpenVLA, Octo, RT-2 — have demonstrated genuine zero-shot capability on simple tasks within their training distribution. The results that have been validated and reproduced across multiple labs:

  • Open-vocabulary object detection + simple top-down grasp planning works zero-shot for ~60% of common household objects presented in upright orientation on a clear surface
  • Language-conditioned navigation in previously explored environments works zero-shot with ~75% success in structured settings
  • Pick-and-place with familiar object categories (mugs, bottles, blocks) achieves 50-65% success zero-shot with OpenVLA on standard benchmarks
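
The first bullet's pipeline (open-vocabulary detection feeding a simple top-down grasp planner) can be sketched as below. The `Detection` structure and `plan_top_down_grasp` helper are hypothetical illustrations of the data flow, not the actual DETIC or GraspNet APIs:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical output of an open-vocabulary detector."""
    label: str
    confidence: float
    bbox: tuple  # (x_min, y_min, x_max, y_max) in workspace metres

def plan_top_down_grasp(det: Detection, min_confidence: float = 0.5):
    """Return a naive top-down grasp at the bbox centre, or None.

    Assumes the object is upright on a clear surface -- the regime
    where zero-shot grasping works for roughly 60% of household objects.
    """
    if det.confidence < min_confidence:
        return None
    x = (det.bbox[0] + det.bbox[2]) / 2
    y = (det.bbox[1] + det.bbox[3]) / 2
    return {"x": x, "y": y, "approach": "top_down"}

mug = Detection("mug", 0.91, (0.10, 0.20, 0.18, 0.28))
print(plan_top_down_grasp(mug))  # grasp centred on the detected mug
```

The confidence gate matters in practice: low-confidence detections are exactly the cases (non-canonical poses, clutter) where zero-shot grasping fails.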

Where zero-shot reliably fails: precision tasks requiring sub-5mm placement, dexterous manipulation, novel tool use, tasks where the object pose is non-canonical (tilted bottles, stacked cups), and any task involving deformable objects not well-represented in training data.

Few-Shot Fine-Tuning: The More Practical Capability

Few-shot fine-tuning (20-100 demonstrations on a new task) is where foundation models show their clearest practical value. The comparison that matters:

Training Approach                      Demos Required   Typical Success Rate    Time to Train
Foundation model, zero-shot            0                30–65% (simple tasks)   N/A
Foundation model + 20-demo fine-tune   20               70–80%                  30 min GPU
Foundation model + 100-demo fine-tune  100              80–90%                  2 hr GPU
Train from scratch (ACT)               500              75–88%                  3–4 hr GPU
Train from scratch (Diffusion)         1000             82–92%                  8–12 hr GPU

The foundation model advantage is most pronounced in the low-data regime (under 100 demonstrations). With 20 demonstrations, a fine-tuned foundation model achieves success rates comparable to training ACT from scratch on 500 demonstrations. That is a 25× data efficiency improvement — which translates directly to cost and time savings.
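
The 25× figure follows directly from the table rows:

```python
# Data-efficiency comparison from the table above: a 20-demo
# fine-tune reaches roughly the same success band (70-80%) as
# ACT trained from scratch on 500 demos (75-88%).
demos_finetune = 20
demos_act_scratch = 500
efficiency_gain = demos_act_scratch / demos_finetune
print(f"{efficiency_gain:.0f}x fewer demonstrations")  # prints "25x fewer demonstrations"
```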

Where Zero-Shot Actually Works Today

  • Structured picking with clear object detection: DETIC + GraspNet-style open-vocabulary detection + simple grasp planner works zero-shot for regular objects in organized bins. This is production-ready today for e-commerce and logistics.
  • Language-conditioned navigation in known spaces: VLN (Vision-Language Navigation) models work zero-shot in spaces they were trained to understand, with good generalization to same-layout spaces in different buildings.
  • Object recognition and sorting by category: Language-conditioned sorting ("put the red items in the left bin") works zero-shot for known categories with RGB classification.
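
A toy version of the colour-conditioned sorting rule in the last bullet. The threshold-based colour classifier and bin names are illustrative assumptions, stand-ins for a real vision model's per-object classification:

```python
# Toy sketch of language-conditioned sorting by colour, as in
# "put the red items in the left bin".

def classify_colour(rgb):
    """Crude RGB heuristic; a real system uses a learned classifier."""
    r, g, b = rgb
    if r > 150 and r > 1.5 * max(g, b):
        return "red"
    return "other"

def assign_bin(rgb, command_colour="red"):
    """Map each detected item's colour to a bin per the language command."""
    return "left" if classify_colour(rgb) == command_colour else "right"

items = [(200, 40, 30), (40, 180, 60), (220, 60, 50)]
print([assign_bin(c) for c in items])  # prints "['left', 'right', 'left']"
```

The point is that the manipulation itself is a known pick-and-place primitive; only the perception and the language-to-bin mapping need to generalise, which is why this works zero-shot.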

Where It Does Not (Yet)

  • Contact-rich manipulation: Peg-in-hole, snap connectors, folding fabric, unstacking cups — zero-shot success rates are 10-30% for current foundation models. Not reliable for production.
  • Novel tool use: Using an unfamiliar tool (a can opener, a specific screwdriver) zero-shot is not yet reliable. Few-shot (20-50 demos) works.
  • Dexterous manipulation: In-hand re-grasping, rotation of objects using finger control — outside current zero-shot capability for all production models.

Practical Guidance

Plan for 200-500 demonstrations for any new task, even with foundation models. Zero-shot performance is a bonus to be measured, not a baseline to be assumed. If zero-shot achieves 60%+ success on your task, consider yourself ahead of schedule. If it achieves 30%, proceed with your planned fine-tuning data collection.
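
The guidance above reduces to a simple triage rule: measure zero-shot success on a pilot, then set the demonstration budget. The thresholds follow this section; the specific budgets returned are illustrative planning defaults, not validated numbers:

```python
# Triage rule sketched from the guidance above. Thresholds (60%, 30%)
# come from the text; the returned demo budgets are assumed defaults.

def demo_budget(zero_shot_success: float) -> int:
    """Recommend how many demonstrations to collect for a new task."""
    if zero_shot_success >= 0.6:
        return 50    # ahead of schedule: a small fine-tune may suffice
    if zero_shot_success >= 0.3:
        return 200   # proceed with the planned fine-tuning collection
    return 500       # weak zero-shot signal: plan the full budget

print(demo_budget(0.65), demo_budget(0.35), demo_budget(0.10))  # prints "50 200 500"
```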

The SVRC data services team can assess your specific task for zero-shot viability and recommend a realistic demonstration budget before you commit to a collection timeline.