Why Evaluation Is Hard
Manipulation policy evaluation is deceptively difficult. A policy can achieve 95% success on the exact objects and lighting conditions it was trained on, yet fail completely when a single variable changes — a phenomenon called lab overfitting. The policy memorizes the test environment rather than learning to generalize.
The core problem is that researchers control too many variables during both training and evaluation. The same camera position, the same table height, the same objects in the same orientations — every constant becomes a crutch. When you report "95% success rate," readers assume generalization, but you may have measured memorization.
Rigorous evaluation requires intentional held-out conditions: objects never seen during training, lighting never used during data collection, table heights deliberately different from the training setup. If your policy degrades significantly under these conditions, you have measured memorization, not capability.
Evaluation Dimensions
A complete manipulation evaluation covers three primary dimensions:
- Success rate — both binary (task complete / not complete) and partial credit. Binary success is easy to compute but throws away information. A policy that reliably grasps correctly but drops during transfer deserves a different score than one that never achieves a stable grasp. Define partial credit rubrics before running trials.
- Efficiency — time to task completion (seconds), number of human interventions required, and number of re-grasp attempts. For deployment, a policy that succeeds in 45 seconds beats one that takes 3 minutes. Interventions are especially important: a policy requiring 2 interventions per task is not deployable even at 90% success.
- Robustness — variance across trials (coefficient of variation), performance under novel conditions (new objects, new lighting, new clutter), and graceful degradation. A policy with 80% mean success but 40% standard deviation is less useful than one with 75% mean and 5% standard deviation.
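A partial-credit rubric like the one described above can be encoded directly so scoring is fixed before any trial runs. The phase names and weights below are illustrative assumptions, not a prescribed standard:

```python
# Sketch of a partial-credit rubric for pick-and-place, defined before trials run.
# Phase names and weights are illustrative assumptions, not a fixed standard.
RUBRIC = {
    "reach": 0.1,      # end-effector reaches pre-grasp pose
    "grasp": 0.3,      # stable grasp achieved
    "transfer": 0.3,   # object moved without dropping
    "place": 0.3,      # object released inside target zone
}

def partial_credit(phases_completed):
    """Weighted score in [0, 1] from the set of completed phase names."""
    unknown = set(phases_completed) - RUBRIC.keys()
    if unknown:
        raise ValueError(f"unknown phases: {unknown}")
    return sum(RUBRIC[p] for p in phases_completed)

# A trial that grasps reliably but drops during transfer scores 0.4,
# not the 0.0 a binary metric would assign.
score = partial_credit({"reach", "grasp"})
```

Defining the weights up front prevents post-hoc rubric adjustment once results are in.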
Test Set Design
Test set design is where most evaluation protocols fail. A rigorous minimum for a pick-and-place task:
3 novel lighting conditions × 2 table heights × 5 novel objects × 3 distractor configurations = 90 test conditions minimum.
Novel lighting means illumination setups not used during any training data collection — overhead fluorescent, directional LED, diffuse ambient. Novel objects means items with similar affordances but different visual appearance: if you trained on a red cup, test with a blue mug, a transparent glass, and a metal can. Distractor configurations test whether the policy attends to the target object or gets confused by nearby items.
For each test condition, run at least 5 trials to estimate per-condition success rate. This gives you 450 total trials for the 90-condition matrix. Log every trial — failures are as informative as successes.
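The full condition matrix can be generated programmatically so no cell is silently skipped. The condition labels below are placeholders for your actual setups:

```python
import itertools

# Placeholder condition labels -- substitute your actual setups.
lightings = ["overhead_fluorescent", "directional_led", "diffuse_ambient"]
table_heights_cm = [72, 80]
objects = ["blue_mug", "clear_glass", "metal_can", "foam_block", "ceramic_bowl"]
distractors = ["none", "sparse", "dense"]

# Cartesian product yields every cell of the 3 x 2 x 5 x 3 matrix.
conditions = list(itertools.product(lightings, table_heights_cm, objects, distractors))
TRIALS_PER_CONDITION = 5

n_conditions = len(conditions)                      # 90
n_trials = n_conditions * TRIALS_PER_CONDITION      # 450
```

Iterating over `conditions` to drive the trial schedule also gives you a natural key for logging each trial against its exact cell.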
Metrics Table
| Metric | Definition | How to Measure | Deployment Threshold |
|---|---|---|---|
| Binary success rate | Task completed without intervention (%) | Human observer scores each trial | ≥ 80% on novel objects |
| Partial credit score | Weighted sub-task completion (0–1) | Rubric-based scoring per phase | ≥ 0.85 mean score |
| Time to completion | Wall-clock seconds from start signal | Automated timer, log per trial | ≤ 2× human baseline |
| Intervention rate | Human resets per 100 trials | Count interventions per session | ≤ 5 per 100 trials |
| Novel object transfer | Success on held-out object set (%) | Separate test set, never in training | ≥ 60% (generalization) |
| Robustness variance | Coefficient of variation of success across conditions | Std dev / mean of per-condition success rates | CV ≤ 0.20 |
Statistical Requirements
A single trial proves nothing. The minimum trial count for statistically meaningful results:
N = 100 trials for a ±10 percentage point confidence interval on success rate (95% confidence, binomial distribution). With N=20 trials — the most common number in published papers — your confidence interval is ±22 points, making 70% and 92% indistinguishable.
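The quoted interval widths follow from the normal approximation to the binomial at the worst case p = 0.5; a quick check:

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a binomial proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case p = 0.5: N=100 gives roughly +/-10 points, N=20 roughly +/-22.
width_100 = round(100 * ci_half_width(0.5, 100), 1)  # 9.8 percentage points
width_20 = round(100 * ci_half_width(0.5, 20), 1)    # 21.9 percentage points
```

For small N or success rates near 0 or 1, an exact or Wilson interval is more appropriate than this approximation.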
For algorithm comparison (e.g., "our method vs. ACT baseline"), use a paired t-test across matched trial conditions. Run both algorithms under identical conditions on the same day, in the same environment. Report p-value, effect size (Cohen's d), and confidence intervals — not just "our method is better."
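A minimal sketch of such a comparison with SciPy, on made-up per-condition success rates (the numbers are illustrative only):

```python
import math
from scipy import stats

# Illustrative per-condition success rates, matched by test condition.
ours     = [0.80, 0.90, 0.70, 0.60]
baseline = [0.70, 0.75, 0.60, 0.55]

# Paired t-test: each pair comes from the same condition, same day.
t_stat, p_value = stats.ttest_rel(ours, baseline)

# Cohen's d for paired samples: mean difference over SD of differences.
diffs = [a - b for a, b in zip(ours, baseline)]
mean_d = sum(diffs) / len(diffs)
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1))
cohens_d = mean_d / sd_d

print(f"t={t_stat:.2f}, p={p_value:.4f}, d={cohens_d:.2f}")
```

With real data you would have one entry per matched condition (90 in the matrix above), not four; report all three numbers alongside the raw success rates.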
For multi-condition evaluations, use a mixed-effects model with condition as a random effect. This accounts for systematic variation across conditions rather than treating all trials as independent.
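One hedged sketch of this with statsmodels, on synthetic per-trial logs: fitting a linear mixed model to a 0/1 success column is a rough linear-probability approximation (a logistic GLMM is more principled but heavier to fit), and the per-condition base rates below are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic per-trial log: 6 conditions x 10 trials, success coded 0/1.
# Base rates per condition are assumed values for this sketch only.
rng = np.random.default_rng(0)
rows = []
for cond in range(6):
    base = 0.6 + 0.05 * cond
    for _ in range(10):
        rows.append({"condition": f"c{cond}",
                     "success": int(rng.random() < base)})
df = pd.DataFrame(rows)

# Random intercept per condition accounts for systematic between-condition
# variation instead of treating all 60 trials as independent.
model = smf.mixedlm("success ~ 1", df, groups=df["condition"])
result = model.fit()
overall_rate_estimate = result.fe_params["Intercept"]
```

The fixed-effect intercept estimates the overall success rate while the condition-level random effect absorbs per-condition shifts.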
Environmental Controls
Physics is not constant. Small environmental changes affect repeatability more than most researchers assume:
- Camera position log — mark exact mount points with tape or fixed brackets. A 2cm camera shift changes the input distribution enough to degrade policy performance measurably.
- Lighting log — record lux levels at the workspace surface before each session. Natural light through windows changes illumination by 5000–10000 lux across a day. Use blackout curtains or only run trials under artificial lighting.
- Temperature log — rubber grasp surfaces change stiffness with temperature. At 18°C vs. 28°C, the same gripper finger produces measurably different contact forces. Log ambient temperature and report it.
- Object placement protocol — define exact placement zones (e.g., "object center within 5cm radius of marked target, random orientation"). Controlled random placement introduces measurable variance; having no placement protocol at all makes trials incomparable.
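The placement rule in the last bullet can be enforced in software rather than by eye. The 5 cm radius and uniform yaw below follow the example protocol; the target coordinates are arbitrary:

```python
import math
import random

def sample_placement(target_xy, radius_m=0.05, rng=random):
    """Sample an object pose uniformly inside a disk around the marked target,
    with a uniformly random yaw. r = R * sqrt(u) makes the sample area-uniform."""
    r = radius_m * math.sqrt(rng.random())
    theta = rng.uniform(0, 2 * math.pi)
    x = target_xy[0] + r * math.cos(theta)
    y = target_xy[1] + r * math.sin(theta)
    yaw = rng.uniform(0, 2 * math.pi)
    return (x, y, yaw)

# Example: placement for a target marked at (0.40 m, 0.10 m) on the table.
x, y, yaw = sample_placement((0.40, 0.10))
```

Logging the sampled pose per trial also lets you check afterward whether failures cluster at particular offsets or orientations.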
Common Evaluation Mistakes
- Testing on training objects — the most common mistake. If any test object appeared in training data (same SKU, same color), your success rate measures memorization. Maintain a strict held-out object set from the start of data collection.
- Single lighting condition — nearly all published manipulation results use a single lighting setup. This makes policies appear far more robust than they are in practice.
- Fewer than 20 trials — statistically meaningless. A policy that succeeded 4/5 times and one that succeeded 14/20 times have overlapping confidence intervals. You cannot distinguish them.
- Cherry-picked demonstrations — selecting which trials to report, or stopping early when results look good. Pre-register your evaluation protocol before running trials.
- Evaluator bias — the researcher running trials knows which condition is the "better" algorithm. Use blinded evaluation where possible, or at minimum automate success detection.
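The trial-count point above is easy to verify numerically: Wilson score intervals for 4/5 and 14/20 overlap substantially, so the two results cannot be distinguished.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

lo_a, hi_a = wilson_interval(4, 5)    # roughly (0.38, 0.96)
lo_b, hi_b = wilson_interval(14, 20)  # roughly (0.48, 0.85)
overlap = max(lo_a, lo_b) < min(hi_a, hi_b)
print(overlap)  # True: the two results are statistically indistinguishable
```

The Wilson interval is preferred over the normal approximation here because both samples are small and the proportions are far from 0.5.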
Reporting Standards
A complete manipulation evaluation report includes: (1) full test set description with object photos and condition matrix, (2) trial count per condition, (3) confidence intervals on all reported metrics, (4) statistical test results for comparisons, (5) environmental conditions logged, (6) video of representative success and failure cases, (7) failure mode taxonomy.
Share your evaluation data and code. Reproducibility in manipulation research is poor; providing raw trial logs and evaluation scripts lets others build on your work rather than re-implementing from scratch.
The SVRC data platform provides structured evaluation logging, automated trial scoring, and standardized reporting templates for manipulation policy benchmarking.