The Lab-to-Production Gap

An 80% success rate in the lab does not mean 80% success in production. This is the single most important lesson in robot policy deployment, and it surprises nearly every team the first time.

Production environments differ from the lab in ways that are invisible until they cause failures: novel lighting conditions (different times of day, seasonal light angle, overhead fixture replacement), wear-induced drift (joint backlash increases after 50,000 cycles, gripper pad wear changes grasp mechanics), object variation (supplier changes product packaging slightly, objects arrive in non-canonical orientations), and context drift (the workspace gets slightly reorganized, a background object is moved). Each of these individually degrades policy performance by 5–20%. Combined, they can drag an 80% lab policy down to 40% in production within 3 months.

The solution is not better lab performance — it is building systems that detect degradation, fail safely, and recover automatically.

Pre-Deployment Checklist

Before any policy enters production, it must pass a structured evaluation in conditions designed to probe generalization:

  • 3 novel lighting conditions: Test with overhead only, natural light + overhead, and desk lamp positioned differently from training. The policy must achieve ≥70% success in each condition.
  • 5 novel object positions: Place target objects at positions not seen during training, including near workspace boundaries. Any position within the declared workspace boundary must be handled.
  • 10 distractor objects: Add objects not present during training to the workspace. A well-trained policy should maintain ≥85% of its base success rate with distractors present.
  • 100-trial consecutive evaluation: Run 100 trials autonomously overnight. This catches intermittent failures (jammed gripper after 30 cycles, thermal throttling after 45 minutes) that short evaluations miss. Target ≥85% over 100 trials to enter production.
  • Edge case scenarios: Explicitly test: object slightly outside nominal pose, gripper partially occluded, arm joint at near-limit position, camera partially obstructed.
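The checklist thresholds above can be encoded as a single pass/fail gate. The sketch below is illustrative: the `EvalResults` schema and field names are assumptions, not a real harness, but the thresholds match the checklist (≥70% per lighting condition, ≥85% of base rate with distractors, ≥85% over 100 trials).

```python
from dataclasses import dataclass

@dataclass
class EvalResults:
    """Aggregated pre-deployment evaluation results (hypothetical schema)."""
    lighting_success: dict[str, float]   # condition name -> success rate
    base_success: float                  # success rate without distractors
    distractor_success: float            # success rate with distractors present
    consecutive_trial_success: float     # success rate over 100 autonomous trials

def passes_gate(r: EvalResults) -> bool:
    """Apply the checklist thresholds; all must hold to enter production."""
    return (
        all(rate >= 0.70 for rate in r.lighting_success.values())
        and r.distractor_success >= 0.85 * r.base_success
        and r.consecutive_trial_success >= 0.85
    )

results = EvalResults(
    lighting_success={"overhead": 0.82, "natural+overhead": 0.78, "desk_lamp": 0.74},
    base_success=0.88,
    distractor_success=0.80,
    consecutive_trial_success=0.86,
)
print(passes_gate(results))  # True: 0.80 >= 0.85 * 0.88 = 0.748, all lighting >= 0.70
```

Encoding the gate this way makes the decision auditable: a reviewer can see exactly which threshold a candidate policy failed.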

Model Serving Infrastructure

Inference latency directly affects robot control rate. For a policy running at 10 Hz (100 ms control period), your inference must complete in <80 ms to leave margin for communication overhead.

  • TorchServe: Deploy PyTorch policies as a model archive (.mar). Provides HTTP and gRPC inference endpoints, batching, model versioning, and metrics. Suitable for policies with >50 ms inference time where a dedicated model server is warranted.
  • TensorRT: Convert your policy to a TensorRT engine for 3–5× inference speedup on NVIDIA GPUs. An ACT policy that takes 80 ms in PyTorch typically runs in 18–25 ms with TensorRT FP16. Use trtexec --onnx=policy.onnx --saveEngine=policy.trt --fp16 to convert.
  • Latency target: p99 inference latency must be <100 ms. p99 (99th percentile) matters more than mean because the 1% worst-case latency determines your control loop's worst-case jitter. Profile with torch.profiler under simulated production load.
  • Health check endpoint: Expose a GET /health endpoint that runs a dummy inference pass and returns 200 OK with latency measurement. The robot controller should poll this endpoint at startup and reject deployment if p99 >100 ms.
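The health-check decision reduces to a p99 computation over recent dummy-inference latencies. This is a minimal sketch of that gate logic only (no HTTP server); the nearest-rank percentile method and the `health_status` name are assumptions.

```python
import math

def p99(latencies_ms: list[float]) -> float:
    """99th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered)) - 1  # 0-based index
    return ordered[max(rank, 0)]

def health_status(latencies_ms: list[float], budget_ms: float = 100.0) -> tuple[int, float]:
    """Return (HTTP status, measured p99): 200 if under budget, 503 otherwise."""
    measured = p99(latencies_ms)
    return (200 if measured < budget_ms else 503), measured

# 100 dummy-inference samples: 98 fast passes and two slow outliers.
samples = [20.0] * 98 + [120.0, 130.0]
status, latency = health_status(samples)  # p99 = 120 ms -> 503, reject deployment
```

Note why p99 (not mean) drives the decision: the mean of these samples is well under budget, yet the tail would still cause control-loop jitter.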

Monitoring Strategy

A policy in production without monitoring is a time bomb. Build monitoring from day one, not after the first incident.

  • Per-episode success rate: Log success/failure for every episode. Track 7-day rolling average. Alert when the rolling average drops >5% from the deployment baseline.
  • Failure classification: When an episode fails, classify the failure mode: grasp failure, placement failure, collision, timeout, or other. Different failure modes indicate different root causes and different fixes.
  • Telemetry logging: Log joint positions, velocities, forces, policy confidence scores, and inference latency for every episode. Store for 90 days minimum. This data is essential for root cause analysis and retraining.
  • Human review queue: Flag every failed episode for human review within 24 hours. A 5-minute human review per failure catches systematic issues (new object variant, mounting drift) before they cascade.
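The rolling-average alert can be sketched as a small monitor class. For simplicity this version windows by episode count rather than by a 7-day clock (an assumption; a production monitor would window on timestamps), and the 5%-below-baseline trigger matches the rule above.

```python
from collections import deque
from statistics import mean

class SuccessMonitor:
    """Track per-episode outcomes; alert when the rolling rate drops >5% below baseline."""

    def __init__(self, baseline: float, window: int = 500):
        self.baseline = baseline
        self.window = deque(maxlen=window)  # rolling window of recent episodes

    def record(self, success: bool) -> bool:
        """Log one episode; return True if the rolling rate triggers an alert."""
        self.window.append(1.0 if success else 0.0)
        return mean(self.window) < self.baseline - 0.05

mon = SuccessMonitor(baseline=0.85, window=10)
alerts = [mon.record(s) for s in [True] * 7 + [False] * 3]
# Rolling rate falls to 0.70 by the last episode, below 0.80 -> alert fires.
```

Pairing each alert with the failure-classification labels from the list above turns a bare "success rate dropped" page into an actionable root-cause hypothesis.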

Graceful Degradation

A production robot must fail safely. The worst outcome is silent failure — a robot that continues operating while producing bad outputs.

  • Confidence score threshold: Many policies output a confidence or certainty estimate alongside the action. If confidence <0.7, pause the robot and alert the operator before proceeding. This prevents catastrophic grasps in novel situations the policy is not confident about.
  • Pause and alert: When a pause trigger fires, move the arm to a safe home position, turn on a visual indicator (red status light), and send an alert via the platform to the operator's dashboard and mobile device.
  • Fallback to teleop: For high-value or high-risk tasks, implement a teleop fallback where a remote operator takes control via a VR headset or web interface when the policy triggers a pause. The operator completes the episode manually, and the data is logged for retraining.
  • Maximum consecutive failure limit: If 5 consecutive episodes fail, automatically suspend the policy and escalate to a senior operator. Do not let a failing policy cycle indefinitely.
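The degradation rules above form a small state machine: check confidence before every action, count consecutive failures after every episode, suspend when the limit is hit. A minimal sketch, with the thresholds from the list (0.7 confidence, 5 consecutive failures) as defaults; the class and method names are illustrative.

```python
class DegradationGuard:
    """Pause/suspend logic for a production policy (sketch)."""

    def __init__(self, conf_threshold: float = 0.7, max_consecutive_failures: int = 5):
        self.conf_threshold = conf_threshold
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.suspended = False

    def check_action(self, confidence: float) -> str:
        """Call before executing each policy action."""
        if self.suspended:
            return "suspended"          # escalated; awaiting senior operator
        if confidence < self.conf_threshold:
            return "pause_and_alert"    # home the arm, light the indicator, page operator
        return "execute"

    def record_episode(self, success: bool) -> None:
        """Call after each episode; suspend after too many consecutive failures."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.max_failures:
            self.suspended = True
```

Keeping this logic outside the policy itself means a misbehaving model cannot disable its own safety rails.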

Version Management

Treat policy versions like software releases — with staged rollouts and rollback capability.

  • A/B testing: When deploying a new policy version, route 10% of tasks to the new version and 90% to the current production version. Compare success rates over 200+ episodes before full rollout. This requires task routing logic in your platform dashboard.
  • Canary rollout: After A/B testing shows improvement, roll out to 25% → 50% → 100% of traffic at weekly intervals, with automated rollback if success rate drops >5% at any stage.
  • Rollback procedure: Maintain the last 3 production policy versions as deployable artifacts. Rollback to the previous version must be executable in <5 minutes, ideally via a single button in the fleet dashboard.
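One common way to implement the 10%/90% split is deterministic hash-based routing, so a given task always hits the same version (useful for debugging and fair comparison). A sketch under that assumption; the version labels are illustrative.

```python
import hashlib

def route_version(task_id: str, canary_fraction: float = 0.10) -> str:
    """Deterministically route a task to the new ('canary') or current
    ('production') policy version by hashing its ID into 100 buckets."""
    bucket = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_fraction * 100 else "production"

# Roughly 10% of task IDs land on the canary version.
share = sum(route_version(f"task-{i}") == "canary" for i in range(10_000)) / 10_000
```

Raising `canary_fraction` to 0.25, 0.50, and 1.0 implements the staged canary rollout described above without touching the routing code.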

Retraining Triggers

Retraining is not a one-time event — it is a continuous process driven by data from production.

  • >5% success rate drop: Investigate root cause. If caused by distributional shift (new object, changed workspace), collect 50–200 demonstrations covering the new conditions and fine-tune.
  • New task variants: When the business introduces a new SKU, product variant, or workflow change, trigger a data collection campaign before the variant reaches production volume.
  • Quarterly refresh: Even without a specific trigger, retrain quarterly incorporating all production failure episodes. This prevents gradual drift accumulation.
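The three triggers are easy to evaluate mechanically. This sketch returns every trigger that fired, so the retraining ticket records its reasons; the function signature and reason strings are assumptions made for illustration.

```python
from datetime import date, timedelta

def needs_retraining(baseline: float, rolling: float,
                     new_variant_pending: bool,
                     last_retrain: date, today: date) -> list[str]:
    """Evaluate the three retraining triggers; return the reasons that fired."""
    reasons = []
    if baseline - rolling > 0.05:           # >5% success rate drop
        reasons.append("success_rate_drop")
    if new_variant_pending:                 # new SKU / workflow change
        reasons.append("new_task_variant")
    if today - last_retrain >= timedelta(days=90):  # quarterly refresh
        reasons.append("quarterly_refresh")
    return reasons

print(needs_retraining(0.85, 0.78, False, date(2024, 1, 1), date(2024, 5, 1)))
# -> ['success_rate_drop', 'quarterly_refresh']
```

Run this check on the same schedule as the monitoring alerts so that a firing trigger opens a data-collection task automatically rather than waiting for someone to notice a dashboard.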

Incident Runbook Template

Phase       | Actions                                                                          | Owner            | Time Target
Detect      | Alert fires (success rate drop or consecutive failures)                          | Automated        | <5 min
Classify    | Review failure clips, classify failure mode                                      | On-call operator | <30 min
Contain     | Suspend affected policy, route tasks to manual/teleop                            | On-call operator | <15 min
Diagnose    | Identify root cause: hardware drift, distributional shift, infrastructure issue  | ML engineer      | <4 hr
Resolve     | Deploy fix: rollback, hotfix, or retrain                                         | ML engineer      | <24 hr
Post-mortem | Document cause, impact, fix, and prevention measures                             | Team lead        | <1 week