The Lab-to-Production Gap

An 80% success rate in the lab does not mean 80% success in production. This is the single most important lesson in robot policy deployment, and it surprises nearly every team the first time.

Production environments differ from the lab in ways that are invisible until they cause failures: novel lighting conditions (different times of day, seasonal light angle, overhead fixture replacement), wear-induced drift (joint backlash increases after 50,000 cycles, gripper pad wear changes grasp mechanics), object variation (supplier changes product packaging slightly, objects arrive in non-canonical orientations), and context drift (the workspace gets slightly reorganized, a background object is moved). Each of these individually degrades policy performance by 5–20%. Combined, they can drag an 80% lab policy down to 40% in production within 3 months.

The solution is not better lab performance — it is building systems that detect degradation, fail safely, and recover automatically.

Pre-Deployment Checklist

Before any policy enters production, it must pass a structured evaluation in conditions designed to probe generalization:

  • 3 novel lighting conditions: Test with overhead only, natural light + overhead, and desk lamp positioned differently from training. The policy must achieve ≥70% success in each condition.
  • 5 novel object positions: Place target objects at positions not seen during training, including near workspace boundaries. Any position within the declared workspace boundary must be handled.
  • 10 distractor objects: Add objects not present during training to the workspace. A well-trained policy should maintain ≥85% of its base success rate with distractors present.
  • 100-trial consecutive evaluation: Run 100 trials autonomously overnight. This catches intermittent failures (jammed gripper after 30 cycles, thermal throttling after 45 minutes) that short evaluations miss. Target ≥85% over 100 trials to enter production.
  • Edge case scenarios: Explicitly test: object slightly outside nominal pose, gripper partially occluded, arm joint at near-limit position, camera partially obstructed.
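The checklist thresholds above can be encoded as a single pass/fail gate. The sketch below is illustrative: the `EvalResults` schema and field names are assumptions, not a real harness, but the thresholds match the checklist (≥70% per lighting condition, ≥85% of base rate with distractors, ≥85% over 100 trials).

```python
from dataclasses import dataclass

@dataclass
class EvalResults:
    """Aggregated pre-deployment evaluation results (hypothetical schema)."""
    lighting_success: dict[str, float]   # condition name -> success rate
    base_success: float                  # success rate without distractors
    distractor_success: float            # success rate with distractors present
    consecutive_trial_success: float     # success rate over 100 autonomous trials

def passes_gate(r: EvalResults) -> bool:
    """Apply the checklist thresholds; all must hold to enter production."""
    return (
        all(rate >= 0.70 for rate in r.lighting_success.values())
        and r.distractor_success >= 0.85 * r.base_success
        and r.consecutive_trial_success >= 0.85
    )

results = EvalResults(
    lighting_success={"overhead": 0.82, "natural+overhead": 0.78, "desk_lamp": 0.74},
    base_success=0.88,
    distractor_success=0.80,
    consecutive_trial_success=0.86,
)
print(passes_gate(results))  # True: 0.80 >= 0.85 * 0.88 = 0.748, all lighting >= 0.70
```

Encoding the gate this way makes the decision auditable: a reviewer can see exactly which threshold a candidate policy failed.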

Model Serving Infrastructure

Inference latency directly affects robot control rate. For a policy running at 10 Hz (100 ms control period), your inference must complete in <80 ms to leave margin for communication overhead.

  • TorchServe: Deploy PyTorch policies as a model archive (.mar). Provides HTTP and gRPC inference endpoints, batching, model versioning, and metrics. Suitable for policies with >50 ms inference time where a dedicated model server is warranted.
  • TensorRT: Convert your policy to a TensorRT engine for 3–5× inference speedup on NVIDIA GPUs. An ACT policy that takes 80 ms in PyTorch typically runs in 18–25 ms with TensorRT FP16. Use trtexec --onnx=policy.onnx --saveEngine=policy.trt --fp16 to convert.
  • Latency target: p99 inference latency must be <100 ms. p99 (99th percentile) matters more than mean because the 1% worst-case latency determines your control loop's worst-case jitter. Profile with torch.profiler under simulated production load.
  • Health check endpoint: Expose a GET /health endpoint that runs a dummy inference pass and returns 200 OK with latency measurement. The robot controller should poll this endpoint at startup and reject deployment if p99 >100 ms.
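The health-check decision reduces to a p99 computation over recent dummy-inference latencies. This is a minimal sketch of that gate logic only (no HTTP server); the nearest-rank percentile method and the `health_status` name are assumptions.

```python
import math

def p99(latencies_ms: list[float]) -> float:
    """99th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered)) - 1  # 0-based index
    return ordered[max(rank, 0)]

def health_status(latencies_ms: list[float], budget_ms: float = 100.0) -> tuple[int, float]:
    """Return (HTTP status, measured p99): 200 if under budget, 503 otherwise."""
    measured = p99(latencies_ms)
    return (200 if measured < budget_ms else 503), measured

# 100 dummy-inference samples: 98 fast passes and two slow outliers.
samples = [20.0] * 98 + [120.0, 130.0]
status, latency = health_status(samples)  # p99 = 120 ms -> 503, reject deployment
```

Note why p99 (not mean) drives the decision: the mean of these samples is well under budget, yet the tail would still cause control-loop jitter.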

Monitoring Strategy

A policy in production without monitoring is a time bomb. Build monitoring from day one, not after the first incident.

  • Per-episode success rate: Log success/failure for every episode. Track 7-day rolling average. Alert when the rolling average drops >5% from the deployment baseline.
  • Failure classification: When an episode fails, classify the failure mode: grasp failure, placement failure, collision, timeout, or other. Different failure modes indicate different root causes and different fixes.
  • Telemetry logging: Log joint positions, velocities, forces, policy confidence scores, and inference latency for every episode. Store for 90 days minimum. This data is essential for root cause analysis and retraining.
  • Human review queue: Flag every failed episode for human review within 24 hours. A 5-minute human review per failure catches systematic issues (new object variant, mounting drift) before they cascade.
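The rolling-average alert can be sketched as a small monitor class. For simplicity this version windows by episode count rather than by a 7-day clock (an assumption; a production monitor would window on timestamps), and the 5%-below-baseline trigger matches the rule above.

```python
from collections import deque
from statistics import mean

class SuccessMonitor:
    """Track per-episode outcomes; alert when the rolling rate drops >5% below baseline."""

    def __init__(self, baseline: float, window: int = 500):
        self.baseline = baseline
        self.window = deque(maxlen=window)  # rolling window of recent episodes

    def record(self, success: bool) -> bool:
        """Log one episode; return True if the rolling rate triggers an alert."""
        self.window.append(1.0 if success else 0.0)
        return mean(self.window) < self.baseline - 0.05

mon = SuccessMonitor(baseline=0.85, window=10)
alerts = [mon.record(s) for s in [True] * 7 + [False] * 3]
# Rolling rate falls to 0.70 by the last episode, below 0.80 -> alert fires.
```

Pairing each alert with the failure-classification labels from the list above turns a bare "success rate dropped" page into an actionable root-cause hypothesis.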

Graceful Degradation

A production robot must fail safely. The worst outcome is silent failure — a robot that continues operating while producing bad outputs.

  • Confidence score threshold: Many policies output a confidence or certainty estimate alongside the action. If confidence <0.7, pause the robot and alert the operator before proceeding. This prevents catastrophic grasps in novel situations the policy is not confident about.
  • Pause and alert: When a pause trigger fires, move the arm to a safe home position, turn on a visual indicator (red status light), and send an alert via the platform to the operator's dashboard and mobile device.
  • Fallback to teleop: For high-value or high-risk tasks, implement a teleop fallback where a remote operator takes control via a VR headset or web interface when the policy triggers a pause. The operator completes the episode manually, and the data is logged for retraining.
  • Maximum consecutive failure limit: If 5 consecutive episodes fail, automatically suspend the policy and escalate to a senior operator. Do not let a failing policy cycle indefinitely.
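The degradation rules above form a small state machine: check confidence before every action, count consecutive failures after every episode, suspend when the limit is hit. A minimal sketch, with the thresholds from the list (0.7 confidence, 5 consecutive failures) as defaults; the class and method names are illustrative.

```python
class DegradationGuard:
    """Pause/suspend logic for a production policy (sketch)."""

    def __init__(self, conf_threshold: float = 0.7, max_consecutive_failures: int = 5):
        self.conf_threshold = conf_threshold
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.suspended = False

    def check_action(self, confidence: float) -> str:
        """Call before executing each policy action."""
        if self.suspended:
            return "suspended"          # escalated; awaiting senior operator
        if confidence < self.conf_threshold:
            return "pause_and_alert"    # home the arm, light the indicator, page operator
        return "execute"

    def record_episode(self, success: bool) -> None:
        """Call after each episode; suspend after too many consecutive failures."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.max_failures:
            self.suspended = True
```

Keeping this logic outside the policy itself means a misbehaving model cannot disable its own safety rails.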

Version Management

Treat policy versions like software releases — with staged rollouts and rollback capability.

  • A/B testing: When deploying a new policy version, route 10% of tasks to the new version and 90% to the current production version. Compare success rates over 200+ episodes before full rollout. This requires task routing logic in your platform dashboard.
  • Canary rollout: After A/B testing shows improvement, roll out to 25% → 50% → 100% of traffic at weekly intervals, with automated rollback if success rate drops >5% at any stage.
  • Rollback procedure: Maintain the last 3 production policy versions as deployable artifacts. Rollback to the previous version must be executable in <5 minutes, ideally via a single button in the fleet dashboard.
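One common way to implement the 10%/90% split is deterministic hash-based routing, so a given task always hits the same version (useful for debugging and fair comparison). A sketch under that assumption; the version labels are illustrative.

```python
import hashlib

def route_version(task_id: str, canary_fraction: float = 0.10) -> str:
    """Deterministically route a task to the new ('canary') or current
    ('production') policy version by hashing its ID into 100 buckets."""
    bucket = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_fraction * 100 else "production"

# Roughly 10% of task IDs land on the canary version.
share = sum(route_version(f"task-{i}") == "canary" for i in range(10_000)) / 10_000
```

Raising `canary_fraction` to 0.25, 0.50, and 1.0 implements the staged canary rollout described above without touching the routing code.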

Retraining Triggers

Retraining is not a one-time event — it is a continuous process driven by data from production.

  • >5% success rate drop: Investigate root cause. If caused by distributional shift (new object, changed workspace), collect 50–200 demonstrations covering the new conditions and fine-tune.
  • New task variants: When the business introduces a new SKU, product variant, or workflow change, trigger a data collection campaign before the variant reaches production volume.
  • Quarterly refresh: Even without a specific trigger, retrain quarterly incorporating all production failure episodes. This prevents gradual drift accumulation.
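The three triggers are easy to evaluate mechanically. This sketch returns every trigger that fired, so the retraining ticket records its reasons; the function signature and reason strings are assumptions made for illustration.

```python
from datetime import date, timedelta

def needs_retraining(baseline: float, rolling: float,
                     new_variant_pending: bool,
                     last_retrain: date, today: date) -> list[str]:
    """Evaluate the three retraining triggers; return the reasons that fired."""
    reasons = []
    if baseline - rolling > 0.05:           # >5% success rate drop
        reasons.append("success_rate_drop")
    if new_variant_pending:                 # new SKU / workflow change
        reasons.append("new_task_variant")
    if today - last_retrain >= timedelta(days=90):  # quarterly refresh
        reasons.append("quarterly_refresh")
    return reasons

print(needs_retraining(0.85, 0.78, False, date(2024, 1, 1), date(2024, 5, 1)))
# -> ['success_rate_drop', 'quarterly_refresh']
```

Run this check on the same schedule as the monitoring alerts so that a firing trigger opens a data-collection task automatically rather than waiting for someone to notice a dashboard.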

Incident Runbook Template

Phase       | Actions                                                                          | Owner            | Time Target
Detect      | Alert fires (success rate drop or consecutive failures)                          | Automated        | <5 min
Classify    | Review failure clips, classify failure mode                                      | On-call operator | <30 min
Contain     | Suspend affected policy, route tasks to manual/teleop                            | On-call operator | <15 min
Diagnose    | Identify root cause: hardware drift, distributional shift, infrastructure issue  | ML engineer      | <4 hr
Resolve     | Deploy fix: rollback, hotfix, or retrain                                         | ML engineer      | <24 hr
Post-mortem | Document cause, impact, fix, and prevention measures                             | Team lead        | <1 week