Latency Requirements by Task Type
Before choosing an inference architecture, you need a precise latency requirement for your task. The required latency is set by the robot's control frequency and the nature of the manipulation:
- Free-space motion (arm moving to a target): 50–100 ms inference latency is acceptable. The arm moves slowly during approach phases, and a policy query that is 100 ms stale does not cause positioning errors beyond the task tolerance.
- Contact-rich manipulation (insertion, assembly): 10–20 ms inference latency required. At the moment of contact, small state changes (0.5–1 mm position error) can cause task failure, and the policy must re-query frequently to stay on-distribution.
- Reactive grasping (moving object, conveyor): <50 ms required. The object's position changes between policy queries; stale queries lead to systematic misses.
- Safety-critical stop: <1 ms. This cannot depend on the policy inference loop at all — it must be handled by a dedicated safety controller on the robot.
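The budgets above can be captured as a simple lookup used to sanity-check a deployment before committing to hardware. A minimal sketch (budget values copied from this section; the function name is ours):

```python
# Latency budgets from the task-type breakdown above (illustrative;
# adjust to your robot's control frequency and task tolerances).
LATENCY_BUDGET_MS = {
    "free_space_motion": 100,  # slow approach moves tolerate stale queries
    "contact_rich": 20,        # insertion/assembly needs frequent re-query
    "reactive_grasping": 50,   # object moves between queries
    "safety_stop": 1,          # must bypass the policy loop entirely
}

def meets_budget(task_type: str, measured_latency_ms: float) -> bool:
    """True if a measured end-to-end inference latency fits the task budget."""
    return measured_latency_ms <= LATENCY_BUDGET_MS[task_type]

# An 80 ms edge policy is fine for free-space motion but too slow for insertion.
print(meets_budget("free_space_motion", 80))
print(meets_budget("contact_rich", 80))
```

Note that `safety_stop` appears in the table only to make the point explicit: no policy inference path should ever be on that code path.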
Edge Hardware Options
Edge inference hardware is co-located with the robot, eliminating network round-trip latency. The tradeoff is higher upfront cost and responsibility for local maintenance.
| Hardware | TOPS (INT8) | Price | Power | Form Factor | Best For |
|---|---|---|---|---|---|
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS | $499 (module) | 15–60W | Embedded module | Full policy inference at edge |
| NVIDIA Jetson Orin NX 16GB | 100 TOPS | $299 (module) | 10–25W | Embedded module | Smaller models, power-constrained |
| NVIDIA RTX 4090 (workstation) | ~1,300 TOPS | $1,600 | 450W TDP | Desktop PCIe | Large model inference, multiple robots |
| Intel NUC with Arc GPU | ~40 TOPS | $600–$900 | 35W | Mini PC | Simple BC policies, low cost |
| Raspberry Pi 5 | ~5 TOPS | $80 | 5–10W | SBC | MLP policies only, very simple tasks |
For most manipulation tasks running ACT or diffusion policy, the NVIDIA Jetson AGX Orin is the edge hardware of choice. Its 275 TOPS INT8 performance handles ACT inference in 30–80 ms and quantized diffusion in 20–50 ms.
Model Size vs. Inference Latency
| Model | Parameters | Edge (Jetson AGX) | Cloud (A100) | Quantized Edge |
|---|---|---|---|---|
| BC (MLP policy) | ~1M | 1–3 ms | <1 ms | 1–2 ms |
| ACT (original) | ~84M | 50–100 ms | 15–30 ms | 20–50 ms (FP16) |
| Diffusion Policy (U-Net) | ~100M | 30–80 ms | 10–30 ms | 20–50 ms |
| Diffusion Policy (Transformer) | ~300M | 200–500 ms | 80–200 ms | 100–250 ms |
| OpenVLA (7B) | 7B | 3–10 s | 200–500 ms | 1–3 s (INT4) |
| Octo (93M) | 93M | 30–100 ms | 15–40 ms | 20–60 ms |
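One way to use this table together with the latency requirements above is to convert a worst-case inference latency into the maximum synchronous re-query rate it supports. A minimal sketch (latency figures taken from the table; the helper name is ours):

```python
def max_query_rate_hz(worst_case_latency_ms: float) -> float:
    """Highest policy re-query rate sustainable when each query blocks on inference."""
    return 1000.0 / worst_case_latency_ms

# Worst-case edge (Jetson AGX) latencies from the table above.
act_edge_ms = 100        # ACT (original), unquantized
diffusion_unet_ms = 80   # Diffusion Policy (U-Net)

print(max_query_rate_hz(act_edge_ms))        # 10.0 Hz
print(max_query_rate_hz(diffusion_unet_ms))  # 12.5 Hz
```

At 10–12 Hz, unquantized ACT or U-Net diffusion on a Jetson AGX fits free-space motion but not the 50–100 Hz re-query rates implied by contact-rich budgets, which is exactly where quantization (next section) earns its keep.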
Quantization Strategies
Quantization lowers numerical precision to shrink the memory footprint and speed up inference, at some cost to accuracy:
- FP32 → FP16 (half precision): 2× memory reduction, <1–3% accuracy loss on most manipulation models. Supported natively on all NVIDIA GPUs since Pascal. Recommended as the default for edge deployment.
- FP16 → INT8: Further 2× memory reduction, 10–15% accuracy loss typical for manipulation policies. Acceptable for L1–L2 tasks; test carefully for L3+ precision tasks. Use NVIDIA TensorRT for Jetson-optimized INT8 inference.
- FP16 → INT4 (4-bit): 4× reduction vs. FP16. 15–25% accuracy loss. Primarily useful for LLM-based models (OpenVLA) where the language backbone represents most parameters. Use bitsandbytes (NF4 format) or GPTQ.
- TensorRT optimization: Beyond quantization, TensorRT fuses operations, optimizes memory layout, and compiles to optimized CUDA kernels for the target Jetson hardware. Often provides 2–4× additional speedup on top of FP16 quantization for convolutional models.
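The memory side of these tradeoffs is simple arithmetic: parameter count × bytes per parameter. A quick sketch using parameter counts from the model table above (weight storage only; activations and KV caches are ignored):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_mb(num_params: float, precision: str) -> float:
    """Approximate weight storage for a model at the given precision, in MB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6

# ACT (~84M params): FP32 -> FP16 halves the footprint.
print(weight_memory_mb(84e6, "fp32"))  # 336.0 MB
print(weight_memory_mb(84e6, "fp16"))  # 168.0 MB
# OpenVLA (7B params) at INT4 still needs ~3.5 GB for weights alone.
print(weight_memory_mb(7e9, "int4"))   # 3500.0 MB
```

This is why INT4 matters mainly for the 7B-class VLA models: the smaller policies already fit comfortably in Jetson memory at FP16.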
Cloud Inference: When It's Acceptable
Cloud inference is viable for specific use cases that can tolerate higher latency:
- Policy selection / task planning: A high-level planner that selects which low-level skill to execute next can tolerate 500 ms–2 s latency. This is a natural split point: run VLM-based planning in the cloud, execute low-level skills at the edge.
- Scene understanding: Semantic scene analysis (object classification, affordance estimation) for pre-grasp planning can run in the cloud if the robot pauses before initiating manipulation.
- Non-real-time teleoperation monitoring: A cloud-based monitoring system that watches robot operation and flags anomalies (without taking direct control) tolerates any latency.
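The planner/skill split described above can be expressed as a single routing rule: dispatch a request to the cloud only when its latency tolerance comfortably exceeds the expected round trip. A minimal sketch (the 150 ms round-trip figure and the function name are illustrative assumptions, not from this section):

```python
CLOUD_ROUND_TRIP_MS = 150  # assumed network + queueing overhead; measure your own

def route(request_kind: str, latency_tolerance_ms: float) -> str:
    """Send latency-tolerant work (planning, scene analysis, monitoring)
    to the cloud; keep real-time control queries on the edge device."""
    if latency_tolerance_ms > CLOUD_ROUND_TRIP_MS:
        return "cloud"
    return "edge"

print(route("task_planning", 2000))   # high-level VLM planning: cloud
print(route("low_level_skill", 20))   # contact-rich control: edge
```

The routing decision is made per request kind, not per robot, which is what makes the hybrid architecture below practical.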
Cost Comparison: 1-Year TCO
| Approach | Upfront | Ongoing/Year | 1-Year TCO | Notes |
|---|---|---|---|---|
| Jetson AGX Orin (1 robot) | $500–$800 | $0 (owned) | $800 | Carrier board adds $200–400 |
| RTX 4090 workstation (5 robots) | $3,000 | $0 (owned) | $3,000 | $600/robot amortized over 1 year |
| Cloud GPU (A100, on-demand) | $0 | $2,200–$4,400 | $2,200–$4,400 | $0.25–$0.50/hr × 8,760 hr (24/7) |
| Cloud GPU (reserved instance) | $0 | $800–$1,600 | $800–$1,600 | ~50% discount for 1-year reservation |
| Jetson + cloud hybrid | $500–$800 | $400–$800 | $900–$1,600 | Edge for real-time, cloud for training |
Edge hardware wins decisively for 24/7 operation. Cloud wins for development and fine-tuning cycles where GPU utilization is <40% (on-demand pricing). A hybrid approach — edge for production inference, cloud for periodic re-training — provides the best of both.
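The break-even point between owned edge hardware and on-demand cloud falls directly out of the table: divide the upfront edge cost by the cloud hourly rate. A rough sketch using the figures above:

```python
def break_even_hours(edge_upfront_usd: float, cloud_rate_usd_per_hr: float) -> float:
    """GPU-hours per year at which on-demand cloud cost matches owned edge hardware."""
    return edge_upfront_usd / cloud_rate_usd_per_hr

# Jetson AGX Orin (~$800 with carrier board) vs. A100 on-demand at $0.25-0.50/hr.
print(break_even_hours(800, 0.25))  # 3200.0 hours (~37% of the year's 8,760 hours)
print(break_even_hours(800, 0.50))  # 1600.0 hours (~18% of the year)
```

Above the break-even hours, edge hardware is cheaper; below it, on-demand cloud wins, which is where the <40% utilization rule of thumb comes from.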
The SVRC platform provides a cloud inference endpoint for development and evaluation, with edge deployment packaging (TensorRT, Jetson) for production.