Latency Requirements by Task Type

Before choosing an inference architecture, you need a precise latency requirement for your task. The required latency is set by the robot's control frequency and the nature of the manipulation:

  • Free-space motion (arm moving to a target): 50–100 ms inference latency is acceptable. The arm moves at low speed during approach phases, and a policy query that is 100 ms stale does not cause positioning errors beyond the task tolerance.
  • Contact-rich manipulation (insertion, assembly): 10–20 ms inference latency required. At the moment of contact, small state changes (0.5–1 mm position error) can cause task failure, and the policy must re-query frequently to stay on-distribution.
  • Reactive grasping (moving object, conveyor): <50 ms required. Object position is changing between policy queries; stale queries lead to systematic misses.
  • Safety-critical stop: <1 ms. This cannot depend on the policy inference loop at all — it must be handled by a dedicated safety controller on the robot.
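The budgets above can be captured in a small lookup used to validate a deployment against measured latency. A minimal sketch (the names and structure are illustrative, not part of any real API):

```python
# Latency budgets (ms) per task type, taken from the requirements above.
LATENCY_BUDGET_MS = {
    "free_space_motion": 100.0,
    "contact_rich": 20.0,
    "reactive_grasping": 50.0,
}

def meets_budget(task_type: str, measured_p99_ms: float) -> bool:
    """Check a measured p99 inference latency against the task's budget.

    Safety-critical stops are deliberately excluded: they must be handled
    by a dedicated safety controller, never the policy inference loop.
    """
    if task_type not in LATENCY_BUDGET_MS:
        raise ValueError(f"unknown task type: {task_type}")
    return measured_p99_ms <= LATENCY_BUDGET_MS[task_type]
```

Validating against the p99 (rather than the mean) matters because a single slow query during a contact event is enough to push the policy off-distribution.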

Edge Hardware Options

Edge inference hardware is co-located with the robot, eliminating network round-trip latency. The tradeoffs are higher upfront cost and local maintenance responsibility.

| Hardware | TOPS (INT8) | Price | Power | Form Factor | Best For |
|---|---|---|---|---|---|
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS | $499 (module) | 15–60 W | Embedded module | Full policy inference at edge |
| NVIDIA Jetson Orin NX 16GB | 100 TOPS | $299 (module) | 10–25 W | Embedded module | Smaller models, power-constrained |
| NVIDIA RTX 4090 (workstation) | ~1,600 TOPS (FP16) | $1,600 | 450 W TDP | Desktop PCIe | Large model inference, multiple robots |
| Intel NUC with Arc GPU | ~40 TOPS | $600–$900 | 35 W | Mini PC | Simple BC policies, low cost |
| Raspberry Pi 5 | ~5 TOPS | $80 | 5–10 W | SBC | MLP policies only, very simple tasks |

For most manipulation tasks running ACT or diffusion policy, the NVIDIA Jetson AGX Orin is the edge hardware of choice. Its 275 TOPS INT8 performance handles ACT inference in 30–80 ms and quantized diffusion in 20–50 ms.

Model Size vs. Inference Latency

| Model | Parameters | Edge (Jetson AGX) | Cloud (A100) | Quantized INT8 Edge |
|---|---|---|---|---|
| BC (MLP policy) | ~1M | 1–3 ms | <1 ms | 1–2 ms |
| ACT (original) | ~84M | 50–100 ms | 15–30 ms | 20–50 ms (FP16) |
| Diffusion Policy (U-Net) | ~100M | 30–80 ms | 10–30 ms | 20–50 ms |
| Diffusion Policy (Transformer) | ~300M | 200–500 ms | 80–200 ms | 100–250 ms |
| OpenVLA (7B) | 7B | 3–10 s | 200–500 ms | 1–3 s (INT4) |
| Octo (93M) | 93M | 30–100 ms | 15–40 ms | 20–60 ms |
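Figures like those in the table are only meaningful if measured the same way on your own hardware. A minimal benchmarking sketch (the helper name and stub policy are illustrative; on a real Jetson you would also synchronize the GPU, e.g. `torch.cuda.synchronize()`, before reading the clock):

```python
import time
import statistics

def benchmark_policy(policy, obs, warmup: int = 10, iters: int = 100):
    """Measure inference latency of `policy` (any callable obs -> action).

    Returns (median_ms, p99_ms). Warmup iterations are discarded so that
    cache population and JIT compilation do not skew the samples.
    """
    for _ in range(warmup):
        policy(obs)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        policy(obs)
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    p99 = samples[min(iters - 1, int(iters * 0.99))]
    return statistics.median(samples), p99
```

Reporting both the median and the p99 matters: a diffusion policy with a 40 ms median but a 120 ms p99 fails a contact-rich budget even though its "typical" latency looks fine.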

Quantization Strategies

Quantization reduces model precision to reduce memory footprint and increase inference speed, at a cost to accuracy:

  • FP32 → FP16 (half precision): 2× memory reduction, <1–3% accuracy loss on most manipulation models. Supported natively on all NVIDIA GPUs since Pascal. Recommended as the default for edge deployment.
  • FP16 → INT8: Further 2× memory reduction, 10–15% accuracy loss typical for manipulation policies. Acceptable for L1–L2 tasks; test carefully for L3+ precision tasks. Use NVIDIA TensorRT for Jetson-optimized INT8 inference.
  • FP16 → INT4 (4-bit): 4× reduction vs. FP16. 15–25% accuracy loss. Primarily useful for LLM-based models (OpenVLA) where the language backbone represents most parameters. Use bitsandbytes (NF4 format) or GPTQ.
  • TensorRT optimization: Beyond quantization, TensorRT fuses operations, optimizes memory layout, and compiles to optimized CUDA kernels for the target Jetson hardware. Often provides 2–4× additional speedup on top of FP16 quantization for convolutional models.
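The core arithmetic behind INT8 quantization is a per-tensor scale mapping weights onto the signed 8-bit range. Toolchains like TensorRT do this per channel with calibration data; the sketch below shows only the basic symmetric scheme (function names are illustrative):

```python
def quantize_int8(weights):
    """Per-tensor symmetric INT8 quantization: w ~= scale * q, q in [-127, 127].

    The scale is chosen so the largest-magnitude weight maps to +/-127;
    everything else is rounded to the nearest representable level.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate FP values from INT8 codes and the shared scale."""
    return [scale * v for v in q]
```

The rounding step is where the 10–15% accuracy loss comes from: weights far smaller than the tensor's maximum get crushed into a few quantization levels, which is exactly why per-channel scales and calibration improve on this per-tensor baseline.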

Cloud Inference: When It's Acceptable

Cloud inference is viable for specific use cases that can tolerate higher latency:

  • Policy selection / task planning: A high-level planner that selects which low-level skill to execute next can tolerate 500 ms–2 s latency. This is a natural split point: run VLM-based planning in the cloud, execute low-level skills at the edge.
  • Scene understanding: Semantic scene analysis (object classification, affordance estimation) for pre-grasp planning can run in the cloud if the robot pauses before initiating manipulation.
  • Non-real-time teleoperation monitoring: A cloud-based monitoring system that watches robot operation and flags anomalies (without taking direct control) tolerates any latency.
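The planning/execution split can be implemented by moving the high-latency cloud call onto a background worker, so the edge control loop never blocks on it. A minimal sketch, assuming a `plan_fn` callable that stands in for a cloud VLM request (hundreds of ms to seconds); all names here are illustrative:

```python
import queue
import threading

def start_cloud_planner(plan_fn, request_q, result_q):
    """Run cloud planning off the real-time control path.

    The edge loop pushes scene summaries onto `request_q`, keeps executing
    its current low-level skill, and swaps skills only when a fresh plan
    appears on `result_q`. A `None` request shuts the worker down.
    """
    def worker():
        while True:
            scene = request_q.get()
            if scene is None:  # shutdown sentinel
                break
            result_q.put(plan_fn(scene))
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t
```

The key design point is that `result_q.get()` is only ever polled non-blockingly from the control loop (e.g. `result_q.get_nowait()` inside a try/except), so a slow or dropped cloud response degrades planning freshness but never stalls actuation.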

Cost Comparison: 1-Year TCO

| Approach | Upfront | Ongoing/Year | 1-Year TCO | Notes |
|---|---|---|---|---|
| Jetson AGX Orin (1 robot) | $500–$800 | $0 (owned) | $800 | Carrier board adds $200–400 |
| RTX 4090 workstation (5 robots) | $3,000 | $0 (owned) | $3,000 | $600/robot amortized over 1 year |
| Cloud GPU (A100, on-demand) | $0 | $2,200–$4,400 | $2,200–$4,400 | $0.25–$0.50/hr × 8,760 hr (24/7) |
| Cloud GPU (reserved instance) | $0 | $800–$1,600 | $800–$1,600 | ~50% discount for 1-year reservation |
| Jetson + cloud hybrid | $500–$800 | $400–$800 | $900–$1,600 | Edge for real-time, cloud for training |

Edge hardware wins decisively for 24/7 operation. Cloud wins for development and fine-tuning cycles where GPU utilization is <40% (on-demand pricing). A hybrid approach — edge for production inference, cloud for periodic re-training — provides the best of both.
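The cloud figures follow directly from hourly rate × hours × utilization, which makes the break-even easy to recompute for your own duty cycle. A small sketch (function names are illustrative):

```python
HOURS_PER_YEAR = 8760  # 24/7 operation

def cloud_tco(rate_per_hr: float, utilization: float = 1.0) -> float:
    """One-year cloud GPU cost at a given hourly rate and utilization."""
    return rate_per_hr * HOURS_PER_YEAR * utilization

def edge_tco(hardware_cost: float, ongoing_per_year: float = 0.0) -> float:
    """One-year edge cost: owned hardware plus any ongoing spend."""
    return hardware_cost + ongoing_per_year
```

At $0.25/hr, full-time on-demand usage is $2,190/year, matching the low end of the table; at 40% utilization the same instance drops to $876/year, which is why cloud pricing becomes competitive with edge hardware for intermittent development workloads.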

The SVRC platform provides a cloud inference endpoint for development and evaluation, with edge deployment packaging (TensorRT, Jetson) for production.