Latency Requirements by Task Type
Before choosing an inference architecture, you need a precise latency requirement for your task. The required latency is set by the robot's control frequency and the nature of the manipulation:
- Free-space motion (arm moving to a target): 50–100 ms inference latency is acceptable. The arm moves slowly during approach phases, and a policy query that is 100 ms stale does not cause positioning errors beyond the task tolerance.
- Contact-rich manipulation (insertion, assembly): 10–20 ms inference latency required. At the moment of contact, small state changes (0.5–1 mm position error) can cause task failure, and the policy must re-query frequently to stay on-distribution.
- Reactive grasping (moving object, conveyor): <50 ms required. The object's position changes between policy queries; stale queries lead to systematic misses.
- Safety-critical stop: <1 ms. This cannot depend on the policy inference loop at all — it must be handled by a dedicated safety controller on the robot.
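The budgets above can be captured as a simple lookup used to sanity-check a deployment before committing to hardware. A minimal sketch (budget values copied from this section; the function name is ours):

```python
# Latency budgets from the task-type breakdown above (illustrative;
# adjust to your robot's control frequency and task tolerances).
LATENCY_BUDGET_MS = {
    "free_space_motion": 100,  # slow approach moves tolerate stale queries
    "contact_rich": 20,        # insertion/assembly needs frequent re-query
    "reactive_grasping": 50,   # object moves between queries
    "safety_stop": 1,          # must bypass the policy loop entirely
}

def meets_budget(task_type: str, measured_latency_ms: float) -> bool:
    """True if a measured end-to-end inference latency fits the task budget."""
    return measured_latency_ms <= LATENCY_BUDGET_MS[task_type]

# An 80 ms edge policy is fine for free-space motion but too slow for insertion.
print(meets_budget("free_space_motion", 80))
print(meets_budget("contact_rich", 80))
```

Note that `safety_stop` appears in the table only to make the point explicit: no policy inference path should ever be on that code path.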
Edge Hardware Options
Edge inference hardware is co-located with the robot, eliminating network round-trip latency. The tradeoff is higher upfront cost and responsibility for local maintenance.
| Hardware | TOPS (INT8) | Price | Power | Form Factor | Best For |
|---|---|---|---|---|---|
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS | $499 (module) | 15–60W | Embedded module | Full policy inference at edge |
| NVIDIA Jetson Orin NX 16GB | 100 TOPS | $299 (module) | 10–25W | Embedded module | Smaller models, power-constrained |
| NVIDIA RTX 4090 (workstation) | ~1,300 TOPS | $1,600 | 450W TDP | Desktop PCIe | Large model inference, multiple robots |
| Intel NUC with Arc GPU | ~40 TOPS | $600–$900 | 35W | Mini PC | Simple BC policies, low cost |
| Raspberry Pi 5 | ~5 TOPS | $80 | 5–10W | SBC | MLP policies only, very simple tasks |
For most manipulation tasks running ACT or diffusion policy, the NVIDIA Jetson AGX Orin is the edge hardware of choice. Its 275 TOPS INT8 performance handles ACT inference in 30–80 ms and quantized diffusion in 20–50 ms.
Model Size vs. Inference Latency
| Model | Parameters | Edge (Jetson AGX) | Cloud (A100) | Quantized Edge |
|---|---|---|---|---|
| BC (MLP policy) | ~1M | 1–3 ms | <1 ms | 1–2 ms |
| ACT (original) | ~84M | 50–100 ms | 15–30 ms | 20–50 ms (FP16) |
| Diffusion Policy (U-Net) | ~100M | 30–80 ms | 10–30 ms | 20–50 ms |
| Diffusion Policy (Transformer) | ~300M | 200–500 ms | 80–200 ms | 100–250 ms |
| OpenVLA (7B) | 7B | 3–10 s | 200–500 ms | 1–3 s (INT4) |
| Octo (93M) | 93M | 30–100 ms | 15–40 ms | 20–60 ms |
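One way to use this table together with the latency requirements above is to convert a worst-case inference latency into the maximum synchronous re-query rate it supports. A minimal sketch (latency figures taken from the table; the helper name is ours):

```python
def max_query_rate_hz(worst_case_latency_ms: float) -> float:
    """Highest policy re-query rate sustainable when each query blocks on inference."""
    return 1000.0 / worst_case_latency_ms

# Worst-case edge (Jetson AGX) latencies from the table above.
act_edge_ms = 100        # ACT (original), unquantized
diffusion_unet_ms = 80   # Diffusion Policy (U-Net)

print(max_query_rate_hz(act_edge_ms))        # 10.0 Hz
print(max_query_rate_hz(diffusion_unet_ms))  # 12.5 Hz
```

At 10–12 Hz, unquantized ACT or U-Net diffusion on a Jetson AGX fits free-space motion but not the 50–100 Hz re-query rates implied by contact-rich budgets, which is exactly where quantization (next section) earns its keep.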
Quantization Strategies
Quantization lowers numerical precision to shrink the memory footprint and speed up inference, at some cost to accuracy:
- FP32 → FP16 (half precision): 2× memory reduction, <1–3% accuracy loss on most manipulation models. Supported natively on all NVIDIA GPUs since Pascal. Recommended as the default for edge deployment.
- FP16 → INT8: Further 2× memory reduction, 10–15% accuracy loss typical for manipulation policies. Acceptable for L1–L2 tasks; test carefully for L3+ precision tasks. Use NVIDIA TensorRT for Jetson-optimized INT8 inference.
- FP16 → INT4 (4-bit): 4× reduction vs. FP16. 15–25% accuracy loss. Primarily useful for LLM-based models (OpenVLA) where the language backbone represents most parameters. Use bitsandbytes (NF4 format) or GPTQ.
- TensorRT optimization: Beyond quantization, TensorRT fuses operations, optimizes memory layout, and compiles to optimized CUDA kernels for the target Jetson hardware. Often provides 2–4× additional speedup on top of FP16 quantization for convolutional models.
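The memory side of these tradeoffs is simple arithmetic: parameter count × bytes per parameter. A quick sketch using parameter counts from the model table above (weight storage only; activations and KV caches are ignored):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_mb(num_params: float, precision: str) -> float:
    """Approximate weight storage for a model at the given precision, in MB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6

# ACT (~84M params): FP32 -> FP16 halves the footprint.
print(weight_memory_mb(84e6, "fp32"))  # 336.0 MB
print(weight_memory_mb(84e6, "fp16"))  # 168.0 MB
# OpenVLA (7B params) at INT4 still needs ~3.5 GB for weights alone.
print(weight_memory_mb(7e9, "int4"))   # 3500.0 MB
```

This is why INT4 matters mainly for the 7B-class VLA models: the smaller policies already fit comfortably in Jetson memory at FP16.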
Cloud Inference: When It's Acceptable
Cloud inference is viable for specific use cases that can tolerate higher latency:
- Policy selection / task planning: A high-level planner that selects which low-level skill to execute next can tolerate 500 ms–2 s latency. This is a natural split point: run VLM-based planning in the cloud, execute low-level skills at the edge.
- Scene understanding: Semantic scene analysis (object classification, affordance estimation) for pre-grasp planning can run in the cloud if the robot pauses before initiating manipulation.
- Non-real-time teleoperation monitoring: A cloud-based monitoring system that watches robot operation and flags anomalies (without taking direct control) tolerates any latency.
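The planner/skill split described above can be expressed as a single routing rule: dispatch a request to the cloud only when its latency tolerance comfortably exceeds the expected round trip. A minimal sketch (the 150 ms round-trip figure and the function name are illustrative assumptions, not from this section):

```python
CLOUD_ROUND_TRIP_MS = 150  # assumed network + queueing overhead; measure your own

def route(request_kind: str, latency_tolerance_ms: float) -> str:
    """Send latency-tolerant work (planning, scene analysis, monitoring)
    to the cloud; keep real-time control queries on the edge device."""
    if latency_tolerance_ms > CLOUD_ROUND_TRIP_MS:
        return "cloud"
    return "edge"

print(route("task_planning", 2000))   # high-level VLM planning: cloud
print(route("low_level_skill", 20))   # contact-rich control: edge
```

The routing decision is made per request kind, not per robot, which is what makes the hybrid architecture below practical.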
Cost Comparison: 1-Year TCO
| Approach | Upfront | Ongoing/Year | 1-Year TCO | Notes |
|---|---|---|---|---|
| Jetson AGX Orin (1 robot) | $500–$800 | $0 (owned) | $800 | Carrier board adds $200–400 |
| RTX 4090 workstation (5 robots) | $3,000 | $0 (owned) | $3,000 | $600/robot amortized over 1 year |
| Cloud GPU (A100, on-demand) | $0 | $2,200–$4,400 | $2,200–$4,400 | $0.25–$0.50/hr × 8,760 hr (24/7) |
| Cloud GPU (reserved instance) | $0 | $800–$1,600 | $800–$1,600 | ~50% discount for 1-year reservation |
| Jetson + cloud hybrid | $500–$800 | $400–$800 | $900–$1,600 | Edge for real-time, cloud for training |
Edge hardware wins decisively for 24/7 operation. Cloud wins for development and fine-tuning cycles where GPU utilization is <40% (on-demand pricing). A hybrid approach — edge for production inference, cloud for periodic re-training — provides the best of both.
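The break-even point between owned edge hardware and on-demand cloud falls directly out of the table: divide the upfront edge cost by the cloud hourly rate. A rough sketch using the figures above:

```python
def break_even_hours(edge_upfront_usd: float, cloud_rate_usd_per_hr: float) -> float:
    """GPU-hours per year at which on-demand cloud cost matches owned edge hardware."""
    return edge_upfront_usd / cloud_rate_usd_per_hr

# Jetson AGX Orin (~$800 with carrier board) vs. A100 on-demand at $0.25-0.50/hr.
print(break_even_hours(800, 0.25))  # 3200.0 hours (~37% of the year's 8,760 hours)
print(break_even_hours(800, 0.50))  # 1600.0 hours (~18% of the year)
```

Above the break-even hours, edge hardware is cheaper; below it, on-demand cloud wins, which is where the <40% utilization rule of thumb comes from.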
The SVRC platform provides a cloud inference endpoint for development and evaluation, with edge deployment packaging (TensorRT, Jetson) for production.