Decision 1: ROS2 vs. Custom Middleware

This is the question every robotics startup debates and most get wrong by over-engineering. ROS2 gets you a running robot faster, gives you access to a large ecosystem of drivers and tools, and makes it easier to hire — most robotics engineers know ROS2. The pub/sub architecture handles the complexity of multi-node systems, and packages like MoveIt2, Nav2, and ros2_control handle problems that would take months to implement from scratch.

Custom middleware delivers two genuine advantages: latency below 5ms (ROS2's DDS overhead makes this nearly impossible) and freedom from licensing concerns if your commercial deployment model involves redistributing a modified middleware layer. There's also a real argument that custom middleware is simpler to debug when you fully own the stack.

The practical rule: use ROS2 unless you are at Series B or later with a shipping product that has demonstrated real latency constraints. Premature optimization of middleware is one of the most expensive mistakes in robotics startups — the opportunity cost of 6 months rebuilding DDS is enormous in the early stages. If latency becomes a real constraint later, you can always replace the transport layer while keeping ROS2 interfaces.

Decision 2: Simulation Platform

The simulation choice shapes your training loop for years. Three platforms dominate for different reasons:

  • NVIDIA Isaac Lab: Best choice if GPU-accelerated RL training is your primary use case. 4,000+ parallel environments on a single A100, tight Isaac Sim integration, MIT license, growing model zoo including Unitree G1 and Franka. Weakest on contact accuracy — fine for locomotion and pick-place, struggles for precision assembly.
  • MuJoCo: Best contact physics in any general-purpose simulator. Constraint-based dynamics with stable contact handling at high compliance. Choose this for dexterous hand research, contact-rich manipulation, or any task where contact accuracy matters more than parallelization. Free under Apache 2.0 since the DeepMind acquisition.
  • Gazebo (Ignition): Choose only if deep ROS2 integration is a hard requirement — Gazebo is the official ROS2 simulator and has the tightest toolchain integration. Physics is weaker than both Isaac Lab and MuJoCo. Acceptable for navigation, sensor simulation, and integration testing.
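
The three-way choice above reduces to a priority order. A sketch, assuming contact accuracy trumps throughput (function name and defaults are illustrative):

```python
def choose_simulator(contact_rich: bool, gpu_rl_training: bool,
                     hard_ros2_requirement: bool) -> str:
    """Priority order implied above: contact accuracy first,
    then RL throughput, then ROS2 toolchain integration."""
    if contact_rich:
        return "MuJoCo"           # best general-purpose contact physics
    if gpu_rl_training:
        return "Isaac Lab"        # thousands of parallel GPU environments
    if hard_ros2_requirement:
        return "Gazebo"           # official ROS2 simulator
    return "Isaac Lab or MuJoCo"  # early-stage default
```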

Decision 3: Cloud vs. Edge Inference

Inference latency determines where your model runs. The threshold is approximately 150ms round-trip: below that, cloud inference from a co-located data center is viable for most applications. Above 150ms, teleoperation feels perceptibly laggy and closed-loop manipulation controllers can become unstable.

Jetson AGX Orin (275 TOPS, $499) delivers sub-100ms inference for policies up to ~100M parameters and is the standard edge inference platform for robot deployments. For larger models (ACT, Diffusion Policy at full resolution), expect 120–200ms on Orin — acceptable for non-real-time tasks. Jetson Thor, launching in 2025, targets 1,000 TOPS and will push edge inference capability significantly.

Cloud inference makes sense when robot connectivity is reliable (a lab or factory floor with fiber), the application tolerates latency above 150ms, and model size exceeds what edge hardware can serve. A hybrid approach works well: run low-latency reactive controllers on-device and offload slower planning and perception to the cloud.
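
The placement rule can be sketched with the numbers above — the 150ms round-trip threshold and Orin's roughly 100M-parameter sub-100ms envelope. Names and constants are illustrative:

```python
CLOUD_RTT_LIMIT_MS = 150.0   # round-trip threshold from the text
ORIN_PARAM_LIMIT_M = 100.0   # ~100M params fits Orin's sub-100ms envelope

def place_inference(model_params_m: float, network_rtt_ms: float,
                    closed_loop: bool) -> str:
    """Closed-loop control stays on-device; large models go to the
    cloud only when the round trip fits under ~150ms."""
    if closed_loop:
        return "edge"    # reactive controllers run on-device
    if model_params_m <= ORIN_PARAM_LIMIT_M:
        return "edge"    # small enough to serve locally
    if network_rtt_ms < CLOUD_RTT_LIMIT_MS:
        return "cloud"   # co-located data center is viable
    return "hybrid"      # split reactive vs. planning workloads
```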

Decision 4: Data Platform Strategy

The build-vs-buy decision on data infrastructure is simpler than it appears. Build your own data platform only if you have a dedicated ML infrastructure engineer on staff whose primary job is data tooling, not a researcher who also maintains tooling on the side. Otherwise, the maintenance burden compounds into a significant ongoing tax on your most expensive people.

The core capabilities you need: episode storage with versioning, metadata indexing for fast retrieval, visualization for QA, dataset splitting and export to training formats (HDF5/Zarr/RLDS), and access control for multi-operator environments. Building this from scratch takes 3–6 months and requires continuous maintenance.
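
As a toy illustration of the storage-plus-indexing capability — not SVRC's API, just stdlib, with JSON standing in for HDF5/Zarr/RLDS export:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Episode:
    episode_id: str
    task: str
    robot: str
    version: int = 1                              # episode versioning
    tags: list[str] = field(default_factory=list) # QA / operator metadata

class EpisodeIndex:
    """In-memory metadata index: store episodes, query by task,
    export a task split to a training-format stand-in (JSON)."""
    def __init__(self) -> None:
        self._by_id: dict[str, Episode] = {}

    def add(self, ep: Episode) -> None:
        self._by_id[ep.episode_id] = ep

    def query(self, task: str) -> list[Episode]:
        return [e for e in self._by_id.values() if e.task == task]

    def export(self, task: str) -> str:
        return json.dumps([asdict(e) for e in self.query(task)])
```

A real system adds durable storage, access control, and visualization on top of this skeleton — which is precisely the 3–6 months of work the build-vs-buy decision is about.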

The principle that should guide every stack decision: buy infrastructure, build differentiation. Your competitive advantage is your robot hardware, your task expertise, and your policy architecture — not your episode storage system. Spend engineering time accordingly.

The SVRC platform provides the full data infrastructure stack — collection, storage, annotation, training pipeline — as a managed service with API access.

Recommended Stack by Stage

Stage             Middleware                       Sim                      Inference            Data Platform
Pre-seed / Seed   ROS2                             Isaac Lab or MuJoCo      Edge (Orin)          SVRC or HuggingFace LeRobot
Series A          ROS2                             Isaac Lab + MuJoCo       Hybrid edge+cloud    SVRC or build if infra hire
Series B+         ROS2 or custom if proven need    Custom + one of above    Custom serving       Build if 2+ dedicated infra eng