Decision 1: ROS2 vs. Custom Middleware
This is the question every robotics startup debates, and the one most get wrong by over-engineering. ROS2 gets you a running robot faster, gives you access to a large ecosystem of drivers and tools, and makes hiring easier — most robotics engineers already know ROS2. Its pub/sub architecture handles the complexity of multi-node systems, and packages like MoveIt2, Nav2, and ros2_control solve problems that would take months to implement from scratch.
Custom middleware delivers two genuine advantages: latency below 5ms (ROS2's DDS overhead makes this nearly impossible) and freedom from licensing concerns if your commercial deployment model involves redistributing a modified middleware layer. There's also a real argument that custom middleware is simpler to debug when you fully own the stack.
The practical rule: use ROS2 unless you are at Series B or later with a shipping product that has demonstrated real latency constraints. Premature optimization of middleware is one of the most expensive mistakes in robotics startups — the opportunity cost of 6 months rebuilding DDS is enormous in the early stages. If latency becomes a real constraint later, you can always replace the transport layer while keeping ROS2 interfaces.
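The "replace the transport layer while keeping ROS2 interfaces" point can be illustrated with a toy in-process bus. This is a sketch, not ROS2 API: `MessageBus` and the topic name are hypothetical, but the topic/callback shape mirrors the pub/sub pattern, so node code written against it would survive a transport swap (in-process here, DDS in ROS2, a custom layer later).

```python
from collections import defaultdict
from typing import Any, Callable

class MessageBus:
    """Minimal in-process pub/sub bus. Node code depends only on this
    interface; the transport underneath can be swapped out without
    touching subscribers or publishers."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, msg: Any) -> None:
        # Deliver synchronously to every subscriber on this topic.
        for cb in self._subscribers[topic]:
            cb(msg)

bus = MessageBus()
received = []
bus.subscribe("/joint_states", received.append)
bus.publish("/joint_states", {"positions": [0.0, 1.2]})
```

The design choice being demonstrated: as long as publishers and subscribers only see topics and callbacks, the decision between DDS and a custom transport stays reversible.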
Decision 2: Simulation Platform
The simulation choice shapes your training loop for years. Three platforms dominate for different reasons:
- NVIDIA Isaac Lab: Best choice if GPU-accelerated RL training is your primary use case. 4,000+ parallel environments on a single A100, tight Isaac Sim integration, MIT license, growing model zoo including Unitree G1 and Franka. Weakest on contact accuracy — fine for locomotion and pick-place, struggles for precision assembly.
- MuJoCo: Best contact physics in any general-purpose simulator. Constraint-based dynamics with stable contact handling at high compliance. Choose this for dexterous hand research, contact-rich manipulation, or any task where contact accuracy matters more than parallelization. Now free under Apache 2.0 since Google acquisition.
- Gazebo (Ignition): Choose only if deep ROS2 integration is a hard requirement — Gazebo is the official ROS2 simulator and has the tightest toolchain integration. Its physics is weaker than both Isaac Lab's and MuJoCo's. Acceptable for navigation, sensor simulation, and integration testing.
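Why parallel environments matter for RL training can be seen in miniature: the sample-collection loop steps many environments in lockstep, and that loop is exactly what GPU simulators like Isaac Lab vectorize across thousands of instances. A stdlib-only sketch with a hypothetical toy environment (not any simulator's real API):

```python
import math

class PendulumEnv:
    """Toy environment with overdamped first-order pendulum dynamics;
    stands in for one physics-simulator instance."""

    def __init__(self, theta: float = 0.5) -> None:
        self.theta = theta

    def step(self, torque: float, dt: float = 0.01) -> float:
        # Overdamped dynamics: theta' = -sin(theta) + torque.
        self.theta += dt * (-math.sin(self.theta) + torque)
        return self.theta

def rollout_batch(envs, torques, steps):
    """Step every environment in lockstep. GPU simulators run this same
    loop as one batched kernel over thousands of environments."""
    obs = [env.theta for env in envs]
    for _ in range(steps):
        obs = [env.step(t) for env, t in zip(envs, torques)]
    return obs

envs = [PendulumEnv(0.5) for _ in range(8)]
final = rollout_batch(envs, [0.0] * 8, steps=100)
```

With zero torque each pendulum relaxes toward the stable equilibrium, so all eight trajectories decay together — one batched rollout, eight environments' worth of samples per step.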
Decision 3: Cloud vs. Edge Inference
Inference latency determines where your model runs. The threshold is approximately 150ms round-trip: below it, cloud inference from a co-located data center is viable for most applications; above it, teleoperation develops perceptible lag and closed-loop manipulation controllers become unstable.
Jetson AGX Orin (275 TOPS, $499) delivers sub-100ms inference for policies up to ~100M parameters and is the standard edge inference platform for robot deployments. For larger models (ACT, Diffusion Policy at full resolution), expect 120–200ms on Orin — acceptable for non-real-time tasks. Jetson Thor, launching in 2025, targets 1,000 TOPS and will push edge inference capability significantly.
Cloud inference makes sense when: robot connectivity is reliable (lab or factory floor with fiber), latency > 150ms is acceptable for the application, and model size exceeds what edge hardware can serve. A hybrid approach works well — run fast low-latency reactive controllers on-device, offload slow planning and perception to cloud.
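The hybrid pattern above can be sketched as a routing function around the 150ms budget. Everything here is illustrative (`route_inference` and the stubbed edge/cloud callables are hypothetical, not a real serving API): prefer the larger cloud model when the measured round-trip fits the budget, and fall back to the on-device policy otherwise.

```python
from typing import Any, Callable

LATENCY_BUDGET_S = 0.150  # ~150 ms round-trip threshold

def route_inference(measure_rtt: Callable[[], float],
                    run_edge: Callable[[Any], Any],
                    run_cloud: Callable[[Any], Any],
                    obs: Any) -> Any:
    """Route one inference call: cloud when the measured round-trip
    is within budget, on-device policy otherwise."""
    rtt = measure_rtt()
    if rtt <= LATENCY_BUDGET_S:
        return run_cloud(obs)
    return run_edge(obs)

# Usage with stubbed components: a simulated 200 ms RTT blows the
# budget, so the call falls back to the edge policy.
action = route_inference(
    measure_rtt=lambda: 0.200,
    run_edge=lambda o: ("edge", o),
    run_cloud=lambda o: ("cloud", o),
    obs=[0.1, 0.2],
)
```

A real system would measure RTT continuously and hysterese the switch rather than probing per call, but the split is the same: fast reactive control stays on-device, slower planning and perception can tolerate the network hop.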
Decision 4: Data Platform Strategy
The build-vs-buy decision on data infrastructure is simpler than it appears. Build your own data platform only if you have a dedicated ML infrastructure engineer on staff whose primary job is data tooling — a dedicated infra engineer, not a researcher who also maintains tooling on the side. Otherwise, the maintenance burden compounds into a significant ongoing tax on your most expensive people.
The core capabilities you need: episode storage with versioning, metadata indexing for fast retrieval, visualization for QA, dataset splitting and export to training formats (HDF5/Zarr/RLDS), and access control for multi-operator environments. Building this from scratch takes 3–6 months and requires continuous maintenance.
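To make the scope concrete, here is a minimal sketch of just the first two capabilities (versioned episode storage plus a metadata index), stdlib-only and hypothetical throughout: content-addressed IDs stand in for versioning, and a real system would store array payloads in HDF5/Zarr rather than JSON.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

def save_episode(root: Path, episode: dict, tags) -> str:
    """Store one episode as JSON and append an entry to a metadata
    index. The episode ID is a hash of the content, so re-saving
    identical data yields the same ID (cheap de-duplication)."""
    payload = json.dumps(episode, sort_keys=True).encode()
    ep_id = hashlib.sha256(payload).hexdigest()[:12]
    root.mkdir(parents=True, exist_ok=True)
    (root / f"{ep_id}.json").write_bytes(payload)
    # Append-only index supports fast retrieval by tag without
    # re-reading episode payloads.
    with (root / "index.jsonl").open("a") as f:
        f.write(json.dumps({"id": ep_id, "tags": list(tags),
                            "ts": time.time()}) + "\n")
    return ep_id

store = Path(tempfile.mkdtemp()) / "episodes"
ep_id = save_episode(store, {"task": "pick", "actions": [[0.1, 0.2]]},
                     tags=["pick", "operator_a"])
```

Even this toy version hints at the maintenance tail: the real list still includes visualization, dataset splitting, export formats, and access control, which is where the 3–6 month estimate comes from.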
The principle that should guide every stack decision: buy infrastructure, build differentiation. Your competitive advantage is your robot hardware, your task expertise, and your policy architecture — not your episode storage system. Spend engineering time accordingly.
The SVRC platform provides the full data infrastructure stack — collection, storage, annotation, training pipeline — as a managed service with API access.
Recommended Stack by Stage
| Stage | Middleware | Sim | Inference | Data Platform |
|---|---|---|---|---|
| Pre-seed / Seed | ROS2 | Isaac Lab or MuJoCo | Edge (Orin) | SVRC or HuggingFace LeRobot |
| Series A | ROS2 | Isaac Lab + MuJoCo | Hybrid edge+cloud | SVRC or build if infra hire |
| Series B+ | ROS2 or custom if proven need | Custom + one of above | Custom serving | Build if 2+ dedicated infra eng |