Fleet Management Stack Components
A production-grade fleet management system has five essential components. Most teams build these incrementally — start with telemetry and remote access, then add the update system and alerting as the fleet scales past 10 robots.
- Device registry: A database of every robot in the fleet — serial number, model, firmware version, location, assigned operator, and current status. This is the source of truth for all other systems. A simple PostgreSQL table works; purpose-built solutions like AWS IoT Core or Azure IoT Hub provide this at scale.
- Telemetry pipeline: Streaming metrics from robot to cloud. Typically MQTT or gRPC from the robot, ingested into a time-series database (InfluxDB or TimescaleDB). Target <10 second latency for operational metrics, <1 second for safety-critical signals.
- Remote access layer: Authenticated access for operators and engineers to inspect and control robots without physical presence. Discussed in detail in the Remote Access Methods section.
- OTA update system: Mechanism to push firmware, software, and policy updates to robots in the field. Staged rollouts and rollback capability are non-negotiable.
- Alerting and on-call: Automated detection of anomalies with escalation to on-call engineers. PagerDuty or Opsgenie for on-call rotation management.
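As a sketch, the device registry can start as a single table. The schema below is illustrative only (column names are not prescribed), using Python's sqlite3 as a stand-in for PostgreSQL:

```python
import sqlite3

# Minimal device-registry schema; sqlite3 stands in for PostgreSQL here.
# Column names are illustrative, not a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE device_registry (
        serial_number TEXT PRIMARY KEY,
        model         TEXT NOT NULL,
        firmware      TEXT NOT NULL,
        location      TEXT,
        operator      TEXT,
        status        TEXT NOT NULL DEFAULT 'offline'
    )
""")
conn.execute(
    "INSERT INTO device_registry VALUES (?, ?, ?, ?, ?, ?)",
    ("RBT-0042", "arm-v2", "1.4.2", "site-berlin", "alice", "active"),
)
row = conn.execute(
    "SELECT model, status FROM device_registry WHERE serial_number = ?",
    ("RBT-0042",),
).fetchone()
```

Keeping the registry this simple early on makes it easy to join against telemetry and update-history tables later.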
Telemetry to Collect
Collect only what you will act on. Excessive telemetry increases costs and creates noise. These are the metrics that consistently prove valuable:
- Joint temperatures: Per-joint motor temperature in °C. Alert at 70°C (warning), emergency stop at 85°C. Elevated temperature is an early indicator of increased friction, impending bearing failure, or overloaded task profiles.
- Joint error codes: Any error code from the motor driver, logged with timestamp and joint ID. Error codes should be decoded to human-readable descriptions in your monitoring dashboard.
- Battery state of charge and voltage: Battery % and voltage at 1-minute intervals. Track charge cycle count for battery lifecycle management.
- Task success/failure rate: Per-episode result with task type, duration, and failure mode. The primary business KPI.
- Network latency: Round-trip latency from robot to cloud at 30-second intervals. Latency spikes >200 ms indicate network issues that will degrade teleoperation quality.
- Camera health: Frame rate and dropped frame count per camera. A camera dropping >5% of frames indicates a hardware issue or USB bandwidth saturation.
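A telemetry message covering these metrics can be as simple as a JSON payload published over MQTT. The field names and topic layout below are illustrative assumptions, not a fixed schema:

```python
import json
import time

def telemetry_payload(robot_id, joint_temps_c, battery_soc_pct,
                      battery_voltage_v, latency_ms):
    """Build one telemetry message covering the metrics listed above.
    Field names are illustrative, not a prescribed schema."""
    return {
        "robot_id": robot_id,
        "ts": time.time(),
        "joint_temp_c": joint_temps_c,          # per-joint motor temperature, °C
        "battery_soc_pct": battery_soc_pct,     # state of charge, sampled each minute
        "battery_voltage_v": battery_voltage_v,
        "net_latency_ms": latency_ms,           # robot→cloud round trip
    }

msg = telemetry_payload("RBT-0042", [41.2, 39.8, 55.0], 87, 25.1, 42)
# What an MQTT client would publish, e.g. on a topic like "fleet/RBT-0042/telemetry"
wire = json.dumps(msg)
```

Task results and camera health would typically be separate, lower-frequency messages rather than part of this periodic payload.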
Monitoring Infrastructure
The standard open-source monitoring stack works well for robot fleets up to ~200 units:
- Prometheus + Grafana: Prometheus scrapes metrics endpoints exposed by each robot's fleet agent at 15-second intervals. Grafana visualizes fleet-level dashboards: total uptime, per-robot health, task throughput, and alert history. Pre-built Grafana dashboards for robot fleets are available at grafana.com/grafana/dashboards.
- InfluxDB: For high-frequency telemetry (joint positions at 100 Hz), use InfluxDB's time-series compression rather than Prometheus (which is not optimized for high-cardinality, high-frequency data).
- PagerDuty: Manages on-call rotations and alert escalation. Integrate Prometheus alertmanager → PagerDuty for automated incident creation. Define separate escalation policies for safety alerts (immediate page) vs. maintenance alerts (business hours only).
- Custom fleet health dashboard: Build a single-screen "mission control" view in the platform showing: map of all robot locations with status indicators, top 5 failing tasks, fleet uptime percentage, and robots requiring maintenance.
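The metrics endpoint each fleet agent exposes to Prometheus serves plain text in the Prometheus exposition format. A minimal renderer is sketched below; the metric names are illustrative assumptions:

```python
def prometheus_lines(robot_id, metrics):
    """Render metrics in the Prometheus text exposition format, as served
    by the fleet agent's /metrics endpoint (scraped every 15 s).
    Metric names are illustrative."""
    lines = []
    for name, value in metrics.items():
        # Label each sample with the robot's ID so fleet-level Grafana
        # dashboards can group and filter per robot.
        lines.append(f'{name}{{robot_id="{robot_id}"}} {value}')
    return "\n".join(lines) + "\n"

out = prometheus_lines(
    "RBT-0042",
    {"joint_temp_celsius": 41.2, "battery_soc_percent": 87},
)
```

In practice you would use the official `prometheus_client` library rather than hand-formatting, but the wire format itself is this simple.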
Remote Access Methods
| Method | Latency | Security | Best For |
|---|---|---|---|
| SSH over VPN (WireGuard) | 20–80 ms depending on VPN server location | High — key-based auth, encrypted tunnel | Engineering diagnostics, log review, config changes |
| WebRTC remote desktop | 50–150 ms | Medium — requires signaling server security | Operator GUI access, rviz2 visualization |
| ROS2 bridge (rosbridge_suite) | 30–100 ms | Low by default — add TLS + auth explicitly | Programmatic telemetry access, remote monitoring scripts |
WireGuard VPN is the recommended foundation for all remote access. Deploy a WireGuard server (e.g., on a $5/month DigitalOcean droplet) and configure each robot as a WireGuard client with a unique key pair. All remote access happens over the VPN tunnel — SSH, web dashboards, and ROS2 bridge traffic are all tunneled, eliminating the need to expose any robot port directly to the internet.
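A per-robot WireGuard client config follows a fixed template; a sketch of a generator is below. The keys, addresses, and endpoint are placeholders — generate real key pairs with `wg genkey` / `wg pubkey`:

```python
def wireguard_client_conf(client_ip, client_private_key,
                          server_public_key, server_endpoint):
    """Render a per-robot WireGuard client config.
    Keys and addresses are placeholders, not real values."""
    return f"""[Interface]
PrivateKey = {client_private_key}
Address = {client_ip}/32

[Peer]
PublicKey = {server_public_key}
Endpoint = {server_endpoint}
AllowedIPs = 10.8.0.0/24
PersistentKeepalive = 25
"""

conf = wireguard_client_conf(
    "10.8.0.42",
    "<robot-private-key>",
    "<server-public-key>",
    "vpn.example.com:51820",
)
```

`PersistentKeepalive = 25` keeps NAT mappings alive so the server can always reach robots behind cellular or site NAT, and restricting `AllowedIPs` to the VPN subnet keeps regular internet traffic off the tunnel.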
Alert Thresholds
| Metric | Warning Threshold | Emergency Threshold | Automated Action |
|---|---|---|---|
| Joint temperature | >70°C | >85°C | Emergency: immediate e-stop |
| Task success rate (7-day rolling) | <80% | <60% | Emergency: suspend policy, alert on-call |
| Battery SoC | <20% | <10% | Emergency: return to charger or alert operator |
| Network latency (robot→cloud) | >200 ms | >500 ms | Warning: log; Emergency: disable teleoperation |
| Camera frame drop rate | >5% | >20% | Warning: log; Emergency: pause data collection |
| Consecutive task failures | 3 in a row | 5 in a row | Warning: operator alert; Emergency: suspend + escalate |
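The threshold table translates directly into an evaluation function the fleet agent or alerting layer can run on each reading. A sketch for a subset of the metrics (function and key names are illustrative):

```python
def classify(metric, value):
    """Map a metric reading to 'ok' / 'warning' / 'emergency' per the
    thresholds in the table above. Covers a subset of metrics; names
    are illustrative."""
    # metric: (warning_threshold, emergency_threshold, higher_is_worse)
    rules = {
        "joint_temp_c":    (70, 85, True),    # °C
        "task_success_7d": (0.80, 0.60, False),
        "battery_soc_pct": (20, 10, False),
        "latency_ms":      (200, 500, True),
    }
    warn, emerg, higher_is_worse = rules[metric]
    if higher_is_worse:
        if value > emerg:
            return "emergency"
        if value > warn:
            return "warning"
    else:
        if value < emerg:
            return "emergency"
        if value < warn:
            return "warning"
    return "ok"
```

The "Automated Action" column then becomes a simple dispatch on the returned severity (e-stop, suspend policy, page on-call, etc.).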
OTA Update Process
Over-the-air updates are how you ship improvements and security fixes to deployed robots without site visits. A disciplined process keeps the updates themselves from becoming a source of incidents:
- Build: Every update (firmware, software, or policy) is built in CI and produces a versioned artifact with a sha256 checksum. Artifacts are stored in a release registry (S3 bucket or Artifact Registry).
- Staging test: Before any field deployment, the update is applied to 2–3 staging robots in the lab and validated with a 50-trial automated test suite.
- Canary rollout (10%): Deploy to 10% of the fleet (or a minimum of 3 robots) for 48 hours. Monitor success rate, error codes, and telemetry for regressions.
- Full rollout: If canary metrics are nominal, roll out to the remaining fleet. Stagger the rollout at 25% per hour to avoid simultaneous restarts causing fleet-wide downtime.
- Rollback capability: Every robot maintains the previous version artifact locally. Rollback takes <2 minutes and can be triggered per-robot or fleet-wide from the management dashboard.
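The numeric rules in the process above (checksum verification, 10%-or-3-robots canary, 25%-per-hour stagger) are easy to encode. A minimal sketch, with function names as illustrative assumptions:

```python
import hashlib
import math

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Robots verify the artifact checksum before installing."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

def canary_size(fleet_size: int) -> int:
    """10% of the fleet, but never fewer than 3 robots."""
    return min(fleet_size, max(3, math.ceil(0.10 * fleet_size)))

def rollout_batches(robots: list, pct_per_hour: float = 0.25) -> list:
    """Split the post-canary fleet into hourly batches (25%/hour by
    default) so restarts never hit the whole fleet at once."""
    per_batch = max(1, math.ceil(len(robots) * pct_per_hour))
    return [robots[i:i + per_batch] for i in range(0, len(robots), per_batch)]

blob = b"firmware-blob"
ok = verify_artifact(blob, hashlib.sha256(blob).hexdigest())
```

In a real system the batch scheduler would also gate each step on the canary metrics (success rate, error codes) before proceeding.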
Fleet KPIs
| KPI | Definition | Target | Measurement Interval |
|---|---|---|---|
| MTBF (Mean Time Between Failures) | Average operating hours between unplanned stoppages | >200 hours | Monthly |
| MTTR (Mean Time to Repair) | Average time from incident detection to resumed operation | <2 hours | Monthly |
| Fleet uptime | % of scheduled operating hours spent in active operation | >95% | Weekly |
| Task completion rate | % of tasks completed successfully without human intervention | >90% | Daily |
| OTA update success rate | % of update deployments that succeed without rollback | >99% | Per-release |
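The KPI definitions reduce to simple ratios over each measurement interval. A sketch of the computation (input and key names are illustrative):

```python
def fleet_kpis(operating_hours, failures, repair_hours,
               scheduled_hours, tasks_attempted, tasks_succeeded):
    """Compute the fleet KPIs defined above from totals aggregated over
    the relevant measurement interval. Names are illustrative."""
    return {
        # MTBF: operating hours per unplanned stoppage
        "mtbf_hours": operating_hours / failures if failures else float("inf"),
        # MTTR: detection-to-resumption time averaged over incidents
        "mttr_hours": repair_hours / failures if failures else 0.0,
        "uptime_pct": 100.0 * operating_hours / scheduled_hours,
        "task_completion_pct": 100.0 * tasks_succeeded / tasks_attempted,
    }

kpis = fleet_kpis(operating_hours=950, failures=4, repair_hours=6,
                  scheduled_hours=1000, tasks_attempted=2000,
                  tasks_succeeded=1860)
```

With these example totals the fleet meets the MTBF (237.5 h), MTTR (1.5 h), and uptime (95.0%) targets, and sits just above the 90% task-completion target at 93.0%.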