Fleet Management Stack Components

A production-grade fleet management system has five essential components. Most teams build these incrementally — start with telemetry and remote access, then add the update system and alerting as the fleet scales past 10 robots.

  • Device registry: A database of every robot in the fleet — serial number, model, firmware version, location, assigned operator, and current status. This is the source of truth for all other systems. A simple PostgreSQL table works; purpose-built solutions like AWS IoT or Azure IoT Hub provide this at scale.
  • Telemetry pipeline: Streaming metrics from robot to cloud. Typically MQTT or gRPC from the robot, ingested into a time-series database (InfluxDB or TimescaleDB). Target <10 second latency for operational metrics, <1 second for safety-critical signals.
  • Remote access layer: Authenticated access for operators and engineers to inspect and control robots without physical presence. Discussed in detail in the Remote Access Methods section.
  • OTA update system: Mechanism to push firmware, software, and policy updates to robots in the field. Staged rollouts and rollback capability are non-negotiable.
  • Alerting and on-call: Automated detection of anomalies with escalation to on-call engineers. PagerDuty or OpsGenie for on-call rotation management.
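Since the device registry is the source of truth for everything else, it helps to pin down what one row carries. A minimal sketch, with field names that are illustrative rather than a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RobotRecord:
    """One row in the device registry -- the source of truth for a robot."""
    serial_number: str
    model: str
    firmware_version: str
    location: str
    assigned_operator: str
    status: str          # e.g. "active", "maintenance", "offline"
    last_seen: datetime  # updated by the telemetry pipeline

record = RobotRecord(
    serial_number="RB-0042",
    model="arm-v2",
    firmware_version="1.8.3",
    location="site-a/cell-3",
    assigned_operator="j.doe",
    status="active",
    last_seen=datetime.now(timezone.utc),
)
```

The same fields map directly onto a PostgreSQL table or an AWS IoT / Azure IoT Hub device twin.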

Telemetry to Collect

Collect only what you will act on. Excessive telemetry increases costs and creates noise. These are the metrics that consistently prove valuable:

  • Joint temperatures: Per-joint motor temperature in °C. Alert at 70°C (warning), emergency stop at 85°C. Elevated temperature is an early indicator of increased friction, impending bearing failure, or overloaded task profiles.
  • Joint error codes: Any error code from the motor driver, logged with timestamp and joint ID. Error codes should be decoded to human-readable descriptions in your monitoring dashboard.
  • Battery state of charge and voltage: Battery % and voltage at 1-minute intervals. Track charge cycle count for battery lifecycle management.
  • Task success/failure rate: Per-episode result with task type, duration, and failure mode. The primary business KPI.
  • Network latency: Round-trip latency from robot to cloud at 30-second intervals. Latency spikes >200 ms indicate network issues that will degrade teleoperation quality.
  • Camera health: Frame rate and dropped frame count per camera. A camera dropping >5% of frames indicates a hardware issue or USB bandwidth saturation.
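A single robot→cloud telemetry payload covering several of these metrics might look like the following sketch. Field names and units are assumptions, not a fixed schema; in practice this would be published over MQTT or gRPC:

```python
import json
import time

def telemetry_message(robot_id, joint_temps_c, battery_soc_pct, battery_voltage_v):
    """Build one telemetry payload for the robot -> cloud pipeline.

    joint_temps_c: per-joint motor temperatures in deg C.
    """
    return json.dumps({
        "robot_id": robot_id,
        "ts": time.time(),                     # epoch seconds
        "joint_temp_c": joint_temps_c,
        "battery_soc_pct": battery_soc_pct,
        "battery_voltage_v": battery_voltage_v,
    })

msg = telemetry_message("RB-0042", [41.5, 39.8, 55.2], 78, 48.1)
```

Keeping the payload flat and explicitly unit-suffixed (`_c`, `_pct`, `_v`) avoids ambiguity once multiple robot models feed the same pipeline.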

Monitoring Infrastructure

The standard open-source monitoring stack works well for robot fleets up to ~200 units:

  • Prometheus + Grafana: Prometheus scrapes metrics endpoints exposed by each robot's fleet agent at 15-second intervals. Grafana visualizes fleet-level dashboards: total uptime, per-robot health, task throughput, and alert history. Pre-built Grafana dashboards for robot fleets are available at grafana.com/grafana/dashboards.
  • InfluxDB: For high-frequency telemetry (joint positions at 100 Hz), use InfluxDB's time-series compression rather than Prometheus (which is not optimized for high-cardinality, high-frequency data).
  • PagerDuty: Manages on-call rotations and alert escalation. Integrate Prometheus alertmanager → PagerDuty for automated incident creation. Define separate escalation policies for safety alerts (immediate page) vs. maintenance alerts (business hours only).
  • Custom fleet health dashboard: Build a single-screen "mission control" view showing: a map of all robot locations with status indicators, the top 5 failing tasks, fleet uptime percentage, and robots requiring maintenance.
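The metrics endpoint that Prometheus scrapes from each robot's fleet agent speaks the Prometheus text exposition format, which is plain enough to render without any library. A stdlib-only sketch (metric names are illustrative):

```python
def prometheus_lines(robot_id: str, metrics: dict) -> str:
    """Render metrics in the Prometheus text exposition format, as a
    fleet agent's /metrics endpoint would serve them."""
    lines = []
    for name, value in sorted(metrics.items()):
        # One sample per line: name{labels} value
        lines.append(f'{name}{{robot="{robot_id}"}} {value}')
    return "\n".join(lines)

out = prometheus_lines("RB-0042", {
    "joint_temp_celsius_max": 55.2,
    "battery_soc_percent": 78,
    "task_success_ratio_7d": 0.91,
})
```

In production the `prometheus_client` library handles this (plus HELP/TYPE metadata), but the wire format is exactly these labeled samples.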

Remote Access Methods

| Method | Latency | Security | Best For |
| --- | --- | --- | --- |
| SSH over VPN (WireGuard) | 20–80 ms, depending on VPN server location | High: key-based auth, encrypted tunnel | Engineering diagnostics, log review, config changes |
| WebRTC remote desktop | 50–150 ms | Medium: requires signaling-server security | Operator GUI access, rviz2 visualization |
| ROS2 bridge (rosbridge_suite) | 30–100 ms | Low by default: add TLS + auth explicitly | Programmatic telemetry access, remote monitoring scripts |

WireGuard VPN is the recommended foundation for all remote access. Deploy a WireGuard server (e.g., on a $5/month DigitalOcean droplet) and configure each robot as a WireGuard client with a unique key pair. All remote access happens over the VPN tunnel — SSH, web dashboards, and ROS2 bridge traffic are all tunneled, eliminating the need to expose any robot port directly to the internet.
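A robot-side client config under this setup might look like the following sketch. The keys, VPN subnet, and endpoint hostname are placeholders to be replaced per deployment:

```ini
# /etc/wireguard/wg0.conf on the robot (placeholder keys and addresses)
[Interface]
PrivateKey = <robot-private-key>
Address = 10.8.0.42/32          # unique VPN address per robot

[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.example.com:51820
AllowedIPs = 10.8.0.0/24        # route only VPN-subnet traffic through the tunnel
PersistentKeepalive = 25        # keep NAT mappings alive for robots behind NAT
```

Restricting `AllowedIPs` to the VPN subnet keeps the robot's normal internet traffic off the tunnel, and `PersistentKeepalive` matters because most robots sit behind site NAT with no inbound reachability.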

Alert Thresholds

| Metric | Warning Threshold | Emergency Threshold | Automated Action |
| --- | --- | --- | --- |
| Joint temperature | >70°C | >85°C | Emergency: immediate e-stop |
| Task success rate (7-day rolling) | <80% | <60% | Emergency: suspend policy, alert on-call |
| Battery SoC | <20% | <10% | Emergency: return to charger or alert operator |
| Network latency (robot→cloud) | >200 ms | >500 ms | Warning: log; Emergency: disable teleoperation |
| Camera frame drop rate | >5% | >20% | Warning: log; Emergency: pause data collection |
| Consecutive task failures | 3 in a row | 5 in a row | Warning: operator alert; Emergency: suspend + escalate |
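The numeric thresholds above translate directly into a small classifier. The cutoffs below are copied from the table; the function shape is an illustrative sketch, and the consecutive-failure counter is stateful so it would be tracked separately:

```python
# metric: (warning cutoff, emergency cutoff, True if higher values are worse)
THRESHOLDS = {
    "joint_temp_c":    (70.0, 85.0, True),
    "task_success_7d": (0.80, 0.60, False),
    "battery_soc_pct": (20.0, 10.0, False),
    "latency_ms":      (200.0, 500.0, True),
    "frame_drop_pct":  (5.0, 20.0, True),
}

def classify(metric: str, value: float) -> str:
    """Map a metric sample to an alert tier: ok, warning, or emergency."""
    warn, emerg, higher_is_bad = THRESHOLDS[metric]
    if higher_is_bad:
        if value > emerg:
            return "emergency"
        if value > warn:
            return "warning"
    else:
        if value < emerg:
            return "emergency"
        if value < warn:
            return "warning"
    return "ok"
```

Keeping the thresholds in one table-like structure, rather than scattered through alert rules, makes it trivial to audit them against this document.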

OTA Update Process

Over-the-air updates are how you ship improvements and security fixes to deployed robots without site visits. A disciplined rollout process keeps updates from becoming incidents:

  • Build: Every update (firmware, software, or policy) is built in CI and produces a versioned artifact with a sha256 checksum. Artifacts are stored in a release registry (S3 bucket or Artifact Registry).
  • Staging test: Before any field deployment, the update is applied to 2–3 staging robots in the lab and validated with a 50-trial automated test suite.
  • Canary rollout (10%): Deploy to 10% of the fleet (or a minimum of 3 robots) for 48 hours. Monitor success rate, error codes, and telemetry for regressions.
  • Full rollout: If canary metrics are nominal, roll out to the remaining fleet. Stagger the rollout at 25% per hour to avoid simultaneous restarts causing fleet-wide downtime.
  • Rollback capability: Every robot maintains the previous version artifact locally. Rollback takes <2 minutes and can be triggered per-robot or fleet-wide from the management dashboard.
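Two of these steps reduce to checks worth pinning down in code: the canary sizing rule (10% of the fleet with a floor of 3 robots) and checksum verification before install. A minimal sketch:

```python
import hashlib
import math

def canary_size(fleet_size: int) -> int:
    """Canary cohort: 10% of the fleet, floor of 3 robots,
    capped at the fleet size itself for very small fleets."""
    return min(fleet_size, max(3, math.ceil(0.10 * fleet_size)))

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Refuse to install an artifact whose checksum does not match
    the one produced by CI; a failed check aborts the update."""
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

Verifying the checksum on-robot (not just in the release pipeline) also catches corruption introduced during download over flaky site networks.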

Fleet KPIs

| KPI | Definition | Target | Measurement Interval |
| --- | --- | --- | --- |
| MTBF (Mean Time Between Failures) | Average operating hours between unplanned stoppages | >200 hours | Monthly |
| MTTR (Mean Time to Repair) | Average time from incident detection to resumed operation | <2 hours | Monthly |
| Fleet uptime | % of scheduled operating hours spent in active operation | >95% | Weekly |
| Task completion rate | % of tasks completed successfully without human intervention | >90% | Daily |
| OTA update success rate | % of update deployments that succeed without rollback | >99% | Per-release |
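The first three KPIs are simple ratios over the reporting window. A sketch of the arithmetic, with made-up input numbers for illustration:

```python
def mtbf_hours(operating_hours: float, unplanned_stoppages: int) -> float:
    """Mean time between failures over the reporting window."""
    return operating_hours / max(unplanned_stoppages, 1)

def fleet_uptime_pct(active_hours: float, scheduled_hours: float) -> float:
    """Share of scheduled operating hours spent in active operation."""
    return 100.0 * active_hours / scheduled_hours

# 1200 fleet operating hours this month with 5 unplanned stoppages:
mtbf = mtbf_hours(operating_hours=1200, unplanned_stoppages=5)    # 240 h, meets >200 h
# 152 active hours out of 160 scheduled this week:
uptime = fleet_uptime_pct(active_hours=152, scheduled_hours=160)  # 95.0%
```

Note the denominators: MTBF divides by stoppage count, while uptime divides by scheduled (not wall-clock) hours, so a planned maintenance window does not count against either KPI.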