Mobile ALOHA Setup Guide: Hardware, Software, and First Demo
Mobile ALOHA is one of the most influential bimanual manipulation platforms to emerge from academic research. Getting it running end-to-end — hardware assembled, arms calibrated, software stack live, and first demonstration recorded — takes careful attention to each layer of the system.
Hardware Assembly Overview
A Mobile ALOHA system consists of a wheeled mobile base (typically an AgileX Tracer or equivalent differential-drive platform) with two ViperX 300 or similar 6-DOF arms mounted on a raised chassis. The bimanual setup requires matching pairs of leader and follower arms: leader arms are lighter, back-drivable, and held by the human operator during teleoperation; follower arms are the robot arms that mirror the leader motions in real time.
Assembly begins with mounting the follower arms to the chassis at the correct height and lateral offset to match the leader arm ergonomics. A mismatch between leader and follower geometry is a common source of control quality issues. The camera stack — typically a wrist-mounted camera on each follower arm plus one or two overhead cameras — should be installed and secured before any software calibration begins. Cable management matters more than it might seem: loose cables interrupt episodes and generate bad data.
Leader-Follower Calibration
Calibration is the step most teams rush and most teams regret. The leader and follower arms must be in matching joint-zero positions before you record a single episode. Most ViperX-based setups ship with physical calibration fixtures — use them. After mechanical zeroing, software calibration captures the joint offset between leader and follower at the zero pose and stores it as a bias correction applied in real time during teleoperation.
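The bias-correction step above can be sketched in a few lines. This is a minimal illustration, not the vendor's calibration routine: the helper names and the six-joint layout are assumptions, and real setups read these values from the arm encoders.

```python
import numpy as np

def capture_joint_bias(leader_zero, follower_zero):
    """At the mechanical zero pose, record the per-joint offset
    between leader and follower encoder readings (hypothetical helper)."""
    return np.asarray(follower_zero) - np.asarray(leader_zero)

def corrected_command(leader_joints, bias):
    """Apply the stored bias to every leader reading so the follower
    target compensates for the encoder offset in real time."""
    return np.asarray(leader_joints) + bias

# Example: follower encoders read small offsets at the zero pose.
leader_zero = np.zeros(6)
follower_zero = np.array([0.01, -0.02, 0.0, 0.005, 0.0, -0.01])
bias = capture_joint_bias(leader_zero, follower_zero)

# During teleoperation, each leader pose is shifted by the bias.
target = corrected_command(np.array([0.5, 0.1, -0.3, 0.0, 0.2, 0.0]), bias)
```

The correction is a constant joint-space offset, which is why it must be re-captured after any mechanical change: the offset it encodes is specific to how the arms were zeroed.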
Test calibration quality by commanding the leader arms slowly through their workspace and watching the follower arms track. Any persistent joint-space lag, drift at specific joint angles, or asymmetric response between left and right indicates a calibration error that will degrade your dataset. Re-calibrate before beginning any data collection campaign, and re-verify calibration after shipping the system or making mechanical adjustments.
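The tracking check described above can be automated with a simple error report over logged joint trajectories. A sketch, assuming you log commanded leader poses and observed follower poses at matching timestamps; the 0.02 rad tolerance is an illustrative threshold, not a vendor specification.

```python
import numpy as np

def tracking_report(leader_log, follower_log):
    """Mean absolute per-joint error between commanded leader poses
    and observed follower poses over a slow workspace sweep."""
    err = np.abs(np.asarray(leader_log) - np.asarray(follower_log))
    return err.mean(axis=0)

# Simulated sweep: joint 2 carries a persistent 0.05 rad offset,
# the signature of a calibration error on that joint.
t = np.linspace(0, 1, 100)
leader = np.stack([np.sin(t * (i + 1)) for i in range(6)], axis=1)
follower = leader.copy()
follower[:, 2] += 0.05

mean_err = tracking_report(leader, follower)
bad_joints = np.where(mean_err > 0.02)[0]  # joints needing re-calibration
```

A persistent offset on one joint shows up directly in the per-joint mean; lag or asymmetric response shows up as elevated error that grows with sweep speed, so run the check slowly first.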
Software Stack: ACT and LeRobot
The original Mobile ALOHA paper used the ACT (Action Chunking with Transformers) policy trained on demonstration data. The software stack comprises three layers: a low-level control layer running on the robot's embedded compute, a teleoperation recording layer that captures joint states and camera frames synchronously, and a training layer where ACT or another policy is trained on the collected dataset.
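The teleoperation recording layer's core job — capturing joint states and camera frames on a shared clock at a fixed rate — can be sketched as a simple loop. This is a schematic, not the actual control stack: `read_joints` and `read_cameras` stand in for hardware callbacks, and real systems use tighter scheduling than `time.sleep`.

```python
import time
from dataclasses import dataclass

@dataclass
class Frame:
    t: float          # shared capture timestamp
    joints: list      # follower joint positions at time t
    images: dict      # camera name -> frame captured near time t

def record_episode(read_joints, read_cameras, hz=30, duration_s=1.0):
    """Capture joint states and camera frames at a fixed rate,
    stamping both with the same monotonic clock."""
    frames, period = [], 1.0 / hz
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        t0 = time.monotonic()
        frames.append(Frame(t=t0, joints=read_joints(), images=read_cameras()))
        # Sleep off the remainder of the control period.
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
    return frames

# Demo with stub callbacks standing in for real hardware reads.
episode = record_episode(lambda: [0.0] * 6,
                         lambda: {"wrist_left": None},
                         hz=30, duration_s=0.2)
```

Stamping every frame with one clock is what makes the dataset usable downstream: the training layer can then align observations and actions by timestamp rather than trusting frame indices.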
LeRobot from Hugging Face has become the standard open-source framework for this workflow. It provides a unified data format, recording scripts for ALOHA-style hardware, and training pipelines for ACT, Diffusion Policy, and TDMPC. SVRC's data platform exports datasets in LeRobot-compatible format, making it straightforward to train on SVRC-collected data or to upload your own demonstrations for storage and versioning.
Recording Your First Data Collection Session
Before recording, define the task precisely. "Pick up the cup" is too vague — specify the cup's starting location, orientation, and target placement. Consistency in task setup is what makes demonstration datasets learnable. Prepare 3–5 reset procedures to quickly return the workspace to the starting state between episodes.
For a first session, aim for 50 successful demonstrations of a single, cleanly defined task. Record at 30 Hz or higher. Annotate each episode with a success flag immediately after recording — do not leave annotation for later. SVRC recommends recording in at least two different lighting conditions and with minor variations in object placement to build in diversity from the start. The SVRC data services platform provides an episode browser and annotation tools to streamline this workflow.
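The annotate-immediately habit is easy to enforce in code. A minimal sketch, assuming a one-directory-per-episode layout; the directory structure and `annotation.json` filename are illustrative conventions, not the SVRC platform's format.

```python
import json
import tempfile
from pathlib import Path

def annotate_episode(episode_dir, success, notes=""):
    """Write the success flag beside the episode data the moment
    recording ends, so no episode is left unlabeled."""
    meta = {"success": bool(success), "notes": notes}
    path = Path(episode_dir) / "annotation.json"
    path.write_text(json.dumps(meta, indent=2))
    return path

# Example: annotate a freshly recorded episode in a scratch directory.
episode_dir = tempfile.mkdtemp(prefix="episode_0001_")
annotation_path = annotate_episode(episode_dir, success=True,
                                   notes="clean grasp, minor base wobble")
```

Calling this from the same script that closes the recording loop means the success judgment is made while the episode is still fresh in the operator's mind.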
Common Issues and How to Fix Them
The most frequent problems with new Mobile ALOHA setups fall into four categories. First, leader-follower lag: usually caused by network latency on the control loop — ensure leader and follower are on the same local machine or connected via a dedicated Ethernet link, not WiFi. Second, camera synchronization drift: if wrist and overhead cameras are not hardware-synced, use timestamp-based alignment during data loading rather than frame index alignment. Third, arm collision during bimanual tasks: add soft joint limits and collision meshes in the URDF before intensive training. Fourth, base motion interfering with arm demonstrations: when collecting manipulation-only data, engage the base lock to prevent drift.
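The timestamp-based alignment fix for camera drift can be sketched as a nearest-neighbor match between the joint-state clock and each camera's frame times. A minimal illustration under the assumption that both streams carry timestamps from a shared clock:

```python
import numpy as np

def align_by_timestamp(ref_ts, cam_ts):
    """Match each reference timestamp (e.g. the joint-state clock)
    to the nearest camera frame, instead of assuming frame i in one
    stream pairs with frame i in another stream that drifts."""
    ref_ts, cam_ts = np.asarray(ref_ts), np.asarray(cam_ts)
    idx = np.clip(np.searchsorted(cam_ts, ref_ts), 1, len(cam_ts) - 1)
    prev = idx - 1
    # Pick whichever neighboring frame is closer in time.
    use_prev = (ref_ts - cam_ts[prev]) < (cam_ts[idx] - ref_ts)
    return np.where(use_prev, prev, idx)

# A camera running slightly slow: frame times fall behind the 30 Hz clock.
joint_ts = np.array([0.000, 0.033, 0.066])
camera_ts = np.array([0.000, 0.030, 0.062, 0.095])
matched = align_by_timestamp(joint_ts, camera_ts)  # nearest frames: [0, 1, 2]
```

Applying this during data loading keeps the saved files untouched; only the pairing between streams is corrected, which is why it works even on data recorded before the drift was noticed.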
Next Steps After Your First Demo
Once you have a clean 50-episode dataset, use the LeRobot training pipeline to train an ACT policy. Expect first-attempt success rates of 40–60% on a well-defined task with clean data — this is normal and improves rapidly with more demonstrations and data diversity. As you scale, SVRC's data collection services can augment your dataset with professionally collected episodes using standardized hardware. For hardware sourcing or to lease a bimanual system, visit our hardware catalog or contact the SVRC team.