What Is ALOHA?
ALOHA stands for A Low-cost Open-source Hardware system for bimanual teleoperation. It was developed at Stanford University by Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn, and introduced in the 2023 paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" alongside ACT (Action Chunking with Transformers), the imitation learning algorithm designed to work with ALOHA data.
Before ALOHA, bimanual manipulation research required expensive industrial robots (two Franka Emika Pandas at $80,000+, or a Baxter at $25,000) and extensive custom engineering. ALOHA demonstrated that two consumer-grade robot arms in a leader-follower configuration could collect high-quality bimanual demonstration data that trains effective manipulation policies -- at roughly one-tenth the cost of prior systems.
The system has been replicated by dozens of labs worldwide and spawned several variants, most notably Mobile ALOHA (adding a wheeled base for whole-body teleoperation) and ALOHA 2 (Google DeepMind's improved version with better hardware tolerances). The original ALOHA hardware design, software, and training code are fully open source.
Why ALOHA Matters
ALOHA's significance is not the hardware itself -- it is fundamentally two off-the-shelf robot arms bolted to a table. Its impact comes from three contributions:
- Democratized bimanual data collection. The leader-follower design lets any human operator demonstrate bimanual tasks without programming. The operator holds the leader arms and performs the task naturally; the follower arms mirror the motion while cameras record observations. This produced the first large-scale bimanual demonstration datasets at a cost accessible to university labs.
- Action Chunking with Transformers (ACT). The ACT algorithm, introduced in the same paper, predicts chunks of 100 future actions at once rather than one action at a time. This dramatically reduces compounding errors that plagued prior imitation learning approaches, achieving 80-90% success rates on tasks where prior methods achieved 20-40%.
- Open-source reproducibility. Every component -- CAD files, BOM, ROS2 drivers, data recording scripts, ACT training code, and pre-trained models -- is publicly available. This enabled rapid adoption and iteration across the robotics community.
Stationary ALOHA vs. Mobile ALOHA
The ALOHA platform exists in two primary configurations. Understanding the distinction is essential before deciding what to build or buy.
| Feature | Stationary ALOHA | Mobile ALOHA |
|---|---|---|
| Base | Fixed table mount | AgileX Tracer wheeled base |
| Follower arms | 2x ViperX 300 S2 | 2x ViperX 300 S2 |
| Leader arms | 2x WidowX 250 S | 2x WidowX 250 S |
| Action space | 14-DOF (7 per arm) | 16-DOF (14 arm + 2 base velocity) |
| Cameras | 2 wrist + 1 overhead | 2 wrist + 1 overhead + 1 side (optional) |
| Hardware cost | $17,000-22,000 | $28,000-35,000 |
| Setup time | 2-4 weeks | 4-8 weeks |
| Best for | Tabletop bimanual tasks | Tasks requiring locomotion + manipulation |
For most teams, the stationary ALOHA is the right starting point. Mobile ALOHA adds $10,000-15,000 in cost and significant integration complexity in exchange for mobility. If your tasks happen at a single workstation -- assembly, packing, kitchen prep, lab manipulation -- stationary ALOHA (or an ALOHA-compatible alternative like the OpenArm DK1) is sufficient.
For a detailed cost breakdown of the mobile variant, see our Mobile ALOHA Cost Breakdown.
ALOHA System Architecture
Understanding ALOHA's architecture helps you evaluate whether to build one, buy an alternative, or outsource data collection entirely.
Hardware: Leader-Follower Teleoperation
ALOHA uses a leader-follower (also called master-slave) configuration. Each arm pair consists of:
- Follower arm (ViperX 300 S2): The robot arm that executes tasks and interacts with objects. It has 6 DOF plus a gripper (7 total), uses Dynamixel XM/XH series servos, and has a payload of 750g at full extension. During teleoperation, it mirrors the leader arm's joint positions at 50 Hz.
- Leader arm (WidowX 250 S): The arm held and moved by the human operator. It is lighter and shorter-reach than the follower, making it comfortable to manipulate for extended data collection sessions. Gravity compensation makes it nearly effortless to move: the servos support the arm's own weight, so the operator feels only the inertia of their own motion.
The leader and follower arms connect via USB to the same onboard computer through Dynamixel U2D2 adapters. The control loop reads joint positions from the leader arm at 50 Hz and sends matching position commands to the follower arm. The total end-to-end latency must stay below 10 ms for transparent teleoperation (the operator should not feel delay between their movements and the follower's response).
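The 50 Hz mirroring loop described above is conceptually simple. Below is a minimal sketch of one control tick; the hardware I/O functions are stubs standing in for Dynamixel bulk reads and sync writes (the real drivers live in the Interbotix ROS2 packages), so only the loop structure is meant literally:

```python
import time

CONTROL_HZ = 50          # leader-to-follower mirroring rate from the ALOHA design
DT = 1.0 / CONTROL_HZ

def read_leader_joints():
    """Stub for a Dynamixel bulk read of the leader arm's 7 joint positions (radians)."""
    return [0.0, -0.4, 0.6, 0.0, 1.1, 0.0, 0.02]

def command_follower_joints(positions):
    """Stub for a Dynamixel sync write of position goals to the follower arm."""
    command_follower_joints.last = list(positions)

def teleop_step():
    """One 50 Hz control tick: mirror leader joint positions onto the follower."""
    start = time.perf_counter()
    leader = read_leader_joints()
    command_follower_joints(leader)          # follower tracks leader 1:1 in joint space
    elapsed = time.perf_counter() - start
    time.sleep(max(0.0, DT - elapsed))       # hold the loop at ~50 Hz
    return leader
```

Because the mapping is pure joint-space mirroring (same kinematic family on both arms), there is no inverse kinematics in the loop, which is part of why the latency budget is achievable on a NUC.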
Camera System
ALOHA uses Intel RealSense depth cameras in three positions:
- Two wrist cameras (RealSense D405): Mounted on the follower arm wrists, providing close-range views of the manipulation workspace. The D405's 7 cm minimum depth range makes it suitable for near-field sensing during grasping.
- One overhead camera (RealSense D435): Mounted above the workspace, providing a top-down view for spatial context. This camera captures the relative positions of both arms and the workspace layout.
All three cameras must be temporally synchronized (within 10 ms) for the recorded data to train effective policies. The Intel Multi-Camera Sync Cable provides hardware-level synchronization. Without it, software timestamp alignment introduces 30-60 ms of jitter that degrades training performance by 10-20%.
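A quick way to audit synchronization in recorded data is to compare per-frame timestamps across cameras. The sketch below (synthetic timestamps, pure Python; the 10 ms threshold comes from the requirement above) computes the worst-case spread:

```python
def max_sync_offset_ms(*timestamp_streams):
    """Worst-case spread (ms) between corresponding frame timestamps across cameras.

    Each stream is a list of per-frame timestamps in seconds; streams are assumed
    frame-aligned (frame i of each camera belongs to the same capture instant).
    """
    worst = 0.0
    for stamps in zip(*timestamp_streams):
        spread = (max(stamps) - min(stamps)) * 1000.0
        worst = max(worst, spread)
    return worst

# Synthetic 30 fps timestamps: the overhead camera drifts 4 ms behind the wrists.
top   = [i / 30.0 + 0.004 for i in range(90)]
left  = [i / 30.0 for i in range(90)]
right = [i / 30.0 + 0.001 for i in range(90)]

offset = max_sync_offset_ms(top, left, right)
assert offset < 10.0, f"cameras out of sync by {offset:.1f} ms; check the sync cable"
```

Running this check on every recorded episode catches a loose or missing sync cable before it silently degrades a whole dataset.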
Compute Stack
The onboard computer (typically an Intel NUC or equivalent mini-PC) handles three tasks simultaneously:
- Real-time control: Reading leader joint positions and commanding follower joint positions at 50 Hz via Dynamixel servo bus.
- Camera capture: Recording synchronized RGB frames from three cameras at 30 fps.
- Data recording: Writing all sensor data (joint positions, velocities, camera frames, timestamps) to HDF5 files for later training.
No GPU is needed on the robot. Policy training happens offline on a separate workstation (RTX 4090 or cloud GPU). Policy deployment (inference) can run on the NUC's CPU for ACT models, which require less than 10 ms per forward pass.
Stationary ALOHA Cost Breakdown
The stationary ALOHA configuration (no mobile base) is significantly cheaper and faster to build.
| Component | Cost | Notes |
|---|---|---|
| 2x ViperX 300 S2 follower arms | $9,600 | $4,800 each, Dynamixel XM/XH servos |
| 2x WidowX 250 S leader arms | $6,200 | $3,100 each, lighter weight for operator comfort |
| 2x Custom gripper assemblies | $400-800 | Parallel-jaw with Dynamixel XL330 |
| 3x Intel RealSense cameras | $950 | 2x D405 wrist ($300 ea) + 1x D435 overhead ($350) |
| Intel NUC 13 Pro (onboard computer) | $800-1,200 | i7, 32GB RAM, 1TB NVMe |
| U2D2 adapters, USB hub, cables | $350-500 | 4x U2D2 ($50 ea), powered USB hub, cabling |
| Mounting hardware, table clamps | $200-400 | Camera mast, arm base plates, clamps |
| Stationary ALOHA total | $18,500-19,700 | Hardware only, before labor |
| Training workstation (RTX 4090) | $2,000-3,000 | Separate desktop for offline training |
| Complete system total | $20,500-22,700 | Excluding integration labor |
This represents the minimum cost for a fully functional stationary ALOHA. Add $2,000-5,000 for professional assembly if you are outsourcing the build. For the mobile variant, add $10,000-15,000 for the AgileX base, mounting frame, additional battery, and integration. See our detailed Mobile ALOHA cost breakdown for the full mobile BOM.
Software Stack
The ALOHA software stack has four layers, each open source and well-documented.
1. ROS2 Humble (Real-Time Control)
ROS2 Humble on Ubuntu 22.04 provides the communication layer. Key packages:
- Interbotix ROS2 packages: Drivers for the ViperX and WidowX arms. Provide joint state publishing and position command subscription at 50 Hz.
- RealSense ROS2 wrapper: Camera driver nodes publishing RGB and depth images at 30 fps.
- Teleoperation node: Reads leader arm joint positions and sends matching commands to follower arms at 50 Hz.
```bash
# Install ROS2 Humble and ALOHA dependencies
sudo apt install ros-humble-desktop python3-colcon-common-extensions

# Install Interbotix arm drivers
curl 'https://raw.githubusercontent.com/Interbotix/interbotix_ros_manipulators/main/interbotix_ros_xsarms/install/amd64/xsarm_amd64_install.sh' > xsarm_install.sh
chmod +x xsarm_install.sh && ./xsarm_install.sh -d humble

# Install RealSense drivers
sudo apt install ros-humble-librealsense2* ros-humble-realsense2-camera
```
2. ALOHA Repository (Teleoperation and Recording)
The official ALOHA GitHub repository provides scripts for leader-follower teleoperation and synchronized data recording. It handles the critical task of aligning joint state data (50 Hz) with camera frames (30 fps) into timestamped HDF5 episodes. Each episode contains:
- /action: shape (T, 14) -- commanded joint positions (7 per arm)
- /observations/qpos: shape (T, 14) -- measured joint positions
- /observations/images/top: shape (T, 480, 640, 3) -- overhead camera
- /observations/images/left_wrist: shape (T, 480, 640, 3)
- /observations/images/right_wrist: shape (T, 480, 640, 3)
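It is worth validating this layout before committing GPU hours to training. The shape-only checker below is a sketch in pure Python; in practice you would build the `shapes` dict by opening the episode file with h5py and reading each dataset's `.shape`:

```python
def validate_episode(shapes, num_joints=14, img_hw=(480, 640)):
    """Check that an episode's array shapes follow the ALOHA HDF5 layout.

    `shapes` maps dataset paths to shape tuples (e.g. collected from an
    open h5py.File). Returns the episode length T on success.
    """
    T = shapes["/action"][0]
    assert shapes["/action"] == (T, num_joints), "action must be (T, 14)"
    assert shapes["/observations/qpos"] == (T, num_joints), "qpos must be (T, 14)"
    for cam in ("top", "left_wrist", "right_wrist"):
        key = f"/observations/images/{cam}"
        assert shapes[key] == (T, *img_hw, 3), f"{key} must be (T, 480, 640, 3)"
    return T

# A 400-step episode: 8 seconds of joint states at 50 Hz
T = validate_episode({
    "/action": (400, 14),
    "/observations/qpos": (400, 14),
    "/observations/images/top": (400, 480, 640, 3),
    "/observations/images/left_wrist": (400, 480, 640, 3),
    "/observations/images/right_wrist": (400, 480, 640, 3),
})
```

A mismatched shape here usually traces back to a dropped camera stream or a recording script misconfiguration, both cheaper to find now than after a failed training run.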
3. ACT Training Code (Offline Policy Learning)
ACT (Action Chunking with Transformers) is the default training algorithm for ALOHA data. Key architecture details:
- Encoder: Processes current observations (joint positions + camera images) through a ResNet-18 image encoder and a joint state MLP.
- Decoder: A transformer decoder with a CVAE (Conditional Variational Autoencoder) that predicts a chunk of 100 future actions at once.
- Temporal ensemble: During deployment, overlapping action chunks are averaged with exponential weighting, smoothing the transitions between successive predictions.
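The temporal ensemble can be sketched in a few lines. This is a deliberately minimal 1-D illustration, assuming the ACT paper's exp(-m*i) weighting with i = 0 as the oldest prediction; the real implementation operates on 14-dimensional action vectors:

```python
import math

def temporal_ensemble(chunk_predictions, t, m=0.01):
    """Exponentially weighted average of every chunk's prediction for timestep t.

    chunk_predictions: {start_step: [a_0, a_1, ...]} where each chunk predicts
    actions for steps start_step, start_step + 1, ... (scalar actions here
    for illustration).
    """
    preds = []
    for start in sorted(chunk_predictions):        # oldest chunk first
        offset = t - start
        chunk = chunk_predictions[start]
        if 0 <= offset < len(chunk):
            preds.append(chunk[offset])
    # w_i = exp(-m * i), with i = 0 the oldest prediction for this timestep
    weights = [math.exp(-m * i) for i in range(len(preds))]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Two overlapping chunks disagree about step 1; the ensemble blends them.
out = temporal_ensemble({0: [1.0, 1.0], 1: [3.0, 3.0]}, t=1)
```

Because every executed action averages several independent predictions, a single bad chunk cannot jerk the arm, which is the smoothing behavior described above.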
```bash
# Train ACT policy on ALOHA data using LeRobot
python lerobot/scripts/train.py \
    policy=act \
    dataset_repo_id=your-username/aloha-task \
    env=aloha \
    training.num_epochs=2000 \
    training.batch_size=8 \
    policy.chunk_size=100 \
    policy.kl_weight=10
```
4. LeRobot (Modern Integration Layer)
LeRobot from Hugging Face has become the standard integration layer for ALOHA-class hardware. It provides standardized data recording, dataset management (push to HuggingFace Hub), training pipelines for ACT/Diffusion Policy/TDMPC2, and deployment scripts. For new ALOHA builds, we recommend using LeRobot from the start rather than the original ALOHA repository scripts.
Setup Guide Overview
Building and configuring an ALOHA system involves five phases. For detailed step-by-step instructions, see our Mobile ALOHA Setup Guide (which covers both stationary and mobile configurations).
Phase 1: Hardware Assembly (3-5 Days)
Mount follower arms to the table or frame, install leader arms at operator height, assemble grippers, mount cameras, and connect all electronics. Critical details: arm base plates must be co-planar to within 1 mm; camera mounting must be rigid (any flex produces inconsistent data). Torque all bolts to specification and apply threadlocker to vibration-prone joints.
Phase 2: ROS2 Setup and Calibration (3-7 Days)
Install Ubuntu 22.04, ROS2 Humble, Interbotix drivers, and RealSense drivers. Configure Dynamixel servo parameters: set baud rate to 1M, calibrate zero positions for each joint, set PID gains, and configure gravity compensation on leader arms. Verify USB device enumeration order (this is a common source of bugs when devices are reassigned after reboot).
Phase 3: First Teleoperation Test (1-2 Days)
Run the teleoperation node and verify that both arm pairs track correctly. Check: leader-to-follower latency (must be under 10 ms), tracking accuracy (follower should match leader position to within 2 degrees), gripper synchronization (open/close states match), and camera feeds (all three cameras producing clean frames at 30 fps). Fix any issues before proceeding to data collection.
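The 2-degree tracking check is easy to automate from a logged run. The sketch below assumes you have recorded synchronized leader and follower joint positions (radians) during a test session; the logging mechanism itself is not shown:

```python
import math

def max_tracking_error_deg(leader_log, follower_log):
    """Largest per-joint leader/follower discrepancy (degrees) over a logged run.

    Both logs are lists of 7-element joint-position vectors in radians,
    sampled on the same 50 Hz control ticks.
    """
    worst = 0.0
    for lead, follow in zip(leader_log, follower_log):
        for l, f in zip(lead, follow):
            worst = max(worst, math.degrees(abs(l - f)))
    return worst

# Synthetic log: follower lags every joint by 0.01 rad (about 0.57 degrees)
leader   = [[0.1 * t + j * 0.01 for j in range(7)] for t in range(100)]
follower = [[p - 0.01 for p in frame] for frame in leader]

err = max_tracking_error_deg(leader, follower)
assert err < 2.0, f"tracking error {err:.2f} deg exceeds the 2-degree budget"
```

If this check fails, the usual culprits are servo PID gains, a loose coupler on one joint, or USB bus contention delaying commands.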
Phase 4: Recording Demonstrations (Ongoing)
Define your task, configure the recording script, and begin collecting demonstrations. Start with a simple validation task (pick and place) to verify the full pipeline before investing time in complex tasks. Collect 50 demonstrations for the validation task, train a policy, and evaluate. Target: 60%+ success rate. If below 40%, debug hardware/calibration before collecting more data.
Phase 5: Training ACT Policy (4-8 Hours per Training Run)
Push your dataset to HuggingFace Hub (or train locally), configure ACT hyperparameters, and train. Key hyperparameters: chunk_size=100, kl_weight=10, batch_size=8, num_epochs=2000. Watch validation loss -- training is done when loss plateaus for 200+ epochs. Deploy the trained policy back on the robot for evaluation.
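The "plateaus for 200+ epochs" stopping rule can be made mechanical. This is a small early-stopping sketch over a validation-loss history (the min_delta threshold is an assumption, tune it to your loss scale):

```python
def training_done(val_losses, patience=200, min_delta=1e-4):
    """True once validation loss has not improved by min_delta for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_recent = min(val_losses[-patience:])
    best_before = min(val_losses[:-patience])
    # Done if the recent window never beat the earlier best by a meaningful margin
    return best_recent > best_before - min_delta

# Loss falls for 300 epochs, then flattens out for 250 more
losses = [1.0 / (1 + e) for e in range(300)] + [1.0 / 301] * 250
```

Checking this each epoch avoids both wasted GPU hours after convergence and the opposite failure of stopping while the CVAE is still improving.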
Performance Benchmarks
These success rates are from the original ACT paper (Zhao et al., 2023) and subsequent ALOHA publications. Your results will vary based on hardware calibration quality, number of demonstrations, and operator skill.
| Task | Success Rate | Demos Used | Notes |
|---|---|---|---|
| Slot insertion (bimanual) | 91% | 50 | One arm holds socket, other inserts battery |
| Fork retrieval from pot | 82% | 50 | Requires precise grasp in confined space |
| Coffee making (multi-step) | 62% | 50 | Long-horizon, 5+ sequential sub-tasks |
| Table bussing (mobile) | 85% | 50 mobile + 370 static | Mobile ALOHA with co-training |
| Door opening (mobile) | 76% | 50 mobile + 370 static | Whole-body coordination required |
| Object handover to human (mobile) | 68% | 50 mobile + 370 static | Requires detecting human presence and timing |
Key insight from the benchmarks: Co-training matters enormously for mobile tasks. The mobile tasks above used only 50 mobile demonstrations but were co-trained with 370 static bimanual demonstrations. Without co-training, the same 50 mobile demonstrations achieved 20-40% lower success rates. This means you can collect most of your data on a cheaper stationary setup and add a small amount of mobile data to get mobile policies.
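In practice, co-training just means drawing training batches from both datasets. A minimal sampler sketch follows; the 50/50 mixing ratio is an assumption for illustration (Mobile ALOHA trains on the combined mobile and static data), and the episode lists are placeholders:

```python
import random

def cotrain_sampler(mobile_eps, static_eps, p_mobile=0.5, seed=0):
    """Yield episodes for co-training: pick the mobile pool with probability
    p_mobile, the static bimanual pool otherwise."""
    rng = random.Random(seed)
    while True:
        pool = mobile_eps if rng.random() < p_mobile else static_eps
        yield rng.choice(pool)

sampler = cotrain_sampler(["mobile_ep1", "mobile_ep2"],
                          ["static_ep1", "static_ep2", "static_ep3"])
```

Oversampling the small mobile set this way keeps the base-velocity dimensions well represented even though static episodes dominate the raw data.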
ALOHA Alternatives Comparison
Several systems offer ALOHA-like bimanual teleoperation capability at different price points and with different tradeoffs.
| System | Cost | Key Advantage | Key Limitation |
|---|---|---|---|
| Stationary ALOHA (DIY) | $20,500-22,700 | Original platform, large community | Requires DIY assembly and ROS2 expertise |
| Mobile ALOHA (DIY) | $28,000-35,000 | Mobility + bimanual, co-training data | Complex build, high total cost |
| OpenArm DK1 | $12,000 | Ships assembled, higher-torque servos, warranty | Stationary only, smaller community |
| OpenArm 101 | $4,500 | Lowest-cost entry to robot learning | Single arm only (no bimanual) |
| ALOHA 2 (Google DeepMind) | N/A | Improved hardware, better tolerances | Not commercially available |
| UMI (Universal Manipulation Interface) | $2,000-3,000 | No robot needed for data collection | Single-arm, camera-only (no depth/force) |
| SVRC DK1 Lease | $2,500/mo | No upfront cost, includes maintenance | Monthly recurring cost |
| SVRC Data Services | $2,500 pilot | No hardware at all, expert operators | Limited to SVRC's task environments |
Why Consider the OpenArm DK1 Over ALOHA?
The OpenArm DK1 ($12,000) was designed specifically as an ALOHA-compatible alternative that eliminates the DIY build process. Key advantages over a self-built ALOHA:
- Ships assembled and calibrated. No 4-8 weeks of build time. Collect your first demonstration on day one.
- Higher-torque servos. OpenArm's custom actuators provide 30% more torque than the Dynamixel XM430 servos in the ViperX arms, with better thermal management that reduces overheating-related downtime.
- ALOHA-compatible data format. Records the same 14-DOF action space and HDF5 episode format as ALOHA. Policies trained on DK1 data can be fine-tuned on ALOHA data and vice versa.
- 90-day warranty and support. Includes replacement parts and technical support. A DIY ALOHA build has no warranty.
- 44% lower cost. $12,000 vs $20,500-22,700 for a comparable stationary ALOHA.
How SVRC Helps with ALOHA Projects
Whether you are building an ALOHA from scratch, looking for an alternative, or just need bimanual data, SVRC provides multiple paths to get you results.
Pre-Built Hardware
The OpenArm DK1 ($12,000) ships fully assembled and calibrated, ready for data collection on day one. For single-arm projects, the OpenArm 101 ($4,500) provides an even lower entry point. Both systems integrate with LeRobot and produce ALOHA-compatible data formats.
Expert Operator Data Collection
SVRC's data collection service provides professionally collected bimanual demonstration datasets. Our operators are trained in ALOHA-style teleoperation with consistent technique that produces high-quality training data. We deliver datasets in HDF5, LeRobot, or RLDS format -- compatible with ACT, Diffusion Policy, and other imitation learning algorithms.
- Pilot project: $2,500 -- 50-100 demonstrations for a single task, delivered in 2 weeks
- Full campaign: $8,000 -- 200-500 demonstrations across multiple tasks, with annotation and quality validation
Robot Leasing
SVRC's robot leasing program provides fully assembled, calibrated bimanual systems on monthly lease terms. Lease a DK1 for $2,500/month (includes maintenance and support) and collect data in your own environment. This eliminates upfront capital investment and lets you start immediately.
Lab Access
SVRC operates equipped robotics labs at Mountain View, CA and Allston, MA. If you need occasional access to ALOHA-class hardware for data collection, prototyping, or evaluation, contact us about lab access arrangements.
Frequently Asked Questions
What is the difference between ALOHA and ACT?
ALOHA is the hardware platform (robot arms, cameras, teleoperation setup). ACT (Action Chunking with Transformers) is the software algorithm that trains manipulation policies from ALOHA data. They were introduced in the same paper but are independent: you can collect ALOHA data and train it with Diffusion Policy instead of ACT, or you can train ACT on data from non-ALOHA hardware. See our ACT guide for details on the algorithm.
Can I use ALOHA for tasks beyond manipulation?
ALOHA is specifically designed for bimanual manipulation. Its arms have limited payload (750 g at full extension) and reach (750 mm), making it unsuitable for heavy lifting, high-precision assembly (tolerances tighter than 2 mm), or tasks requiring force control beyond the servos' current-sensing capability. For dexterous manipulation requiring individual finger control, see the Orca Hand or Paxini Gen3 tactile gloves.
How many demonstrations do I need?
For a simple bimanual task with limited variability (fixed object positions, single approach strategy): 50 demonstrations can achieve 60-80% success rate with ACT. For tasks with high variability (random object positions, multiple valid approaches): 200-500 demonstrations are needed for robust performance. For multi-step tasks (5+ sequential sub-tasks): 300-1,000 demonstrations. Our cost per demonstration analysis provides detailed guidance.
Is ALOHA still relevant in 2026?
Yes. Despite newer platforms, ALOHA remains the most widely replicated and best-documented bimanual manipulation system. The large community means more shared datasets, pre-trained models, and troubleshooting resources than any alternative. The ALOHA data format (14-DOF joint positions + multi-view images) has become a de facto standard for bimanual learning, and most new algorithms are benchmarked on ALOHA tasks.
Can I add dexterous hands to ALOHA?
Yes, though it requires significant modification. The standard ALOHA gripper is a simple 1-DOF parallel jaw. Replacing it with a multi-finger hand (such as the Orca Hand, 17-DOF) dramatically increases the action space dimension and the complexity of teleoperation. You also need a different teleoperation interface -- the standard leader-follower approach does not extend naturally to dexterous hands. Glove-based teleoperation (using Paxini Gen3 tactile gloves) is the current best practice for dexterous hand control.
Where can I find ALOHA datasets to test my algorithms?
Several public ALOHA datasets are available on HuggingFace Hub: the original ACT paper datasets (slot insertion, fork retrieval, coffee making), Mobile ALOHA datasets (table bussing, door opening), and community-contributed datasets for various tasks. SVRC also maintains curated datasets on our datasets page.