ALOHA Robot: What It Is, How It Works, and How to Get Started
ALOHA is a bimanual teleoperation platform from Stanford University that demonstrated a robot could learn dexterous two-handed manipulation tasks, like opening a bag of chips, tying a cable, or cooking, from a small number of human demonstrations. It has since become one of the most widely referenced bimanual research platforms. This guide explains what ALOHA is, how it works, and how to start using it.
The Stanford Origin Story
ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) was developed at Stanford and published in the paper "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" by Tony Z. Zhao et al. in 2023. The central thesis was provocative: you do not need expensive, proprietary robot hardware to perform impressive dexterous manipulation. ALOHA used four Trossen Robotics arms, two WidowX 250 leaders that the operator moves by hand and two ViperX 300 followers that execute the motion (one leader-follower pair per side), for a total system cost under $20,000. Combined with the ACT algorithm, this setup performed tasks that had previously required custom-engineered systems costing many times more.
The paper demonstrated a suite of bimanual tasks including unwrapping a piece of candy, inserting a battery into a slot, and threading a velcro cable tie, achieving high success rates from only around 50 demonstrations per task. These results shocked the robotics community not because the tasks were novel, but because of the cost and data efficiency. ALOHA and ACT together established a new benchmark for accessible dexterous manipulation research and triggered a wave of follow-on work that continues today.
The ALOHA hardware design and all software are fully open-source. The bill of materials, assembly instructions, and ACT training code are publicly available on GitHub. This openness has made ALOHA the de facto standard bimanual research platform, with dozens of research groups worldwide running variants of the original design. SVRC supports ALOHA-class platforms through our data services and hardware leasing program.
Hardware Architecture: Bimanual Leader-Follower Setup
The ALOHA system consists of two kinematic pairs, one for each arm. Each pair has a "leader" arm, a lightweight, back-drivable arm that the operator holds and moves with their hands, and a "follower" arm that mirrors the leader's joint positions in real time. The follower arm carries the actual manipulator (gripper, tool, or end-effector) and interacts with the physical world. The leader arm carries no end-effector payload; it only needs to be lightweight and back-drivable so the operator can move it freely, since the original ALOHA teleoperation is unilateral joint-position control rather than force-reflecting teleoperation.
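The leader-follower mirroring described above can be sketched as a simple control loop: read the leader's joint positions and command the follower to the same positions at a fixed rate. This is a minimal illustration, not the actual ALOHA driver code; `read_leader_joints` and `set_follower_joints` are hypothetical stand-ins for the real servo APIs.

```python
import time

CONTROL_HZ = 50  # joint-position commands streamed at a fixed rate

def mirror_step(read_leader_joints, set_follower_joints):
    """Copy the leader's joint positions to the follower once."""
    joints = read_leader_joints()   # e.g. 6 arm joints + 1 gripper per arm
    set_follower_joints(joints)     # follower tracks via position control
    return joints

def teleop_loop(read_leader_joints, set_follower_joints, steps):
    """Run the mirroring loop for a fixed number of control ticks."""
    period = 1.0 / CONTROL_HZ
    log = []
    for _ in range(steps):
        log.append(mirror_step(read_leader_joints, set_follower_joints))
        time.sleep(period)
    return log
```

In the real system one such loop runs per leader-follower pair, so the two arms are mirrored independently and simultaneously.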
The bimanual configuration — two complete leader-follower pairs — is what makes ALOHA uniquely capable for dexterous tasks. Human hands are bimanual by nature: one hand holds the object while the other manipulates it, or both hands cooperate to complete a task that requires two simultaneous contact points. Single-arm robots can only approximate these tasks with complex fixtures or sequencing; bimanual robots can handle them directly. The ALOHA form factor, with both arms mounted on a shared table fixture, is optimized for tabletop manipulation tasks where the operator sits in front of the system.
The camera setup in the original ALOHA paper used three cameras: one overhead (bird's-eye view of the full workspace), one on the left wrist, and one on the right wrist. All three cameras are used as visual observations for the ACT policy. This multi-view setup is critical: the wrist cameras provide close-up views of grasping and contact events, while the overhead camera provides global context for two-handed coordination. Single-camera ALOHA variants show measurably lower policy performance on coordination-heavy tasks.
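To make the three-camera observation concrete, here is a hedged sketch of what a single ACT-style observation might look like. The camera names and 480x640 RGB resolution follow common ALOHA conventions, but treat the exact schema as an assumption for illustration.

```python
import numpy as np

def make_observation():
    """Build a dummy per-timestep observation for a three-camera ALOHA setup."""
    cams = ["cam_high", "cam_left_wrist", "cam_right_wrist"]
    return {
        # one RGB image per camera view (zeros stand in for real frames)
        "images": {name: np.zeros((480, 640, 3), dtype=np.uint8) for name in cams},
        # proprioception: 2 arms x (6 joints + 1 gripper) = 14 values
        "qpos": np.zeros(14, dtype=np.float32),
    }

obs = make_observation()
```

The policy consumes all three views plus joint positions at every timestep, which is why dropping a camera degrades coordination-heavy tasks.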
ACT: The Algorithm Behind ALOHA
ACT (Action Chunking with Transformers) was developed alongside ALOHA and is the primary learning algorithm for the platform. ACT is a transformer-based imitation learning policy that predicts a chunk of future joint positions — typically 100 timesteps at 50Hz, covering 2 seconds of motion — rather than a single next action. This action chunking architecture substantially reduces the compounding error problem of naive behavioral cloning, where small prediction mistakes at each timestep accumulate into large trajectory deviations over the course of a task.
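The chunking idea, together with the temporal ensembling the ACT paper uses at test time, can be sketched in a few lines: query the policy every step for a full chunk, and average the overlapping predictions for the current step with exponential weights. The `policy` callable here is a stand-in for the real ACT model; the weighting scheme follows the paper, but the code is illustrative.

```python
import numpy as np

CHUNK = 100   # actions predicted per query (2 s at 50 Hz)
ACT_DIM = 14  # bimanual joint-position targets

def rollout(policy, obs_stream, horizon, m=0.01):
    """Execute with action chunking + temporal ensembling."""
    # buffer[t] collects every chunk prediction that covers timestep t
    buffer = [[] for _ in range(horizon + CHUNK)]
    executed = []
    for t, obs in zip(range(horizon), obs_stream):
        chunk = policy(obs)                  # shape (CHUNK, ACT_DIM)
        for i in range(CHUNK):
            buffer[t + i].append(chunk[i])
        preds = np.stack(buffer[t])          # all predictions for step t
        # older predictions get higher weight: w_i = exp(-m * i)
        w = np.exp(-m * np.arange(len(preds)))
        executed.append((w[:, None] * preds).sum(0) / w.sum())
    return np.stack(executed)
```

Because each step's action blends many overlapping chunk predictions, single-step prediction errors are smoothed out instead of compounding.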
The ACT policy architecture uses a CVAE (Conditional Variational Autoencoder) encoder during training to capture the latent style of each demonstration — essentially, a compressed representation of "how" the human completed the task, distinct from "what" the task outcome was. This enables the policy to model the natural variation in human demonstrations without mode-averaging artifacts. At inference time, the encoder is discarded and only the decoder runs, conditioned on the current observation with the latent variable fixed to the mean of the prior (zero), producing action chunks deterministically.
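The asymmetry between training and inference comes down to how the latent is produced. A minimal sketch, assuming the standard reparameterization trick during training and the ACT convention of a zero latent at test time (the toy shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Training: sample z = mu + sigma * eps so gradients flow through mu/sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def latent_for_inference(latent_dim=32):
    """Test time: encoder is dropped; z is fixed to the prior mean (zero)."""
    return np.zeros(latent_dim)
```

Fixing z to zero at inference makes the deployed policy deterministic, which simplifies evaluation and real-robot execution.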
Training ACT on an ALOHA dataset with 50 demonstrations per task takes 2–4 hours on a single RTX 3090 GPU. The training code, released with the original paper, is straightforward to run with documented hyperparameters for standard ALOHA tasks. For custom tasks, the most impactful hyperparameter to tune is the chunk size (chunk_size in the config): larger chunks improve temporal consistency at the cost of reactivity to unexpected perturbations. The kl_weight parameter separately controls the strength of the CVAE regularizer. SVRC's platform includes pre-configured ACT training pipelines for ALOHA-format datasets.
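As a starting point, a custom-task config might look like the following. The key names (chunk_size, kl_weight, camera_names, etc.) follow the public ACT training code, but the specific values are illustrative defaults to tune from, not guaranteed-optimal settings.

```python
# Illustrative ACT hyperparameters for a custom ALOHA task (values are
# starting points, not tuned settings for any particular task).
config = {
    "chunk_size": 100,     # actions per predicted chunk; the main knob to tune
    "kl_weight": 10,       # weight on the CVAE KL regularizer
    "hidden_dim": 512,     # transformer hidden dimension
    "batch_size": 8,
    "num_epochs": 2000,
    "lr": 1e-5,
    "camera_names": ["cam_high", "cam_left_wrist", "cam_right_wrist"],
}
```

A reasonable workflow is to sweep chunk_size first (e.g. 50 vs. 100) and only then adjust kl_weight if the policy looks over- or under-regularized.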
Mobile ALOHA: Taking ALOHA Off the Table
Mobile ALOHA, published by the same Stanford group in early 2024, extended the ALOHA concept to a mobile base. The bimanual arm setup was mounted on an AgileX Tracer mobile base, enabling the system to navigate to different locations within a space — approaching a kitchen counter, moving to a dining table, navigating a hallway — while retaining the ALOHA arms for manipulation. Mobile ALOHA demonstrated tasks like cooking shrimp on a stove, wiping spilled wine off a counter, and calling an elevator — tasks that require both locomotion and dexterous manipulation.
Mobile ALOHA introduced the concept of whole-body teleoperation: the operator controls both the mobile base and the two arms simultaneously, either through separate control interfaces or through a unified interface that maps the operator's body movements to the robot's whole-body configuration. Data collection for Mobile ALOHA is significantly more complex than tabletop ALOHA because the policy must learn to coordinate navigation and manipulation, requiring demonstrations that cover spatial variation in the environment as well as object variation.
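One way to picture whole-body control is as a single concatenated action vector: the 14 bimanual joint targets plus the base's linear and angular velocity, giving 16 dimensions per timestep. This layout is consistent with how Mobile ALOHA describes its action space, but the helper below is a hypothetical sketch, not the project's actual API.

```python
import numpy as np

def whole_body_action(left_qpos, right_qpos, base_lin, base_ang):
    """Pack arm joint targets and base velocities into one 16-dim action.

    left_qpos/right_qpos: 7 values each (6 joints + 1 gripper).
    base_lin/base_ang: linear and angular velocity commands for the base.
    """
    assert len(left_qpos) == 7 and len(right_qpos) == 7
    return np.concatenate([left_qpos, right_qpos, [base_lin, base_ang]])

a = whole_body_action(np.zeros(7), np.zeros(7), 0.2, 0.0)
```

Because the policy predicts this full vector at once, it must learn when to move the base, when to move the arms, and when to coordinate both, which is what makes mobile demonstrations harder to collect and learn from.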
Mobile ALOHA also introduced co-training: training the Mobile ALOHA policy jointly on mobile manipulation demonstrations and static ALOHA manipulation demonstrations. The co-training improved manipulation performance on the mobile platform, suggesting that the bimanual manipulation knowledge from tabletop data transfers usefully to the mobile context. SVRC offers Mobile ALOHA-compatible datasets and can collect mobile manipulation demonstrations at our Palo Alto facility. Contact us to discuss your Mobile ALOHA data requirements.
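Mechanically, co-training just means each training batch mixes examples from the static and mobile datasets. A minimal sketch, assuming a fixed mixing probability (the 50/50 default here is an assumption for illustration; tune the ratio for your data):

```python
import random

def cotrain_batch(static_data, mobile_data, batch_size, p_static=0.5, seed=None):
    """Draw a batch mixing static-ALOHA and Mobile ALOHA examples."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        # each slot independently comes from one dataset or the other
        source = static_data if rng.random() < p_static else mobile_data
        batch.append(rng.choice(source))
    return batch
```

The practical upshot is that relatively cheap tabletop demonstrations can subsidize the more expensive mobile ones, since the bimanual manipulation skills overlap.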
Differences Between ALOHA, ALOHA 2, and Commercial Derivatives
ALOHA 2, published in early 2024, improved on the original in several dimensions: higher-quality arms with better repeatability, an improved camera mounting system, and a revised wrist design that reduces cable routing complexity. The electrical system was also updated to use a dedicated power distribution board rather than daisy-chained power cables, improving reliability during long data collection sessions. ALOHA 2 maintains full software compatibility with the original — datasets collected on one can train policies evaluated on the other, subject to the usual caveats about hardware variation.
Several commercial vendors now sell ALOHA-compatible platforms — pre-assembled, tested systems that follow the ALOHA mechanical and software specification without requiring the builder to source components and assemble the arms themselves. These commercial ALOHA systems cost more than the DIY bill of materials but substantially reduce setup time and the risk of assembly errors. SVRC's hardware catalog includes ALOHA-compatible configurations; see the store for current options and pricing.
Getting Started with ALOHA Through SVRC
SVRC supports ALOHA-based research at every stage. For teams just getting started, we offer ALOHA platform leasing through our robot leasing program — access a complete bimanual setup for a fixed monthly fee without the capital commitment of purchasing hardware. Leased systems arrive pre-calibrated and ready to collect demonstrations on day one.
For data collection, our managed service provides trained ALOHA operators who can collect demonstrations at our Palo Alto facility, with datasets delivered in RLDS/LeRobot format compatible with ACT, Diffusion Policy, and OpenVLA training pipelines. Our operators are experienced with bimanual coordination tasks and follow structured quality protocols that produce cleaner datasets than first-time researchers typically achieve. We can also visit your site for on-location data collection campaigns if your task requires it.
For policy training and evaluation, the SVRC platform provides pre-configured ACT training pipelines, experiment tracking, and evaluation tooling for ALOHA policies. Our benchmarks include ALOHA-specific task evaluations that let you compare your policy performance against reference implementations. Whether you are building a bimanual manipulation research program from scratch or trying to push the performance of an existing system, SVRC's team can help you plan the right approach.