Robot Camera Setup for Data Collection: Wrist, Overhead, and Stereo

Camera placement is one of the most important and most frequently underspecified decisions in robot data collection. The observations your policy sees during training must match what it will see during deployment — and getting the camera setup wrong means collecting data that cannot train a reliable policy.

Camera Placement Strategy

The first principle of robot camera placement is: cameras used for data collection must match the cameras used for policy deployment in mounting position and viewpoint. There is no recovery from this mismatch — a policy trained on wrist camera views cannot generalize to an overhead camera view, and vice versa. Define your deployment camera configuration before you collect a single episode of training data.

The most common configurations in manipulation research are: wrist-only (one camera mounted on the robot's wrist, looking forward at the manipulation workspace); overhead-only (one or two cameras mounted on a fixed overhead rig); and multi-view (wrist camera plus one or two external cameras providing global workspace context). Multi-view configurations consistently outperform single-view in policy performance, at the cost of more complex recording infrastructure.
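One practical way to enforce the collection/deployment match described above is to define the rig once as an immutable, versioned configuration and have both pipelines load the same object. The sketch below is illustrative — the `CameraConfig` and `RigConfig` names and fields are assumptions for this example, not part of any particular framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CameraConfig:
    """One camera in the rig; frozen so the definition cannot drift."""
    name: str    # e.g. "wrist_left", "overhead"
    mount: str   # "wrist" or "fixed"
    width: int
    height: int
    fps: int

@dataclass(frozen=True)
class RigConfig:
    cameras: tuple  # ordered tuple of CameraConfig

def assert_matching_rigs(collection: RigConfig, deployment: RigConfig) -> None:
    """Fail fast if the data-collection rig differs from the deployment rig."""
    if collection != deployment:
        raise ValueError("camera rigs differ; collected data will not transfer")

# Define the deployment rig first, then reuse the same object for collection.
rig = RigConfig(cameras=(
    CameraConfig("wrist", "wrist", 1280, 720, 30),
    CameraConfig("overhead", "fixed", 1280, 720, 30),
))
assert_matching_rigs(rig, rig)  # identical by construction
```

Freezing the dataclasses means a mismatched rig is caught at load time rather than discovered after weeks of unusable episodes.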

Wrist Cameras: Pros, Cons, and Best Practices

Wrist cameras provide a first-person view of the manipulation action — the robot sees approximately what it is doing at its end-effector. This viewpoint is highly informative for fine grasping and insertion tasks where the relationship between gripper and object must be perceived precisely. Wrist cameras also automatically follow the gripper through the workspace, ensuring the target object is always in frame during manipulation.

The main limitation of wrist cameras is that they do not see the global workspace — the robot cannot perceive objects far from its current gripper position without moving the arm. This limits their effectiveness for tasks requiring scene-level understanding or bi-manual coordination. For bimanual systems, each arm should carry its own wrist camera. Recommended specs: 1080p or higher resolution, 60+ fps, global shutter (not rolling shutter) to avoid motion blur during fast movements, and a wide-angle lens (90–110 degree FOV) to maintain view of the grasp contact point at close range.
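The recommended specs above can be checked programmatically when qualifying candidate hardware. This is a minimal sketch encoding the thresholds from this section; the function name and the string convention for shutter type are assumptions for illustration:

```python
def check_wrist_camera_spec(height_px, fps, shutter, fov_deg):
    """Check a candidate wrist camera against the recommended specs:
    >= 1080p resolution, >= 60 fps, global shutter, 90-110 degree FOV.
    Returns a list of problems; an empty list means the camera qualifies."""
    problems = []
    if height_px < 1080:
        problems.append(f"resolution {height_px}p is below 1080p")
    if fps < 60:
        problems.append(f"{fps} fps is below 60 fps")
    if shutter != "global":
        problems.append("rolling shutter risks motion blur during fast moves")
    if not 90 <= fov_deg <= 110:
        problems.append(f"{fov_deg} deg FOV is outside the 90-110 deg range")
    return problems

# A camera meeting every recommendation returns no problems.
assert check_wrist_camera_spec(1080, 60, "global", 100) == []
```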

Overhead Cameras: Configuration and Tradeoffs

Fixed overhead cameras provide stable, consistent workspace views that capture the full manipulation scene. They are less sensitive to arm motion and provide better context for tasks requiring multiple sequential steps across different workspace regions. Overhead cameras are simpler to mount consistently across multiple robot stations, which matters for large-scale data collection campaigns.

The limitation is reduced detail at the manipulation contact point. An overhead camera at 80 cm height looking down at a tabletop workspace cannot reliably observe gripper-object contact geometry on small objects. This is why overhead cameras are typically paired with wrist cameras in high-performance data collection setups — the overhead view provides task context and coarse positioning, while the wrist view provides fine manipulation detail.

Resolution, Frame Rate, and Synchronization

For manipulation data collection, 480p–720p per camera at 30 fps is sufficient for most imitation learning policies in 2026. Higher resolution (1080p) improves performance on tasks requiring fine spatial discrimination. Frame rates below 30 fps introduce temporal aliasing that degrades policy learning on fast tasks. Frame rates above 60 fps provide diminishing returns for most manipulation tasks and significantly increase storage requirements.
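The storage impact of these resolution and frame-rate choices is easy to estimate up front. The sketch below computes GB per hour of recording; the ~50:1 compression ratio is an assumption (a ballpark for H.264/H.265 on manipulation footage — your codec and scene content will vary):

```python
def storage_gb_per_hour(width, height, fps, n_cameras,
                        bytes_per_pixel=3, compression_ratio=50.0):
    """Rough storage estimate for multi-camera RGB recording.

    compression_ratio is an assumed codec ratio (~50:1 is a common
    ballpark for H.264/H.265); use 1.0 for raw uncompressed frames.
    """
    raw_bytes = width * height * bytes_per_pixel * fps * n_cameras * 3600
    return raw_bytes / compression_ratio / 1e9

# Doubling fps from 30 to 60 doubles storage, per the tradeoff above.
gb_30 = storage_gb_per_hour(1280, 720, 30, n_cameras=2)
gb_60 = storage_gb_per_hour(1280, 720, 60, n_cameras=2)
```

Two 720p cameras at 30 fps land near 12 GB per hour under these assumptions, which is why frame rates above 60 fps are rarely worth the cost for manipulation.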

Multi-camera synchronization is critical and frequently neglected. If cameras are not hardware-synchronized, timestamp alignment must be implemented carefully during data loading. Even 33 ms of inter-camera offset (one frame at 30 fps) can introduce training instability for tasks where the wrist and overhead views must be temporally consistent. The Intel RealSense D435 and D455 series support hardware synchronization via a sync cable and are SVRC's preferred choice for synchronized multi-camera setups.
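When hardware sync is unavailable, software timestamp alignment at load time can be done with nearest-neighbor matching plus a tolerance cutoff at the one-frame offset mentioned above. A minimal sketch, assuming both cameras stamp frames against a shared clock in seconds and timestamps are sorted:

```python
from bisect import bisect_left

def align_frames(wrist_ts, overhead_ts, max_offset_s=0.033):
    """Pair each wrist frame with the nearest overhead frame by timestamp.

    Returns (wrist_idx, overhead_idx) pairs. Wrist frames with no
    overhead frame within max_offset_s (one frame at 30 fps) are
    dropped, since larger offsets can destabilize training on
    temporally coupled views. Assumes sorted timestamps, shared clock.
    """
    if not overhead_ts:
        return []
    pairs = []
    for i, t in enumerate(wrist_ts):
        j = bisect_left(overhead_ts, t)
        # Candidate neighbors: the insertion point and the one before it.
        best = min(
            (k for k in (j - 1, j) if 0 <= k < len(overhead_ts)),
            key=lambda k: abs(overhead_ts[k] - t),
        )
        if abs(overhead_ts[best] - t) <= max_offset_s:
            pairs.append((i, best))
    return pairs
```

A wrist frame at t=0.1 s with the nearest overhead frame at t=0.25 s is dropped rather than paired; silently pairing such frames is exactly the failure mode that destabilizes training.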

Depth Cameras

Depth cameras provide per-pixel distance measurements in addition to RGB imagery, enabling 3D scene understanding without explicit stereo reconstruction. Intel RealSense, Microsoft Azure Kinect, and ZED cameras are the most commonly used depth sensors in robot data collection. Depth information is valuable for tasks where object height, shape, or 3D position is important for grasp planning, and for policies that use point cloud inputs rather than pure image inputs.

The tradeoff: depth cameras add weight, cost, and processing load. Many state-of-the-art imitation learning results are achieved with pure RGB cameras, suggesting depth is not always necessary. Use depth when your policy architecture explicitly benefits from 3D input, when tasks involve significant depth variation (stacking objects of different heights), or when you need robust performance across variable lighting conditions (depth is more lighting-invariant than RGB).
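For policies that consume point clouds, the depth image is typically back-projected into 3D camera-frame points using the camera's intrinsics. A minimal sketch of the standard pinhole deprojection (the parameter values in the example are illustrative, not from a specific sensor):

```python
def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel with measured depth into a 3D point in the
    camera frame using the pinhole model:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# The principal point deprojects straight along the optical axis.
assert deproject_pixel(320, 240, 0.5, 600.0, 600.0, 320.0, 240.0) == (0.0, 0.0, 0.5)
```

Applying this over every valid depth pixel yields the point cloud; depth SDKs such as RealSense's provide optimized equivalents of this operation.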

Calibration and SVRC's Multi-Camera Standard

Every camera must be calibrated before data collection begins: intrinsic calibration (focal length, principal point, distortion coefficients) and extrinsic calibration (position and orientation relative to the robot base). Use a physical checkerboard target for calibration, re-calibrate after any camera movement or adjustment, and store calibration parameters as metadata with each dataset.
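Storing calibration as dataset metadata works well as a small, self-describing record per camera. A sketch of one possible JSON layout — the field names, the distortion convention, and the target string are assumptions for illustration, not a fixed SVRC schema:

```python
import json

def calibration_record(camera_name, fx, fy, cx, cy, dist_coeffs,
                       extrinsic_4x4, target="checkerboard_9x6_25mm"):
    """Serialize one camera's calibration for inclusion in a dataset export.

    dist_coeffs follows the common (k1, k2, p1, p2, k3) convention;
    extrinsic_4x4 is the camera-to-robot-base homogeneous transform,
    row-major. Both conventions are assumptions for this sketch.
    """
    return json.dumps({
        "camera": camera_name,
        "intrinsics": {"fx": fx, "fy": fy, "cx": cx, "cy": cy},
        "distortion": list(dist_coeffs),
        "extrinsic_camera_to_base": [list(row) for row in extrinsic_4x4],
        "calibration_target": target,
    }, indent=2)

record = calibration_record(
    "overhead", 600.0, 600.0, 320.0, 240.0, (0, 0, 0, 0, 0),
    [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.8], [0, 0, 0, 1]],
)
```

Keeping the record alongside the episodes (rather than in a separate config repo) ensures a dataset can always be reprojected correctly even after the physical rig changes.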

SVRC's data collection standard uses a fixed three-camera configuration: one wrist camera per arm plus one calibrated overhead camera per station. Physical camera mounts are part of our standardized workstation design, ensuring consistent placement across our facility. All calibration parameters are logged automatically and included in dataset exports. For teams setting up their own data collection infrastructure, SVRC offers camera setup consultation and can supply pre-calibrated camera assemblies — contact us or see our data services page for details.

Related: Mobile ALOHA Setup · Robot Data Annotation · Force Torque Sensing · Data Services