Wrist Camera vs. Fixed Camera for Robot Manipulation: Tradeoffs and Best Practices

Why Camera Placement Matters

Post-mortems on failed robot learning projects frequently trace policy failures to perception rather than control. Within perception, camera placement, more than camera quality, tends to be the most impactful variable. A well-placed $200 RealSense D435i can outperform a poorly placed $1,500 industrial camera on manipulation tasks.

Camera placement determines: what visual information the policy receives, whether arm occlusion blocks critical moments, how much the calibration drifts over time, and whether the policy generalizes across workspace positions. Getting it wrong at the start of a data collection program means either re-collecting all data or accepting a systematically underperforming policy.

Fixed Camera: Advantages

Stable calibration: A rigidly mounted camera maintains its extrinsic calibration (position and orientation relative to the robot base) indefinitely, modulo thermal expansion. No dynamic calibration is required. Calibrate once at setup and recalibrate after any physical disturbance.
No cable management: Fixed cameras connect via standard USB 3 or GigE cables to a nearby PC. No cable routing through the robot arm, no risk of cable fatigue from repeated arm motions, no weight added to the end-effector.
Wide field of view: A camera mounted 60–80 cm above and slightly behind the workspace captures the entire relevant scene — object, gripper, arm, and environment — providing global context for task planning and state estimation.
Easy multi-view setup: Multiple fixed cameras at different angles (overhead, side, wrist-height front) are straightforward to add. Each camera has independent rigid mounting and independent calibration.
Better for global task state estimation: Tasks requiring understanding of the full workspace (e.g., "where should I place this object given the current arrangement?") benefit from the global view that only a fixed camera provides.

Fixed Camera: Disadvantages

Arm occlusion during approach: As the arm moves toward the object, the arm structure and gripper occlude the camera's view of the gripper-object interface — precisely the moment where accurate visual feedback is most critical. This is the primary failure mode for fixed-camera-only manipulation systems on precision tasks.
Limited close-up resolution: At a fixed-camera distance of 60–80 cm, 1 camera pixel corresponds to approximately 0.5–1 mm of workspace. For tasks requiring <3 mm precision, the entire tolerance spans only 3–6 pixels, which leaves too little margin for the final contact phase once blur, depth noise, and calibration error are included.
Position-dependent performance: Object appearance (size, lighting, perspective) changes significantly as the object moves across the workspace. The policy must learn to generalize over these view-dependent changes — increasing effective data requirements.
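The per-pixel resolution figure above follows from pinhole geometry. The sketch below is a rough check under stated assumptions: a horizontal field of view of about 69° and a 1280-pixel-wide image are illustrative values, not figures given in this article, and `mm_per_pixel` is a hypothetical helper name.

```python
import math

def mm_per_pixel(distance_mm, hfov_deg, width_px):
    """Approximate workspace width covered by one pixel at a given distance.

    Pinhole approximation for a surface facing the camera: the visible width
    is 2 * d * tan(HFOV / 2), spread evenly across the image columns.
    """
    visible_width_mm = 2.0 * distance_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return visible_width_mm / width_px

# Assumed values for illustration only.
ASSUMED_HFOV_DEG = 69.0
ASSUMED_WIDTH_PX = 1280

for distance_cm in (60, 80):
    footprint = mm_per_pixel(distance_cm * 10.0, ASSUMED_HFOV_DEG, ASSUMED_WIDTH_PX)
    print(f"{distance_cm} cm: {footprint:.2f} mm per pixel")
```

Under these assumptions both distances land inside the 0.5–1 mm range. The footprint scales linearly with distance and inversely with image width, so halving the image width doubles the footprint.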

Wrist Camera: Advantages

Follows the end-effector: The wrist camera maintains a consistent, close-up view of the gripper and nearby object regardless of arm configuration. This is the natural human analog — we look at what our hands are doing during precision manipulation.
Minimal gripper-object occlusion: From the wrist camera perspective, the gripper is in the scene, but with a forward-facing or slightly downward-angled mount the object-gripper interface stays visible through most of the approach. The major source of fixed-camera failure is largely eliminated.
Consistent appearance regardless of workspace position: Because the camera moves with the arm, object appearance in the wrist camera is nearly identical at all workspace positions (same distance, similar angle). This dramatically reduces the amount of data needed to generalize across workspace positions.
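High effective resolution at contact: At a wrist-camera distance of 10–20 cm from the object, 1 pixel corresponds to roughly 0.2–0.6 mm for a 640-pixel-wide image (assuming a 70–90° horizontal field of view), several times finer than a fixed camera delivers at the same sensor resolution. Even a 640×480 camera at this distance provides enough detail for contact-phase sensing.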

Wrist Camera: Disadvantages

Dynamic calibration required: The wrist camera's pose relative to the robot base changes with every arm motion, so it has to be computed from the arm's forward kinematics and a fixed camera-to-gripper transform. That transform (position and orientation relative to the gripper) must be known precisely. Hand-eye calibration (using a calibration target in a known position) is required at setup and after any physical disturbance to the wrist mount. This adds 30–60 minutes of setup time per recalibration.
Cable routing constraints: The camera cable must be routed from the wrist, along the arm, to the robot base and then to the PC. Routing the cable so that it does not bind or chafe over the arm's full range of motion is a real engineering challenge. Spring-retractable cable management or wireless camera modules mitigate this.
Vibration and shock: Wrist-mounted cameras experience mechanical vibration from the robot motors, particularly during high-speed moves or contact events. This can blur images at critical moments. A short exposure time (≤4 ms) and a mechanically damped mount reduce this.
Limited global context: The wrist camera shows only a small region around the gripper. For tasks requiring global workspace awareness (multi-object placement, sequential manipulation with widely separated objects), wrist-only vision is insufficient.
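The calibration point above can be made concrete with a small sketch. Hand-eye calibration estimates the fixed gripper-to-camera transform; the camera pose in the base frame is then recomputed from forward kinematics on every frame. The numbers and the helper names (`transform`, `matmul4`) below are hypothetical and chosen only for illustration, and the rotation is restricted to the Z axis to keep the example short.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(rotation_z_deg, translation):
    """Build a 4x4 transform from a rotation about Z (degrees) and a translation (metres)."""
    c = math.cos(math.radians(rotation_z_deg))
    s = math.sin(math.radians(rotation_z_deg))
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

# Hypothetical values: the gripper pose would come from forward kinematics,
# and the gripper-to-camera offset from hand-eye calibration.
base_T_gripper = transform(90.0, (0.40, 0.00, 0.30))
gripper_T_camera = transform(0.0, (0.05, 0.00, 0.08))

# Camera pose in the robot base frame; this changes whenever the arm moves,
# while gripper_T_camera stays constant until the mount is disturbed.
base_T_camera = matmul4(base_T_gripper, gripper_T_camera)
camera_position = [row[3] for row in base_T_camera[:3]]
print([round(v, 3) for v in camera_position])
```

Any error in `gripper_T_camera` propagates into every computed camera pose, which is why a disturbed wrist mount calls for recalibration.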

The SVRC Standard: 3-Camera System

Based on policy performance data across 40+ manipulation tasks, SVRC's standard camera configuration is a 3-camera system: two fixed cameras at wrist height from different angles, plus one wrist-mounted camera.

Camera 1 (fixed, front-left): Intel RealSense D435i, mounted at wrist height (60–70 cm above table) and angled 30–45° off the frontal axis. Captures the full workspace with depth.
Camera 2 (fixed, front-right): Same model, symmetric placement. Together with Camera 1, provides stereo depth reconstruction and dual coverage that eliminates most arm occlusion from the fixed perspective.
Camera 3 (wrist-mounted): OAK-D (OpenCV AI Kit with Depth) or RealSense D405 (a short-range depth camera optimized for roughly 7–50 cm, well suited to wrist mounting), mounted on the gripper facing forward-down at 30° below horizontal.
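A rig like this is easier to audit when it is also written down as data. The following is a minimal sketch of such a description; the `CameraSpec` class, its field names, and the camera identifiers are hypothetical and not part of any SVRC tooling, while the models and placements are taken from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CameraSpec:
    name: str    # hypothetical identifier used in logs and dataset keys
    model: str
    mount: str   # "fixed" or "wrist"
    notes: str

RIG = (
    CameraSpec("cam1_front_left", "Intel RealSense D435i", "fixed",
               "60-70 cm above table, 30-45 deg from frontal"),
    CameraSpec("cam2_front_right", "Intel RealSense D435i", "fixed",
               "symmetric to cam1"),
    CameraSpec("cam3_wrist", "OAK-D or RealSense D405", "wrist",
               "forward-down, 30 deg below horizontal"),
)

# Quick sanity check that the rig matches the intended layout.
fixed = [c.name for c in RIG if c.mount == "fixed"]
wrist = [c.name for c in RIG if c.mount == "wrist"]
print(f"{len(fixed)} fixed, {len(wrist)} wrist")
```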

Resolution Recommendations

Camera Position	Minimum Resolution	Recommended	Notes
Fixed (60–80 cm distance)	1280×720 @ 30fps	1920×1080 @ 30fps	Higher res important for far-field precision
Wrist (10–20 cm distance)	640×480 @ 30fps	848×480 @ 60fps	Higher fps helps with fast gripper motion
Overhead (90–120 cm)	1280×720 @ 30fps	1920×1080 @ 15fps	Global context; lower fps acceptable

Hardware Trigger Synchronization

For multi-camera systems, frame synchronization is critical. Without hardware sync, cameras drift relative to each other (due to USB frame timing variance), so the training data ends up pairing images from different views that were captured at different moments. This is a documented source of policy degradation.

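Hardware trigger: A GPIO trigger signal (from the robot controller or a dedicated sync box) fires all cameras simultaneously. The RealSense D400 series exposes a hardware sync connector for this; which streams the sync covers depends on the model and firmware, so verify it for the D435i before relying on it. Target <1 ms inter-camera jitter.
Software sync (fallback): If hardware trigger is not available, timestamp every frame against a common clock (NTP-synchronized if the cameras are attached to different hosts) and select matched frames by timestamp. This achieves roughly 10–30 ms sync accuracy, which is usable but marginal for 30 fps cameras (33 ms per frame), since the worst case approaches a full frame of skew.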
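Matching frames by timestamp can be sketched as a nearest-neighbour search with a skew limit. This is an illustrative example, not the pipeline used for any particular dataset: the function names, the 10 ms threshold, and the timestamps are all assumptions, and both lists are assumed to be sorted and expressed on the same clock.

```python
from bisect import bisect_left

def nearest_index(sorted_timestamps, target):
    """Index of the timestamp closest to target in a sorted, non-empty list."""
    pos = bisect_left(sorted_timestamps, target)
    if pos == 0:
        return 0
    if pos == len(sorted_timestamps):
        return len(sorted_timestamps) - 1
    before, after = sorted_timestamps[pos - 1], sorted_timestamps[pos]
    return pos if (after - target) < (target - before) else pos - 1

def match_frames(reference_ts, other_ts, max_skew_s=0.010):
    """Pair each reference frame with the nearest frame from another camera.

    Returns (reference_index, other_index) pairs. Reference frames whose best
    match is further away than max_skew_s are dropped rather than mispaired.
    """
    pairs = []
    for i, t in enumerate(reference_ts):
        j = nearest_index(other_ts, t)
        if abs(other_ts[j] - t) <= max_skew_s:
            pairs.append((i, j))
    return pairs

# Hypothetical timestamps in seconds from two 30 fps cameras.
cam_a = [0.000, 0.033, 0.067, 0.100]
cam_b = [0.004, 0.038, 0.085, 0.103]
print(match_frames(cam_a, cam_b))  # [(0, 0), (1, 1), (3, 3)]
```

The third reference frame is discarded because its closest counterpart is 18 ms away. Dropping such frames costs some data but keeps mismatched views out of the training set.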

Policy Performance: Camera Configuration Comparison

Camera Setup	Success Rate (Precision Assembly)	Success Rate (Pick-and-Place)	Notes
Single fixed (overhead)	42%	74%	Severe occlusion during approach
Single wrist only	65%	68%	Good precision, poor global context
Two fixed (stereo)	58%	81%	Better occlusion coverage, still limited close-up
Fixed + wrist (2-camera)	78%	86%	Good balance, standard for research
Two fixed + wrist (3-camera)	84%	89%	SVRC standard; best overall performance

The 3-camera configuration delivers a 42-percentage-point improvement over a single fixed camera on precision assembly tasks (84% vs. 42%), and a 19-point improvement over a wrist-only setup (84% vs. 65%). A gap of this size is unlikely to be recovered by algorithm changes or by collecting more data with an inferior camera setup.
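The gaps between configurations can be recomputed directly from the table. The snippet below is a small sketch that does so; the dictionary layout and the `gap_to_best` helper are illustrative names, and the numbers are copied from the table as given.

```python
# Success rates (%) copied from the comparison table.
success_rates = {
    "Single fixed (overhead)": {"precision_assembly": 42, "pick_and_place": 74},
    "Single wrist only": {"precision_assembly": 65, "pick_and_place": 68},
    "Two fixed (stereo)": {"precision_assembly": 58, "pick_and_place": 81},
    "Fixed + wrist (2-camera)": {"precision_assembly": 78, "pick_and_place": 86},
    "Two fixed + wrist (3-camera)": {"precision_assembly": 84, "pick_and_place": 89},
}

BEST = "Two fixed + wrist (3-camera)"

def gap_to_best(task):
    """Percentage-point gap between the 3-camera setup and every other setup."""
    best = success_rates[BEST][task]
    return {
        name: best - rates[task]
        for name, rates in success_rates.items()
        if name != BEST
    }

for task in ("precision_assembly", "pick_and_place"):
    print(task, gap_to_best(task))
```

These are differences in percentage points, not relative improvements, and they reflect only the two task families reported in the table.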

All demonstrations in SVRC's data collection program are collected using the 3-camera standard configuration, with hardware-synchronized frames and calibrated extrinsics included in the delivered dataset.