Camera Types Comparison

Three camera technologies are commonly used in robot data collection setups. Your choice affects cost, latency determinism, and integration complexity.

Type            Example Model          Price         Latency               Best For
USB (UVC)       Logitech BRIO 4K       ~$200         50–200 ms, variable   Budget setups, low-frequency tasks
GigE Vision     Basler ace2 a2A1920    $400–$1,500   <1 ms, deterministic  High-quality datasets, policy training
Depth (RGB-D)   Intel RealSense D435   ~$200         30–60 ms              Supplemental depth; not recommended as primary

USB cameras (Logitech BRIO, ELP, Arducam) are the most accessible but suffer from variable latency caused by USB host controller scheduling. Under load, frame delivery can jitter by 20–50 ms, which desynchronizes multi-camera recordings and degrades policy training. For single-camera or low-frame-rate setups (<15 fps), USB is acceptable.

GigE Vision cameras (Basler ace2, FLIR Blackfly S, Allied Vision Alvium) deliver frames over Ethernet with deterministic <1 ms latency when hardware-triggered. The Basler ace2 a2A1920-160ucBAS at $650 offers 1920×1200 at 160 fps, more than sufficient for 30 fps robot recording. GigE cameras require a dedicated NIC with jumbo frames enabled (ip link set eth0 mtu 9000) and a PoE switch or injector.

Depth cameras (RealSense D435, Azure Kinect) are useful for supplemental 3D scene understanding but are not recommended as primary recording cameras. Their rolling-shutter RGB sensors, depth noise at object boundaries, and difficulty with shiny or dark surfaces make them unsuitable as the sole visual observation. Use them in addition to RGB cameras if your policy requires depth input.

Recommended 3-Camera Setup

The following configuration is used at SVRC for standard manipulation data collection and balances coverage, resolution, and storage cost:

  • Camera 1 — Fixed overhead (top-down): Mounted 80–100 cm above the workspace, pointing straight down. Resolution 1280×960 at 30 fps. Captures the full workspace, object placement, and gripper approach from above. This is the most informative view for most pick-and-place policies.
  • Camera 2 — Fixed side (lateral): Mounted at workspace height, 60–80 cm to the side. Resolution 1280×960 at 30 fps. Provides height information that the overhead view cannot. Critical for stacking, pouring, and insertion tasks.
  • Camera 3 — Wrist (ego-centric): Mounted on the robot's end-effector or tool flange, facing forward. Resolution 640×480 at 60 fps. The higher frame rate captures fast wrist motions without blur. Ego-centric views significantly improve grasping and fine manipulation policy performance in imitation learning research.

For bimanual setups, add a fourth fixed camera on the opposite side from Camera 2 to cover inter-arm occlusions.

Synchronization Methods

Multi-camera synchronization is critical. A 33 ms desynchronization between cameras at 30 fps means one camera is one full frame behind — policies trained on desynchronized data learn incorrect temporal correlations.
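As a recording-time sanity check, you can compute the worst pairwise timestamp skew across cameras for each frame set and reject anything approaching a full frame period. A minimal sketch (function names and the half-frame tolerance are illustrative choices, not from a specific library):

```python
def max_skew_ms(frame_timestamps_ms):
    """Worst pairwise skew among per-camera timestamps for one frame set."""
    return max(frame_timestamps_ms) - min(frame_timestamps_ms)

def is_synchronized(frame_timestamps_ms, fps=30, tolerance=0.5):
    """Frames count as synchronized if their skew is under `tolerance`
    frame periods (half a frame at 30 fps is ~16.7 ms)."""
    frame_period_ms = 1000.0 / fps
    return max_skew_ms(frame_timestamps_ms) < tolerance * frame_period_ms

# A 33 ms skew at 30 fps is a full frame apart, so it is rejected.
print(is_synchronized([0.0, 12.0, 33.0]))  # False
print(is_synchronized([0.0, 3.0, 5.0]))    # True
```

Logging this metric per frame set lets you catch drift early instead of discovering it after a day of collection.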

Hardware GPIO trigger (recommended for GigE cameras): A single trigger pulse is generated by a microcontroller (Arduino Uno at $25 or Raspberry Pi GPIO) and wired to the trigger input of all cameras simultaneously. Achieves <1 ms synchronization. Configure cameras in external trigger mode via Pylon (Basler) or SpinView (FLIR). The trigger pulse is also logged to your data file, giving you a precise common timestamp.

Software synchronization via NTP: Synchronize all recording machines to a common NTP server (sudo apt install ntp; use pool.ntp.org, or run a local Chrony server for ±2 ms accuracy on the LAN). ROS2 timestamps using rclpy.clock.Clock(clock_type=ClockType.SYSTEM_TIME) will then be consistent across machines to ±10 ms. Adequate for 15 fps recording but not for 60 fps wrist cameras.

Camera Calibration

Calibration has two components: intrinsics per camera, and extrinsics (relative poses between cameras and to the robot base).

Intrinsic calibration characterizes each camera's focal length, principal point, and distortion coefficients. Use OpenCV's calibration module with a 9×7 checkerboard (25 mm squares). Collect 20–40 images at varied angles and distances. Target a reprojection error <0.5 px (acceptable up to 1.0 px). Run calibration with cv2.findChessboardCorners, cv2.cornerSubPix, and cv2.calibrateCamera.

Extrinsic calibration determines the 6-DOF transform from each camera frame to the robot base frame. Use a ChArUco board (better corner detection than plain checkerboard) mounted in several known poses. For each camera, collect 15–20 board observations across the workspace volume. The resulting transforms are stored as static TF frames in your ROS2 parameter file and used to project observations into a common robot-relative coordinate frame for policy training.
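The core of the extrinsic step is transform composition: given the board's known pose in the robot base frame and the board's pose as seen by the camera (from cv2.solvePnP on the ChArUco detections), the camera-to-base transform follows by matrix algebra. A minimal numpy sketch, assuming you already have both poses:

```python
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def camera_to_base(T_base_board, T_cam_board):
    """Compose the camera-to-base extrinsic from one board observation:
    T_base_cam = T_base_board @ inv(T_cam_board).
    T_cam_board comes from cv2.solvePnP; T_base_board is the board's
    surveyed pose in the robot base frame."""
    return T_base_board @ np.linalg.inv(T_cam_board)

def transform_point(T, p):
    """Apply a 4x4 transform to a 3D point (3,)."""
    return (T @ np.append(p, 1.0))[:3]
```

In practice you would average (or jointly optimize) over the 15–20 board observations rather than trust a single one; this sketch shows only the single-observation composition.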

Verify calibration by projecting the robot's TCP position (from forward kinematics) into each camera image. The projected point should align with the visible TCP to within 5 pixels at all workspace positions. Errors >10 px typically indicate an incorrect camera mount pose — recheck your mount rigidity and recollect extrinsic data.
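The verification step reduces to projecting the FK-derived TCP position through the pinhole model and measuring the pixel offset. A sketch assuming known intrinsics K and extrinsic T_cam_base (lens distortion omitted for brevity):

```python
import numpy as np

def project_to_pixels(K, T_cam_base, p_base):
    """Project a 3D point expressed in the robot base frame into pixel
    coordinates using a pinhole model (no distortion in this sketch)."""
    p_cam = (T_cam_base @ np.append(p_base, 1.0))[:3]
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def projection_error_px(K, T_cam_base, tcp_base, tcp_pixel_observed):
    """Pixel distance between the projected FK TCP and the TCP seen in the
    image. <=5 px passes; >10 px suggests a bad extrinsic."""
    predicted = project_to_pixels(K, T_cam_base, tcp_base)
    return float(np.linalg.norm(predicted - np.asarray(tcp_pixel_observed)))
```

Sweep the arm through several workspace positions and record the error at each; a position-dependent error pattern usually points at a mount that flexed after calibration.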

Recording Pipeline

The recording pipeline must capture synchronized frames, robot joint states, and action labels into a single file per demonstration.

  • ROS2 image_transport: Subscribe to camera topics using image_transport::Subscriber with the compressed transport. This offloads JPEG/H.264 encoding to the camera driver and reduces CPU load on the recording machine.
  • H.264 compression: Configure camera drivers to output H.264 at 10–30 Mbps. At 10 Mbps, a 1280×960@30fps stream uses roughly 75 MB/minute per camera. At 30 Mbps (high quality for fine-grained policies), 225 MB/minute per camera.
  • HDF5 logging: Simultaneously write synchronized frames, joint positions, joint velocities, end-effector pose, and gripper state to HDF5 using h5py. Structure your HDF5 with one group per demonstration episode: /episode_0042/observations/images/cam_overhead, /episode_0042/actions/joint_positions.
  • Frame alignment: At write time, align all data to the trigger timestamp. Drop frames that arrive more than 5 ms late rather than using them with incorrect timestamps.
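The HDF5 episode layout named above can be sketched with h5py. The group paths follow the text; the extra dataset names and the gzip-on-raw-frames choice are illustrative (a real pipeline would store H.264 chunks rather than raw arrays):

```python
import h5py
import numpy as np

def write_episode(path, episode_id, images_by_cam, joint_positions):
    """Write one demonstration episode using the layout from the text:
    /episode_XXXX/observations/images/<cam_name>
    /episode_XXXX/actions/joint_positions"""
    with h5py.File(path, "a") as f:
        ep = f.create_group(f"episode_{episode_id:04d}")
        images = ep.create_group("observations/images")
        for cam_name, frames in images_by_cam.items():
            # gzip keeps raw frames manageable for small tests; production
            # pipelines store compressed video chunks instead.
            images.create_dataset(cam_name, data=frames, compression="gzip")
        ep.create_dataset("actions/joint_positions", data=joint_positions)

# Example: 10 overhead frames and a 7-DOF joint trajectory.
frames = np.zeros((10, 96, 128, 3), dtype=np.uint8)
joints = np.zeros((10, 7), dtype=np.float32)
write_episode("demos.h5", 42, {"cam_overhead": frames}, joints)
```

One group per episode keeps episodes independently readable, which simplifies shuffling and train/validation splits downstream.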

Storage Planning

Before starting a data collection campaign, calculate your storage requirements:

Example: 3-camera setup at 15 Mbps per camera, 8-hour collection day: 3 cameras × 15 Mbps × 3600 s/hr × 8 hr ÷ 8 bits/byte ÷ 1e9 bytes/GB = 162 GB/day. At 10 Mbps average: 108 GB/day. Plan for a 4 TB NVMe SSD per recording station, with nightly rsync to a NAS or cloud bucket.
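The arithmetic above generalizes to a one-line calculator, handy when comparing bitrate settings before a campaign:

```python
def storage_gb_per_day(num_cameras, mbps_per_camera, hours):
    """Raw video volume: cameras x bitrate x seconds, converted from
    bits to GB (1e9 bytes)."""
    bits = num_cameras * mbps_per_camera * 1e6 * hours * 3600
    return bits / 8 / 1e9

print(storage_gb_per_day(3, 15, 8))  # 162.0
print(storage_gb_per_day(3, 10, 8))  # 108.0
```

Note this counts video only; joint-state and metadata streams add a few percent on top.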

For cloud storage, use SVRC's Data Services which provides automated upload, deduplication, and dataset versioning. Raw recordings are compressed further (lossless for joint data, lossy H.264 for video) reducing long-term storage to approximately 40 GB per collection day.

Lighting Setup

Consistent lighting dramatically improves policy generalization. Policies trained under inconsistent lighting often fail in deployment when lighting conditions differ.

  • Color temperature: Use 5500K daylight-balanced LED ring lights (e.g., Neewer 18" ring light at $60/unit). Consistent color temperature across all lights prevents white balance variation between cameras.
  • Placement: Position lights at 45° angles from the overhead camera axis to minimize specular reflections off shiny objects and robot surfaces. Avoid lights directly behind any camera.
  • Diffusion: Add diffusion panels (frosted acrylic sheets) in front of lights to eliminate hard shadows. Hard shadows create visual features that don't generalize to different times of day.
  • Blackout curtains: For lab setups near windows, install blackout curtains to eliminate ambient light variation from clouds and sun angle. This is one of the highest-ROI investments in data collection quality.