Why Single-Modality Sensing Falls Short
Vision-only manipulation fails at contact. Once the robot's fingers are on an object, they often occlude the contact zone from the camera. Visual policies that work well during approach break down during the precision contact phase of tasks — insertion, stable grasping, surface following — because the information they need is now behind the robot's own hand.
Force-only control (impedance control, force servoing) lacks spatial information. It can detect when contact is made and regulate contact force, but it cannot tell you whether you are touching the right surface, whether the grasp pose is stable, or whether the object has moved.
Tactile sensing alone provides rich contact geometry information but has limited spatial range — you have to already be at the contact point. It is the highest-bandwidth, shortest-range modality.
The combination of all three covers each modality's failure modes and produces policies that are substantially more robust than any single modality alone.
Sensor Modalities in Detail
| Modality | Sensor Example | Information | Update Rate | Cost |
|---|---|---|---|---|
| RGB camera (fixed) | Intel RealSense D435 | Scene context, object pose | 30–90 Hz | $150–300 |
| Depth camera | Intel RealSense D435i | 3D geometry, point cloud | 30–90 Hz | $200–400 |
| Wrist camera (RGB) | FLIR Blackfly S | Close-up contact zone | 60–200 Hz | $400–1,200 |
| Wrist F/T sensor | ATI Mini45 | 6-axis contact wrench | 1,000–7,000 Hz | $3,000–8,000 |
| Tactile (GelSight) | GelSight Mini | High-res contact image | 30–60 Hz | $300–600 |
| Tactile (capacitive) | XELA uSkin | 3-axis force per taxel | 100–500 Hz | $1,500–5,000 |
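One practical consequence of the update-rate column: a fused observation is only complete at the rate of the slowest sensor, so faster streams (like the F/T sensor) must be buffered or subsampled between policy steps. A minimal sketch, with rates taken from the lower bounds in the table (the modality names are illustrative, not an API):

```python
from dataclasses import dataclass

@dataclass
class Modality:
    name: str
    rate_hz: float  # nominal update rate (lower bound from the table)

# Illustrative sensor suite drawn from the table above.
SUITE = [
    Modality("rgb_fixed", 30.0),
    Modality("wrist_ft", 1000.0),
    Modality("gelsight", 30.0),
]

def policy_rate(suite):
    """The fused policy step rate is bounded by the slowest modality;
    faster streams are buffered or subsampled between steps."""
    return min(m.rate_hz for m in suite)

print(policy_rate(SUITE))  # 30.0
```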
Fusion Architectures
Early fusion: Concatenate all sensor feature vectors into a single input to the policy network. Simple to implement. Requires all sensors to be present at both training and deployment time — if any sensor fails or is not available, the policy cannot operate. Appropriate when sensor availability is guaranteed and you want the simplest possible implementation.
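Early fusion can be sketched in a few lines: all per-modality feature vectors are concatenated and fed to one policy network. The feature dimensions and the two-layer head below are illustrative placeholders, not a specific model from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions are illustrative).
vision_feat = rng.standard_normal(128)   # e.g. CNN embedding of the camera image
force_feat = rng.standard_normal(6)      # 6-axis wrench from the wrist F/T sensor
tactile_feat = rng.standard_normal(64)   # embedding of the tactile image

# Early fusion: one concatenated vector feeds a single policy network.
x = np.concatenate([vision_feat, force_feat, tactile_feat])  # shape (198,)

# Minimal two-layer policy head (random weights, for shape illustration only).
W1 = rng.standard_normal((256, x.size)) * 0.01
W2 = rng.standard_normal((7, 256)) * 0.01   # e.g. a 7-DoF arm action
action = W2 @ np.tanh(W1 @ x)

print(action.shape)  # (7,)
```

Note the brittleness the text describes: if `tactile_feat` is missing at deployment, `x` has the wrong dimension and the network cannot run at all.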
Late fusion: Train independent encoders for each modality, then combine at a bottleneck layer with attention or concatenation before the action head. More robust to missing sensors — you can mask out a modality's contribution during deployment if needed. Recommended for production systems where sensor failure is a real possibility.
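A minimal late-fusion sketch, using fixed linear maps as stand-ins for trained per-modality encoders and a masked mean as the combination step (attention would work the same way structurally). The masking shows how a dropped sensor can be excluded at deployment:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32  # shared bottleneck dimension

def encoder(dim_in, dim_out=D, seed=0):
    """Stand-in for a trained per-modality encoder: a fixed linear map."""
    W = np.random.default_rng(seed).standard_normal((dim_out, dim_in)) * 0.1
    return lambda x: np.tanh(W @ x)

encoders = {
    "vision": encoder(128, seed=10),
    "force": encoder(6, seed=11),
    "tactile": encoder(64, seed=12),
}

def fuse(obs, available):
    """Masked mean over the modalities actually present, so the policy
    still produces a bottleneck feature if a sensor drops out."""
    feats = [encoders[m](obs[m]) for m in available]
    return np.mean(feats, axis=0)

obs = {
    "vision": rng.standard_normal(128),
    "force": rng.standard_normal(6),
    "tactile": rng.standard_normal(64),
}

z_full = fuse(obs, ["vision", "force", "tactile"])
z_degraded = fuse(obs, ["vision", "force"])  # tactile sensor failed
print(z_full.shape, z_degraded.shape)  # (32,) (32,)
```

Because every encoder maps into the same `D`-dimensional space, the action head downstream sees a fixed-size input regardless of which sensors are live.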
Cross-attention transformer fusion: Treat each sensor's output as a sequence of tokens and use transformer cross-attention to let modalities attend to each other. This is the architecture used in recent foundation models (OpenVLA uses visual tokens + language tokens with cross-attention). More expressive than late fusion but requires more data to train effectively.
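The core operation can be shown in isolation: single-head scaled dot-product cross-attention, where one modality's tokens query another's. The token counts and dimensions below are illustrative, not taken from any particular model:

```python
import numpy as np

def cross_attention(queries, keys_values, d):
    """Single-head scaled dot-product cross-attention: query tokens from
    one modality attend over another modality's tokens."""
    Q, K, V = queries, keys_values, keys_values
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # softmax over key tokens
    return w @ V, w

rng = np.random.default_rng(2)
d = 16
vision_tokens = rng.standard_normal((49, d))  # e.g. 7x7 image patch embeddings
force_tokens = rng.standard_normal((8, d))    # e.g. a short wrench history

# Force tokens query the vision tokens: roughly, "where in the image
# does this contact event come from?"
fused, attn = cross_attention(force_tokens, vision_tokens, d)
print(fused.shape, attn.shape)  # (8, 16) (8, 49)
```

Each row of `attn` is a distribution over vision tokens, which is what makes this more expressive (and more data-hungry) than a single bottleneck concatenation.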
Practical Implementation Recommendation
- Starting point — vision only: Overhead camera + wrist camera. This configuration is sufficient for 70–80% of manipulation tasks and requires no force or tactile hardware.
- Contact-rich tasks: Add wrist F/T sensor. The 6-axis wrench provides contact detection and force regulation that dramatically improves insertion and assembly task performance.
- Dexterous manipulation: Add tactile sensing (GelSight or capacitive array). The high-resolution contact geometry enables in-hand re-grasping and slip detection.
- Do not add modalities preemptively: Each additional modality adds data collection complexity (you need demonstrations with all sensors recording), annotation overhead, and training complexity. Add sensors when you hit a specific failure mode that they address.
Training Data Implications
Multimodal policies require demonstrations with all sensor modalities recorded simultaneously. This means your data collection setup must have all sensors calibrated and logging synchronously at the time of collection. Retrofitting force or tactile data to existing vision-only demonstrations is not possible — you need to re-collect.
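Synchronized logging usually means resampling faster streams onto a common clock when building training frames. A minimal sketch of one common approach (zero-order hold: for each camera timestamp, take the most recent F/T sample); the function name and rates are illustrative:

```python
import numpy as np

def align_to_clock(clock_ts, stream_ts, stream_vals):
    """For each policy-clock timestamp, pick the most recent sample from
    a faster sensor stream (zero-order hold)."""
    idx = np.searchsorted(stream_ts, clock_ts, side="right") - 1
    idx = np.clip(idx, 0, len(stream_ts) - 1)
    return stream_vals[idx]

# 30 Hz camera clock vs. 1 kHz F/T stream (timestamps in seconds).
cam_ts = np.arange(0.0, 1.0, 1 / 30)
ft_ts = np.arange(0.0, 1.0, 1 / 1000)
ft_vals = np.sin(ft_ts)  # stand-in for one wrench axis

ft_at_cam = align_to_clock(cam_ts, ft_ts, ft_vals)
print(ft_at_cam.shape)  # (30,)
```

This only works if all sensors share a calibrated clock at collection time, which is exactly why force and tactile data cannot be retrofitted onto old vision-only demonstrations.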
One practical strategy: collect all demonstrations with the full sensor suite, then train ablated models (vision-only, vision+force, full multimodal) and evaluate each. This tells you the marginal value of each sensing modality for your specific task, which informs future data collection investments.
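The ablation comparison above reduces to training one model per modality subset and looking at the success-rate deltas between successive configs. A sketch with hypothetical config names and purely illustrative (not measured) numbers:

```python
# Hypothetical ablation plan: one trained model per modality subset.
ABLATIONS = {
    "vision_only": ["rgb", "wrist_rgb"],
    "vision_force": ["rgb", "wrist_rgb", "wrist_ft"],
    "full": ["rgb", "wrist_rgb", "wrist_ft", "tactile"],
}

def marginal_value(results):
    """Given {config_name: success_rate} in increasing-sensor order,
    report the gain contributed by each added modality group."""
    names = list(results)
    return {
        names[i]: round(results[names[i]] - results[names[i - 1]], 3)
        for i in range(1, len(names))
    }

# Illustrative numbers only, not measured results.
example = {"vision_only": 0.62, "vision_force": 0.81, "full": 0.86}
print(marginal_value(example))  # {'vision_force': 0.19, 'full': 0.05}
```

If the `full` delta is small for your task, that is the signal to stop collecting tactile data rather than paying its overhead on every demonstration.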
SVRC's data collection setup supports multi-sensor recording with synchronized timestamping across RGB, depth, wrist camera, and wrist F/T. Custom sensor integrations available on request.