Google Robot Benchmark
Real-world manipulation evaluation. 700+ tasks, multiple robot embodiments.
Overview
The Google Robot Benchmark evaluates policies on physical robots across 700+ tasks. It supports WidowX and other embodiments. Metrics include success rate, multi-task performance, and language grounding. It has been used to evaluate OpenVLA, RT-X, and related models.
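As a rough illustration of how the success-rate and multi-task metrics relate, the sketch below aggregates per-trial outcomes into per-task success rates and then averages them across tasks. The function name `success_rates`, the task strings, and the choice of an unweighted (macro) average are assumptions made for illustration; the benchmark's actual scoring protocol may differ.

```python
from collections import defaultdict


def success_rates(trials):
    """Aggregate (task_name, succeeded) pairs into per-task and multi-task rates.

    Hypothetical helper: not part of any benchmark tooling.
    """
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, total trials]
    for task, succeeded in trials:
        counts[task][0] += int(succeeded)
        counts[task][1] += 1
    if not counts:
        raise ValueError("no trials provided")

    per_task = {task: s / n for task, (s, n) in counts.items()}
    # Assumption: multi-task performance is the unweighted mean over tasks,
    # so a task with many trials does not dominate the score.
    multi_task = sum(per_task.values()) / len(per_task)
    return per_task, multi_task


# Made-up trial log, purely for demonstration.
trials = [
    ("pick up the spoon", True),
    ("pick up the spoon", False),
    ("open the drawer", True),
    ("open the drawer", True),
]
per_task, multi_task = success_rates(trials)
print(per_task)
print(multi_task)
```

Averaging per task rather than per trial keeps tasks with more rollouts from skewing the aggregate, which matters when trial counts vary across embodiments.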
Key Results
- InternVLA-M1: 71.7% (WidowX), 76–81% (other embodiments)
- OpenVLA: outperforms RT-2-X by 16.5% in absolute task success rate across 29 tasks
Related
- BridgeData — WidowX dataset
- OpenVLA — Model evaluation