CALVIN: Long-Horizon Language-Conditioned Robot Manipulation
A 24-hour, four-environment tabletop manipulation dataset of teleoperated play with natural language annotations — the canonical benchmark for compositional instruction following in robot policies.
TL;DR
| Metric | Value |
|---|---|
| Task count | 34 distinct subtasks, 5-instruction evaluation chains |
| Robots | Franka Emika Panda (PyBullet simulation) |
| Modalities | RGB static + gripper cameras, depth, proprioception, language |
| License | MIT |
| Size | ~24 hours of teleop play data (~166 GB uncompressed) |
| Environments | A, B, C, D (different textures, lighting, object layouts) |
What is CALVIN?
CALVIN (Composing Actions from Language and Vision) was released by the University of Freiburg to stress-test language-conditioned robot policies over long time horizons. It takes a different data-collection philosophy from clip-style benchmarks: rather than recording one demonstration per labeled task, CALVIN recorded ~24 hours of undirected teleoperated "play" on a Franka Emika Panda, then retroactively segmented those trajectories and crowd-labeled sampled windows with natural language. This produces both a dense imitation learning signal and a multi-task language corpus with many distinct English phrasings per skill.
The benchmark ships four environments (A, B, C, D) that share the same tabletop geometry but differ in textures, lighting, and the arrangement of interactive objects — a sliding door, a drawer, a switch, a button that toggles an LED, and blocks in various colors. The train→test splits (ABCD→D, ABC→D, D→D) let researchers isolate environment generalization from pure imitation performance.
CALVIN is the benchmark of choice whenever a paper claims "long-horizon language following" because the official evaluation protocol chains five consecutive instructions and only counts the rollout as successful if all five subtasks finish in order. That compounding failure mode makes CALVIN numbers much harder to saturate than single-task benchmarks, which is why HULC, RT-2, GR-1, and RoboFlamingo all report on it.
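The compounding failure mode is easy to quantify with a back-of-the-envelope calculation (my own illustration, not benchmark code): if a policy completes any single subtask with probability p, and failures are roughly independent, the chance of finishing a full five-instruction chain is p⁵.

```python
# Illustration of compounding failure over a 5-instruction chain:
# even a strong per-subtask success rate decays quickly when all
# five subtasks must succeed in order.
for p in (0.95, 0.90, 0.80):
    print(f"per-subtask success {p:.2f} -> full chain {p**5:.3f}")
```

So a policy that nails 90% of individual subtasks still finishes barely more than half of its five-step chains, which is why CALVIN chain scores saturate far more slowly than single-task success rates.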
How to download & load
```shell
# Clone the benchmark and install dependencies
git clone --recursive https://github.com/mees/calvin.git
cd calvin && sh install.sh

# Download the full ABCD split (~166 GB)
cd dataset && sh download_data.sh ABCD
```

```python
# Iterate demos in Python
from calvin_agent.datasets.npz_dataset import NpzDataset

d = NpzDataset(data_dir='dataset/task_ABCD_D/training')
print(d[0]['rgb_static'].shape, d[0]['language']['ann'])
```
For training at scale, most teams convert CALVIN into RLDS or LeRobot format so it can share a dataloader with Open X-Embodiment and BridgeData v2. Conversion scripts are maintained in the calvin/calvin_agent/datasets directory.
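If you want to bypass the wrapper classes, the raw data can be read directly with NumPy. Below is a hedged sketch of CALVIN's on-disk layout: each timestep is saved as its own `.npz` file (`episode_XXXXXXX.npz`), with keys such as `rgb_static`, `rgb_gripper`, `robot_obs`, and `actions`. The key names and array shapes here follow the published dataset description — verify them against your copy before relying on this. The example writes a dummy frame so it is self-contained.

```python
import os
import tempfile

import numpy as np

# Each CALVIN timestep lives in its own .npz file. We fabricate one frame
# here with the documented shapes (assumed, not loaded from the real dataset):
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "episode_0000000.npz")
    np.savez(
        path,
        rgb_static=np.zeros((200, 200, 3), dtype=np.uint8),   # static camera
        rgb_gripper=np.zeros((84, 84, 3), dtype=np.uint8),    # wrist camera
        robot_obs=np.zeros(15, dtype=np.float32),             # proprioception
        actions=np.zeros(7, dtype=np.float32),                # pose + gripper
    )
    # np.load on an .npz is lazy; wrap in dict() to read arrays eagerly
    # before the file goes away.
    frame = dict(np.load(path))

print(sorted(frame), frame["rgb_static"].shape)
```

Reading each timestep as a separate small file is exactly why teams convert to RLDS or LeRobot for large-scale training: sequential formats amortize file-open overhead across whole trajectories.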
Common use cases & pairings
- Language-conditioned policies. HULC, HULC++, and GR-1 all publish CALVIN numbers; it is the shortest path to demonstrating that a new language-conditioned policy works.
- VLA fine-tuning. RoboFlamingo, RT-2, and OpenVLA variants fine-tune on CALVIN to validate language grounding; sequence-chain accuracy is a good proxy for VLA instruction following.
- Latent planning / world models. The long unlabeled play data is frequently used to pretrain latent plan generators (LATMO, SkillMimic) before downstream task specialization.
- Generalization studies. The four-environment split lets researchers separate texture and layout shift from actual manipulation capability.
Benchmarks & leaderboards
The reported metric is the average number of consecutive subtasks completed per five-instruction chain (0 to 5). See the Papers with Code CALVIN leaderboard and the official project site for the current state of the art.
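The metric above can be sketched in a few lines (function and variable names are my own, not the official evaluation code): each rollout counts the consecutive successes before the first failure, and the leaderboard number is the mean over rollouts.

```python
# Minimal sketch of the chain-length metric: count consecutive subtask
# successes before the first failure, then average across rollouts.
# A perfect policy scores 5.0.
def chain_length(subtask_results):
    n = 0
    for ok in subtask_results:
        if not ok:
            break
        n += 1
    return n

rollouts = [
    [True, True, True, True, True],    # full chain: 5
    [True, True, False, True, True],   # later successes don't count: 2
    [False, False, False, False, False],  # immediate failure: 0
]
avg = sum(chain_length(r) for r in rollouts) / len(rollouts)
print(avg)
```

Note that successes after the first failure contribute nothing — the second rollout scores 2, not 4 — which is what makes the metric sensitive to long-horizon consistency.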
Related datasets
- LIBERO — sibling benchmark for lifelong imitation learning
- BridgeData v2 — real-world language-conditioned manipulation
- Open X-Embodiment — cross-embodiment pretraining that pairs well with CALVIN fine-tunes
- RoboMimic — single-task imitation baseline