CALVIN: Long-Horizon Language-Conditioned Robot Manipulation
A 24-hour, four-environment tabletop manipulation dataset of teleoperated play with natural language annotations — the canonical benchmark for compositional instruction following in robot policies.
TL;DR
| Metric | Value |
|---|---|
| Task count | 34 distinct subtasks, 5-instruction evaluation chains |
| Robots | Franka Emika Panda (PyBullet simulation) |
| Modalities | RGB static + gripper cameras, depth, proprioception, language |
| License | MIT |
| Size | ~24 hours of teleop play data (~166 GB uncompressed) |
| Environments | A, B, C, D (different textures, lighting, object layouts) |
What is CALVIN?
CALVIN (Composing Actions from Language and Vision) was released by the University of Freiburg to stress-test language-conditioned robot policies over long time horizons. It takes a different data-collection philosophy from clip-style benchmarks: rather than recording one demonstration per labeled task, CALVIN recorded ~24 hours of undirected teleoperated "play" on a Franka Emika Panda, then retroactively segmented those trajectories and crowd-labeled sampled windows with natural language. This produces both a dense imitation learning signal and a multi-task language corpus with many distinct English phrasings per skill.
The benchmark ships four environments (A, B, C, D) that share the same tabletop geometry but differ in textures, lighting, and the arrangement of interactive objects — a sliding door, a drawer, a switch, a button that toggles an LED, and blocks in various colors. The train→test splits (ABCD→D, ABC→D, D→D) let researchers isolate environment generalization from pure imitation performance.
CALVIN is the benchmark of choice whenever a paper claims "long-horizon language following" because the official evaluation protocol chains five consecutive instructions and only counts the rollout as successful if all five subtasks finish in order. That compounding failure mode makes CALVIN numbers much harder to saturate than single-task benchmarks, which is why HULC, RT-2, GR-1, and RoboFlamingo all report on it.
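The compounding failure mode is easy to quantify with a back-of-the-envelope calculation (my own illustration, not benchmark code): if a policy completes any single subtask with probability p, and failures are roughly independent, the chance of finishing a full five-instruction chain is p⁵.

```python
# Illustration of compounding failure over a 5-instruction chain:
# even a strong per-subtask success rate decays quickly when all
# five subtasks must succeed in order.
for p in (0.95, 0.90, 0.80):
    print(f"per-subtask success {p:.2f} -> full chain {p**5:.3f}")
```

So a policy that nails 90% of individual subtasks still finishes barely more than half of its five-step chains, which is why CALVIN chain scores saturate far more slowly than single-task success rates.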
How to download & load
```shell
# Clone the benchmark and install dependencies
git clone --recursive https://github.com/mees/calvin.git
cd calvin && sh install.sh

# Download the full ABCD split (~166 GB)
cd dataset && sh download_data.sh ABCD
```

```python
# Iterate demos in Python
from calvin_agent.datasets.npz_dataset import NpzDataset

d = NpzDataset(data_dir='dataset/task_ABCD_D/training')
print(d[0]['rgb_static'].shape, d[0]['language']['ann'])
```
For training at scale, most teams convert CALVIN into RLDS or LeRobot format so it can share a dataloader with Open X-Embodiment and BridgeData v2. Conversion scripts are maintained in the calvin/calvin_agent/datasets directory.
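If you want to bypass the wrapper classes, the raw data can be read directly with NumPy. Below is a hedged sketch of CALVIN's on-disk layout: each timestep is saved as its own `.npz` file (`episode_XXXXXXX.npz`), with keys such as `rgb_static`, `rgb_gripper`, `robot_obs`, and `actions`. The key names and array shapes here follow the published dataset description — verify them against your copy before relying on this. The example writes a dummy frame so it is self-contained.

```python
import os
import tempfile

import numpy as np

# Each CALVIN timestep lives in its own .npz file. We fabricate one frame
# here with the documented shapes (assumed, not loaded from the real dataset):
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "episode_0000000.npz")
    np.savez(
        path,
        rgb_static=np.zeros((200, 200, 3), dtype=np.uint8),   # static camera
        rgb_gripper=np.zeros((84, 84, 3), dtype=np.uint8),    # wrist camera
        robot_obs=np.zeros(15, dtype=np.float32),             # proprioception
        actions=np.zeros(7, dtype=np.float32),                # pose + gripper
    )
    # np.load on an .npz is lazy; wrap in dict() to read arrays eagerly
    # before the file goes away.
    frame = dict(np.load(path))

print(sorted(frame), frame["rgb_static"].shape)
```

Reading each timestep as a separate small file is exactly why teams convert to RLDS or LeRobot for large-scale training: sequential formats amortize file-open overhead across whole trajectories.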
Common use cases & pairings
- Language-conditioned policies. HULC, HULC++, and GR-1 all publish CALVIN numbers; it is the shortest path to demonstrating that a new language-conditioned policy works.
- VLA fine-tuning. RoboFlamingo, RT-2, and OpenVLA variants fine-tune on CALVIN to validate language grounding; sequence-chain accuracy is a good proxy for VLA instruction following.
- Latent planning / world models. The long unlabeled play data is frequently used to pretrain latent plan generators (LATMO, SkillMimic) before downstream task specialization.
- Generalization studies. The four-environment split lets researchers separate texture and layout shift from actual manipulation capability.
Benchmarks & leaderboards
The reported metric is the average number of consecutive subtasks completed per five-instruction chain (0 to 5). See the Papers with Code CALVIN leaderboard and the official project site for the current state of the art.
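The metric above can be sketched in a few lines (function and variable names are my own, not the official evaluation code): each rollout counts the consecutive successes before the first failure, and the leaderboard number is the mean over rollouts.

```python
# Minimal sketch of the chain-length metric: count consecutive subtask
# successes before the first failure, then average across rollouts.
# A perfect policy scores 5.0.
def chain_length(subtask_results):
    n = 0
    for ok in subtask_results:
        if not ok:
            break
        n += 1
    return n

rollouts = [
    [True, True, True, True, True],    # full chain: 5
    [True, True, False, True, True],   # later successes don't count: 2
    [False, False, False, False, False],  # immediate failure: 0
]
avg = sum(chain_length(r) for r in rollouts) / len(rollouts)
print(avg)
```

Note that successes after the first failure contribute nothing — the second rollout scores 2, not 4 — which is what makes the metric sensitive to long-horizon consistency.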
Related datasets
- LIBERO — sibling benchmark for lifelong imitation learning
- BridgeData v2 — real-world language-conditioned manipulation
- Open X-Embodiment — cross-embodiment pretraining that pairs well with CALVIN fine-tunes
- RoboMimic — single-task imitation baseline