LIBERO vs CALVIN: Robot Learning Benchmark Comparison 2026

LIBERO and CALVIN are two of the most widely cited simulated benchmarks in modern robot learning. Both target language-conditioned manipulation, but they were designed with very different research goals: LIBERO for lifelong learning and VLA evaluation on discrete task suites, CALVIN for long-horizon chains of skills in a shared tabletop environment. If you are choosing which one to report numbers on in 2026, this guide gives you a head-to-head comparison across scale, structure, language, embodiment, compute, and licensing.

TL;DR

  • Lifelong / continual learning: LIBERO is the default — it was built explicitly for knowledge-transfer evaluation across four task suites.
  • Long-horizon language chains: CALVIN wins — its unique selling point is evaluating sequences of up to five language instructions in one rollout.
  • VLA model benchmarking: LIBERO is more common in recent papers (OpenVLA, pi0, Octo) because its discrete suites are easy to report.
  • Environment split generalization: CALVIN wins — four visually distinct environments (A, B, C, D) give a clean train/test protocol out of the box.
  • Ease of running locally: LIBERO is lighter — Robosuite + MuJoCo, installable with pip. CALVIN ships a dedicated PyBullet-based simulator.

At a glance

Attribute | LIBERO | CALVIN
First release | 2023 (NeurIPS D&B track) | 2022 (RA-L)
Primary goal | Lifelong learning + imitation / VLA eval | Long-horizon language-conditioned manipulation
Task suites | 4 suites: Spatial, Object, Goal, LIBERO-100 (Long) | 34 skill classes, chained into sequences of 5
Total tasks | ~130 tasks | One shared environment with compositional task chains
Demonstration scale | ~6,500 human teleoperation demos (50 per task) | ~24 hours of teleoperation across 4 environments
Simulator | Robosuite / MuJoCo | PyBullet (custom tabletop)
Embodiment | Franka Emika Panda (7-DoF) | Franka Emika Panda (7-DoF) with parallel gripper
Environment splits | Per-suite train/test with held-out configurations | Environments A, B, C, D for zero-shot transfer
Language labels | Yes, per task | Yes, crowdsourced natural language instructions
Evaluation metric | Success rate per suite + forward / backward transfer | Avg. sequence length completed (out of 5)
License | MIT | MIT
Paper | arXiv:2306.03310 | arXiv:2112.03227

What LIBERO actually tests

LIBERO (a lifelong learning benchmark for robot manipulation) was built by researchers at UT Austin and collaborators in response to a gap in the imitation-learning literature: most prior benchmarks measured policy quality on a fixed task distribution, but very few measured how a robot accumulates and transfers knowledge as new tasks arrive. The benchmark is structured into four task suites — three suites of 10 tasks each plus the 100-task LIBERO-100 — with 50 teleoperation demonstrations per task, for a total of 130 tasks and roughly 6,500 trajectories.

The four suites are LIBERO-Spatial (same objects, different spatial relationships), LIBERO-Object (different object instances in the same layout), LIBERO-Goal (different goal instructions with shared objects), and LIBERO-100, a much harder set of 100 tasks drawn from realistic kitchen and living-room scenes; its 10-task long-horizon subset is commonly reported as LIBERO-Long (also called LIBERO-10). The first three are designed to isolate a single axis of distribution shift; LIBERO-100 mixes everything together.

What makes LIBERO appealing for modern VLA evaluation is that the discrete task structure maps cleanly onto leaderboard-style reporting. When OpenVLA, Octo, pi0, and similar foundation policies are evaluated, papers typically report four numbers — one per suite — plus an average. That is much easier to consume than a single long-horizon score and it has effectively made LIBERO the default VLA eval suite since mid-2024.
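The reporting convention is simple enough to sketch in a few lines. The suite names below come from the benchmark; the success rates are placeholders, not published results:

```python
# Hypothetical per-suite success rates (placeholders, not real numbers).
suite_success = {
    "LIBERO-Spatial": 0.84,
    "LIBERO-Object": 0.88,
    "LIBERO-Goal": 0.79,
    "LIBERO-100": 0.45,
}

# The headline "average" in most papers is the plain mean over suites.
average = sum(suite_success.values()) / len(suite_success)

for suite, rate in suite_success.items():
    print(f"{suite:>14}: {rate:.1%}")
print(f"{'Average':>14}: {average:.1%}")  # prints 74.0% for these placeholders
```

Note how the strong Spatial/Object/Goal numbers mask the weak LIBERO-100 number in the mean — which is exactly why the per-suite breakdown matters.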

What CALVIN actually tests

CALVIN ("Composing Actions from Language and Vision") was released by the University of Freiburg in 2022 and answers a different question: can a policy chain skills it has seen individually into long sequences specified by natural language? Instead of hundreds of separate task files, CALVIN ships a single shared tabletop environment with drawers, sliding doors, coloured blocks, an LED, and a button. Around 34 elementary skill types can be composed into chains of five instructions that the policy has to execute back-to-back.

The dataset is roughly 24 hours of human teleoperation distributed across four visually distinct environments (A, B, C, D). The standard evaluation protocol trains on a subset of the environments (most commonly A, B, and C) and evaluates on the held-out environment D, producing a natural zero-shot generalization metric. The headline number CALVIN reports is the average number of instructions successfully completed out of five, which creates a steep difficulty gradient that is still not saturated in 2026.

CALVIN is also one of the few benchmarks where you can see how much long-horizon performance collapses relative to single-skill accuracy. Policies that get 95 percent on individual skills often drop to under two chained instructions, which is exactly the kind of signal you want when testing memory, planning, or goal-conditioned objectives.

Tradeoffs: when to pick which

Pick LIBERO if you want a fast, low-friction benchmark that plugs into Robosuite, you care about lifelong or continual learning, you want your numbers to be directly comparable to the OpenVLA / Octo / pi0 leaderboard, or you need a clean per-suite breakdown for an ablation. LIBERO is also the easier choice if you run your RL environments on a laptop — MuJoCo is light, and a single suite finishes in an hour on a decent GPU.

Pick CALVIN if you need long-horizon compositional reasoning, you care about the gap between skill-level and chain-level performance, or you want to test goal-image or language-only conditioning in a shared environment. CALVIN is also the more interesting benchmark for hierarchical or planning-based policies because its chained sequences surface failures that a per-task metric hides.

Pick both if you are publishing a VLA paper in 2026. It has become a near-requirement to report LIBERO numbers, and CALVIN sequences give reviewers the long-horizon signal that LIBERO does not.

Compute and hardware

Both benchmarks run on a single consumer GPU. LIBERO, because of MuJoCo, is CPU-bound for the physics step and benefits from an 8+ core CPU when rolling out many parallel environments. CALVIN runs on PyBullet, which makes it less parallel-friendly but easier to debug with image-space probes. Neither benchmark needs a data-center GPU — an RTX 4090 or even a 3090 is enough to evaluate full suites.

For training, the picture is different. Full-scale VLA training on top of LIBERO or CALVIN demos still wants 4-8 GPUs, not because the sim is heavy but because the vision-language backbones are. If you are bringing your own trained policy and only evaluating, a single workstation works. If you want to iterate on policy architectures, plan for multi-GPU hosts. Our data platform can also offload trajectory storage and evaluation orchestration.

Data collection workflow

Neither dataset was collected on real hardware: LIBERO's demonstrations come from desktop teleoperation in simulation, while CALVIN's play data was teleoperated with a VR controller. Simulated teleoperation keeps the demos low-noise and highly repeatable, but it also makes the motion distribution narrower than real-robot datasets such as BridgeData. Teams that train on LIBERO or CALVIN and deploy on real hardware frequently need to augment with more diverse teleop data. That is where our custom data services fit in.

Licensing reality check

Both projects are MIT-licensed, which is about as permissive as academic data gets. You can train commercial policies on either dataset, redistribute derivative checkpoints, and embed them in products. That said, the assets (mesh files, texture maps) bundled with Robosuite and CALVIN have their own licenses; if you extract and re-host assets, re-check the upstream attribution. In practice, teams treat the trajectory data as MIT and the simulators as the upstream license of Robosuite or PyBullet.

Ecosystem and tooling in 2026

LIBERO has the larger ecosystem of wrappers: LeRobot supports it, HuggingFace has several preprocessed releases, and most VLA repos ship a LIBERO eval script. CALVIN sits a layer deeper — to use it you typically clone the official CALVIN repo and run their eval harness. Both are well-maintained; LIBERO is the safer pick if you want zero integration work.

Reported results you will see in 2026 papers

On LIBERO, strong 2025-era VLA policies (OpenVLA, pi0, Octo variants) typically land in the 70-90 percent success range on Spatial, Object, and Goal, and 30-60 percent on LIBERO-100 depending on training recipe. The spread is interesting: policies that do well on the first three suites sometimes fall off a cliff on LIBERO-100, which reveals a generalization gap that a single averaged number would hide. When you read a new paper, always check the per-suite breakdown rather than the mean; a paper that reports only the average is usually papering over a weak LIBERO-100 number.

On CALVIN, the metric to watch is Avg. Seq. Len. — the mean number of instructions completed in a rollout of five. Strong 2025 models reach roughly 3.5-4.0 on the easier A-to-A evaluation and drop to 2.0-3.0 on the zero-shot D evaluation. Anything above 4.0 on zero-shot D is considered state-of-the-art as of early 2026. Because CALVIN uses a product-of-successes metric, even small per-step regressions compound dramatically, which is precisely the property that makes CALVIN a useful benchmark.
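To see the compounding concretely, here is a small arithmetic sketch (not CALVIN code). Under the idealized assumption that each instruction in a chain succeeds independently with probability p, completing instruction k requires all of steps 1..k to succeed, so the expected sequence length out of five is the sum of p^k for k = 1..5:

```python
def expected_seq_len(p: float, chain_len: int = 5) -> float:
    """Expected instructions completed if each step succeeds independently
    with probability p (completing step k needs all of steps 1..k)."""
    return sum(p ** k for k in range(1, chain_len + 1))

# Even a 95%-per-step policy averages well under 5 of 5:
print(round(expected_seq_len(0.95), 2))  # prints 4.3
# A 5-point per-step regression costs over half an instruction:
print(round(expected_seq_len(0.90), 2))  # prints 3.69
```

Real policies typically fall well below this independence bound, because failures compound through state — a knocked-over block can break every later instruction in the chain — which is exactly the gap CALVIN is designed to surface.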

Integration with the LeRobot stack

Both datasets are available through LeRobot in mirrored form. LeRobot normalizes the trajectory schema, exposes a PyTorch Dataset interface, and handles chunked video decoding. If you are training a diffusion policy or transformer policy from scratch, the LeRobot path is the fastest way to consume both LIBERO and CALVIN without writing custom loaders. For evaluation, however, you still want to run the original LIBERO or CALVIN eval harness — they carry the official task configurations and success checkers that the leaderboards depend on.
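The exact LeRobot API changes between releases, so treat the following as an illustration of what schema normalization buys you rather than real LeRobot code — every key name here is invented for the example. The point is that differently-keyed LIBERO-style and CALVIN-style episodes end up with the same fields, so one training loop can consume both:

```python
# Illustrative only: map benchmark-specific episode keys onto a shared
# (observation, action, task) schema, the way a LeRobot-style loader does.
# All key names below are made up for this example.
def normalize_episode(raw: dict) -> dict:
    """Return an episode dict with a uniform schema regardless of source."""
    if "agentview_rgb" in raw:  # LIBERO-style episode (hypothetical keys)
        return {"observation.image": raw["agentview_rgb"],
                "action": raw["actions"],
                "task": raw["language_instruction"]}
    if "rgb_static" in raw:     # CALVIN-style episode (hypothetical keys)
        return {"observation.image": raw["rgb_static"],
                "action": raw["rel_actions"],
                "task": raw["lang_annotation"]}
    raise KeyError("unrecognized episode schema")

libero_ep = {"agentview_rgb": "frames...", "actions": [0.1, 0.2],
             "language_instruction": "put the bowl on the plate"}
calvin_ep = {"rgb_static": "frames...", "rel_actions": [0.3, 0.4],
             "lang_annotation": "open the drawer"}

for ep in (libero_ep, calvin_ep):
    print(sorted(normalize_episode(ep)))  # same keys for both sources
```

This is also why the original eval harnesses remain necessary: normalization makes training uniform, but the success checkers are benchmark-specific.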

Verdict

There is no single winner. If you can only run one, run LIBERO — it is cheaper, faster, more standard in 2026 papers, and plugs into the widest range of pretrained policies. If your research question is specifically about long-horizon language chains, CALVIN is still unique and still not saturated. Most serious VLA papers today report both, and you probably should too. Teams with tight compute budgets should prioritize LIBERO; teams studying planning, memory, or hierarchical policies should prioritize CALVIN. And teams that are serious about shipping should, after benchmarks, move to real-world data collection — our data services can bridge the gap between LIBERO-style sim eval and a deployable real-world policy.

Need demos LIBERO and CALVIN do not cover?

We collect high-quality teleoperation data on your hardware, in your environments, with your tasks.