HumanML3D: Text-to-Motion Generation Dataset

14,616 motions with 44,970 text descriptions. The standard benchmark for language-conditioned motion generation. MIT licensed.

License: MIT (open). Format: NPY + text. Category: motion-language.

Key Stats

Metric | Value
Motions | 14,616 unique motion sequences
Text descriptions | 44,970 (avg. ~3 per motion)
Source | AMASS motion capture + HumanAct12
Representation | 263-dim feature vector per frame (joints, velocities, foot contact)
License | MIT
Benchmark models | MDM, MotionDiffuse, MoMask, T2M-GPT, ReMoDiffuse

What is HumanML3D?

HumanML3D is the standard benchmark dataset for text-to-motion generation. It pairs 14,616 human motion sequences, drawn from the AMASS motion capture archive and HumanAct12, with 44,970 crowd-sourced natural language descriptions. Each motion has about three descriptions on average, written in plain English, such as "a person walks forward slowly and then turns to the left" or "someone jumps in place three times."
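As a rough illustration of how the motion-text pairing can be read, the sketch below parses one annotation line. It assumes the `caption#tokens#start#end` line layout used by the per-motion text files in the public release, where the tokens are `word/POS` pairs and a start and end of 0.0 mean the caption covers the whole sequence. The `Annotation` class, the `parse_annotation_line` helper, and the sample line are illustrative, so check the layout against your copy of the data.

```python
import math
from dataclasses import dataclass


@dataclass
class Annotation:
    caption: str      # free-text description
    tokens: list      # "word/POS" strings
    start_s: float    # segment start in seconds (0.0 if unset)
    end_s: float      # segment end in seconds (0.0 if unset)

    @property
    def covers_whole_motion(self):
        # Assumption: 0.0/0.0 marks a caption for the full sequence.
        return self.start_s == 0.0 and self.end_s == 0.0


def _to_seconds(field):
    # Treat a missing (NaN) tag as 0.0; this is a defensive choice.
    value = float(field)
    return 0.0 if math.isnan(value) else value


def parse_annotation_line(line):
    parts = line.strip().split("#")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '#'-separated fields, got {len(parts)}")
    caption, tokens, start, end = parts
    return Annotation(caption, tokens.split(" "), _to_seconds(start), _to_seconds(end))


# Made-up sample line in the assumed layout.
sample = (
    "a person walks forward slowly and then turns to the left#"
    "a/DET person/NOUN walk/VERB forward/ADV slowly/ADV and/CCONJ "
    "then/ADV turn/VERB to/ADP the/DET left/NOUN#0.0#0.0"
)
annotation = parse_annotation_line(sample)
```

A motion with three descriptions would yield three such records, one per line of its text file.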

The motion representation uses a 263-dimensional feature vector per frame that encodes root angular and linear velocity, root height, local joint positions, joint velocities, joint rotations, and binary foot contact labels. This compact representation has become the de facto standard in the text-to-motion research community.
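One way to see where the 263 dimensions come from is to split a frame into named blocks. The sketch below assumes a 22-joint skeleton and the block order used by the reference preprocessing scripts (root terms first, then positions, 6D rotations, velocities, and foot contacts). The `LAYOUT` table and the `split_features` helper are illustrative names, and the ordering should be verified against the data you actually load.

```python
import numpy as np

NUM_JOINTS = 22  # assumption: 22-joint skeleton

# (block name, width) in the assumed order within each frame.
LAYOUT = [
    ("root_angular_velocity", 1),
    ("root_linear_velocity", 2),
    ("root_height", 1),
    ("joint_positions", (NUM_JOINTS - 1) * 3),     # non-root joints, xyz
    ("joint_rotations_6d", (NUM_JOINTS - 1) * 6),  # non-root joints, 6D rotation
    ("joint_velocities", NUM_JOINTS * 3),          # all joints, xyz
    ("foot_contacts", 4),
]
FEATURE_DIM = sum(width for _, width in LAYOUT)


def split_features(motion):
    """Split a (frames, 263) array into a dict of named column blocks."""
    motion = np.asarray(motion)
    if motion.ndim != 2 or motion.shape[1] != FEATURE_DIM:
        raise ValueError(f"expected shape (frames, {FEATURE_DIM}), got {motion.shape}")
    parts = {}
    start = 0
    for name, width in LAYOUT:
        parts[name] = motion[:, start:start + width]
        start += width
    return parts


# Synthetic stand-in for a loaded motion file: 60 frames of zeros.
demo = np.zeros((60, FEATURE_DIM), dtype=np.float32)
parts = split_features(demo)
```

The widths sum to 1 + 2 + 1 + 63 + 126 + 66 + 4 = 263, which matches the per-frame dimensionality quoted above.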

Relevance to robotics

Models trained on HumanML3D enable language-conditioned motion generation for humanoid robots. Rather than hand-crafting motion primitives, engineers can describe a desired behavior in natural language and generate candidate trajectories. These trajectories can then be retargeted to a specific robot embodiment (Unitree G1, Fourier GR-1, etc.) with a retargeting pipeline that maps the human skeleton onto the robot's kinematics.
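To make the retargeting step concrete, here is a minimal position-based sketch: it keeps each bone's direction from the generated human motion and substitutes the robot's bone length, walking the kinematic chain from the root outward. The function name `retarget_positions`, its parameters, and the three-joint toy chain are hypothetical. The sketch covers geometry only and ignores joint limits, balance, and contact constraints, which a real pipeline has to handle.

```python
import numpy as np


def retarget_positions(src, parents, target_bone_lengths, root_scale=1.0):
    """Rescale bones of a joint-position trajectory to a target skeleton.

    src: (frames, joints, 3) source joint positions.
    parents: parents[j] is the parent index of joint j, or -1 for the root.
             Parents must appear before their children.
    target_bone_lengths: target_bone_lengths[j] is the length of the bone
             from parents[j] to j on the target; the root entry is unused.
    root_scale: factor applied to the root trajectory.
    """
    src = np.asarray(src, dtype=np.float64)
    out = np.zeros_like(src)
    for j, p in enumerate(parents):
        if p < 0:
            out[:, j] = src[:, j] * root_scale
            continue
        bone = src[:, j] - src[:, p]
        length = np.linalg.norm(bone, axis=-1, keepdims=True)
        direction = bone / np.maximum(length, 1e-8)  # guard zero-length bones
        out[:, j] = out[:, p] + direction * target_bone_lengths[j]
    return out


# Toy chain: root -> knee -> foot, two identical frames.
parents = [-1, 0, 1]
src = np.array([[[0.0, 1.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]] * 2)
out = retarget_positions(src, parents, target_bone_lengths=[0.0, 0.3, 0.3], root_scale=0.6)
```

In this toy case the source bones are 0.5 units long and the output bones are 0.3 units long, while every bone still points in its original direction.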

Language-conditioned robot control

We develop motion generation pipelines for humanoid robots using HumanML3D-trained models.