HumanML3D: Text-to-Motion Generation Dataset
14,616 motions with 44,970 text descriptions. The standard benchmark for language-conditioned motion generation. MIT licensed.
Key Stats
| Metric | Value |
|---|---|
| Motions | 14,616 unique motion sequences |
| Text descriptions | 44,970 (avg. ~3 per motion) |
| Source | AMASS motion capture + HumanAct12 |
| Representation | 263-dim feature vector per frame (root motion, joint positions, rotations, velocities, foot contacts) |
| License | MIT |
| Benchmark models | MDM, MotionDiffuse, MoMask, T2M-GPT, ReMoDiffuse |
What is HumanML3D?
HumanML3D is the standard benchmark dataset for text-to-motion generation. It pairs 14,616 human motion sequences, drawn from the AMASS motion capture archive and HumanAct12, with 44,970 crowd-sourced English descriptions, or about three per motion. Typical descriptions read "a person walks forward slowly and then turns to the left" or "someone jumps in place three times."
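To make the motion/description pairing concrete, here is a minimal parsing sketch. It assumes the annotation layout used in the public HumanML3D release, where each line of a motion's text file holds four `#`-separated fields: the caption, a part-of-speech-tagged version of it, and a start and end time in seconds. The `Caption` class, the `parse_caption_line` helper, and the POS tags in the sample line are illustrative and not part of the dataset tooling.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    text: str          # original description
    tokens: list[str]  # "word/POS" tokens from the processed sentence
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds

    @property
    def covers_whole_motion(self) -> bool:
        # Assumption: start == end == 0.0 marks a caption for the full sequence.
        return self.start == 0.0 and self.end == 0.0


def parse_caption_line(line: str) -> Caption:
    """Parse one 'caption#tagged tokens#start#end' annotation line."""
    fields = line.strip().split("#")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '#'-separated fields, got {len(fields)}")
    text, tagged, start, end = fields
    return Caption(text, tagged.split(), float(start), float(end))


# Hand-written sample line; the POS tags are illustrative only.
sample = (
    "a person walks forward slowly and then turns to the left."
    "#a/DET person/NOUN walk/VERB forward/ADV slowly/ADV and/CCONJ "
    "then/ADV turn/VERB to/ADP the/DET left/NOUN#0.0#0.0"
)
caption = parse_caption_line(sample)
print(caption.text)
print(len(caption.tokens), caption.covers_whole_motion)
```

A motion with three descriptions would yield three such lines, so reading one annotation file gives all captions for that sequence.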
Each frame is represented by a 263-dimensional feature vector that encodes root angular and linear velocity, root height, joint positions, joint velocities, joint rotations, and binary foot contact labels. This compact representation has become the de facto standard in text-to-motion research.
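The sketch below shows how the 263 dimensions can add up. It assumes the 22-joint skeleton and the component order used by the reference preprocessing code (root terms first, then root-relative joint positions, 6D joint rotations, joint velocities, and four foot contact flags). The names in `LAYOUT` and the `feature_slices` helper are made up for illustration; verify the order against the preprocessing scripts before relying on it.

```python
NUM_JOINTS = 22  # assumption: 22-joint skeleton

# (component name, width) in the assumed storage order.
LAYOUT = [
    ("root_angular_velocity", 1),                  # rotation about the vertical axis
    ("root_linear_velocity", 2),                   # velocity in the ground plane
    ("root_height", 1),
    ("joint_positions", (NUM_JOINTS - 1) * 3),     # non-root joints, xyz
    ("joint_rotations_6d", (NUM_JOINTS - 1) * 6),  # non-root joints, 6D rotation
    ("joint_velocities", NUM_JOINTS * 3),          # all joints, xyz
    ("foot_contacts", 4),                          # two flags per foot
]

FEATURE_DIM = sum(width for _, width in LAYOUT)  # 263


def feature_slices() -> dict[str, slice]:
    """Map each component name to its slice of the per-frame vector."""
    slices = {}
    offset = 0
    for name, width in LAYOUT:
        slices[name] = slice(offset, offset + width)
        offset += width
    return slices


# Demonstrate on one dummy frame; real data would be loaded from disk.
frame = [0.0] * FEATURE_DIM
slices = feature_slices()
for name, s in slices.items():
    print(f"{name:24s} dims {s.start:3d}..{s.stop - 1:3d} ({len(frame[s])} values)")
```

The same slices apply to the last axis of a `(frames, 263)` array, for example `motion[:, slices["foot_contacts"]]` with NumPy.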
Relevance to robotics
Models trained on HumanML3D can generate language-conditioned motion for humanoid robots. Rather than hand-crafting motion primitives, engineers describe the desired behavior in natural language and generate candidate trajectories. Because those trajectories are defined on a human skeleton, they must then be retargeted to the specific robot embodiment (Unitree G1, Fourier GR-1, etc.) using a standard retargeting pipeline.
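As a rough illustration of the retargeting step, the sketch below performs only uniform keypoint scaling, a common first stage before inverse kinematics and joint-limit handling. It is not a complete retargeting pipeline. The `scale_keypoints` helper, the two-joint toy trajectory, and the heights are hypothetical values chosen for the example, not specifications of any robot.

```python
Point = tuple[float, float, float]


def scale_keypoints(
    frames: list[list[Point]],
    source_height: float,
    target_height: float,
) -> list[list[Point]]:
    """Uniformly scale 3D keypoints from a human skeleton to a robot's size.

    Positions are assumed to be expressed relative to a ground-level origin,
    so scaling about the origin keeps the feet on the floor.
    """
    if source_height <= 0 or target_height <= 0:
        raise ValueError("heights must be positive")
    ratio = target_height / source_height
    return [
        [(x * ratio, y * ratio, z * ratio) for (x, y, z) in frame]
        for frame in frames
    ]


# Toy trajectory: one frame with two joints (illustrative values only).
human_frames = [[(0.0, 0.9, 0.0), (0.1, 1.6, 0.0)]]
robot_frames = scale_keypoints(human_frames, source_height=1.8, target_height=0.9)
print(robot_frames)
```

A real pipeline would follow this with per-limb length correction, inverse kinematics onto the robot's joints, and checks against joint and torque limits.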
Related datasets
- AMASS -- the motion capture foundation for HumanML3D
- CMU MoCap -- one of the original motion capture collections included in AMASS
- Unitree G1 Datasets -- real humanoid robot data