DINOv2

A self-supervised vision transformer trained by Meta on LVD-142M, a curated dataset of 142 million images, using a self-distillation objective. DINOv2 learns strong visual representations without any labels, and its features transfer well to robotics: manipulation policies built on frozen DINOv2 encoders perform competitively with minimal fine-tuning on robot data.
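A minimal sketch of the frozen-encoder setup described above, using the `torch.hub` entry point published in Meta's `facebookresearch/dinov2` repository (the model name `dinov2_vits14` and the 384-dim output are specific to the ViT-S/14 variant; a downstream policy head consuming these embeddings is assumed, not shown):

```python
import torch

# Load a pretrained DINOv2 ViT-S/14 encoder from torch.hub
# (downloads weights from the facebookresearch/dinov2 repo on first use).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()
for p in model.parameters():
    p.requires_grad = False  # keep the encoder frozen; only a policy head would train

# Dummy batch of RGB observations; height/width must be multiples
# of the 14x14 patch size.
x = torch.randn(2, 3, 224, 224)

with torch.no_grad():
    feats = model(x)  # CLS-token embedding, shape (2, 384) for ViT-S/14

print(feats.shape)
```

A manipulation policy would typically concatenate these per-frame embeddings with proprioceptive state and feed them to a small trainable head, leaving the DINOv2 weights untouched.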

ML, Vision, Representation Learning
