CLIP
Contrastive Language-Image Pre-training — a model trained by OpenAI on 400M image-text pairs to learn aligned visual and linguistic representations. CLIP embeddings are used in robotics for open-vocabulary object detection, language-conditioned manipulation, and reward specification. VLA models such as RT-2, as well as planning systems such as SayCan, leverage CLIP-style vision-language alignment to ground language commands in robotic actions.
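
As a rough illustration of the open-vocabulary pattern, the sketch below scores one image against free-form text labels using the public openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library. The image path and label set are illustrative placeholders, and this shows only the generic zero-shot scoring recipe, not any specific robotics pipeline.

```python
# Minimal sketch: zero-shot open-vocabulary scoring with CLIP.
# The checkpoint is OpenAI's public release; "scene.jpg" and the
# label list are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # e.g. a camera frame from the robot
labels = ["a red mug", "a screwdriver", "a roll of tape"]  # free-form queries

# Embed the image and each text prompt in the shared space; the model
# returns similarity scores as scaled logits over the label set.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # shape: (1, len(labels))
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is just a list of strings, new object categories can be queried at runtime without retraining, which is what makes this alignment useful for open-vocabulary detection and language-conditioned tasks.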