Understanding CLIP for vision language models

(medium.com)

1 point | by flehn 2 hours ago

1 comment

  • flehn 2 hours ago

    The key idea is to bring text that describes an image close to the image itself in a shared embedding space (often referred to as the latent space). The model is trained so that a text embedding and its corresponding image embedding land near each other in this space, meaning they represent the same concept.
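
    The idea above can be sketched in a few lines. This is a toy illustration, not the real CLIP model: the embeddings are hand-made stand-ins, and the point is only to show that "close in embedding space" means high cosine similarity between a matching image/text pair compared to a mismatched one.

```python
import numpy as np

def normalize(v):
    # Unit-normalize so the dot product equals cosine similarity,
    # as CLIP-style models do before comparing embeddings.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in embeddings (real CLIP uses learned encoders and ~512 dims):
# an image, a caption that matches it, and an unrelated caption.
image_emb  = normalize(np.array([1.0, 0.9, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1]))
text_match = normalize(np.array([0.9, 1.0, 0.0, 0.1, 0.1, 0.2, 0.1, 0.0]))
text_other = normalize(np.array([-0.5, 0.1, 1.0, 0.8, 0.0, 0.0, 0.9, 0.2]))

# Cosine similarity of unit vectors is just their dot product.
# Contrastive training pushes the matching pair's similarity up
# and mismatched pairs' similarity down.
sim_match = float(image_emb @ text_match)
sim_other = float(image_emb @ text_other)
print(sim_match > sim_other)  # the matching caption scores higher
```

    At inference time this same comparison enables zero-shot classification: embed one image and several candidate captions, and pick the caption with the highest similarity.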