Understanding CLIP for vision language models

(medium.com)

1 point | by flehn 2 hours ago

1 comment

  • flehn 2 hours ago

    The key idea is to bring text that describes an image close to the image itself in a shared embedding space (often referred to as the latent space). The model is trained so that a text embedding and its corresponding image embedding land near each other in this space, meaning they represent the same concept.
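
    The idea above can be sketched in a few lines. This is a toy illustration, not the real CLIP model: the embeddings are hand-made stand-ins, and the point is only to show that "close in embedding space" means high cosine similarity between a matching image/text pair compared to a mismatched one.

```python
import numpy as np

def normalize(v):
    # Unit-normalize so the dot product equals cosine similarity,
    # as CLIP-style models do before comparing embeddings.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in embeddings (real CLIP uses learned encoders and ~512 dims):
# an image, a caption that matches it, and an unrelated caption.
image_emb  = normalize(np.array([1.0, 0.9, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1]))
text_match = normalize(np.array([0.9, 1.0, 0.0, 0.1, 0.1, 0.2, 0.1, 0.0]))
text_other = normalize(np.array([-0.5, 0.1, 1.0, 0.8, 0.0, 0.0, 0.9, 0.2]))

# Cosine similarity of unit vectors is just their dot product.
# Contrastive training pushes the matching pair's similarity up
# and mismatched pairs' similarity down.
sim_match = float(image_emb @ text_match)
sim_other = float(image_emb @ text_other)
print(sim_match > sim_other)  # the matching caption scores higher
```

    At inference time this same comparison enables zero-shot classification: embed one image and several candidate captions, and pick the caption with the highest similarity.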