

Poster

Modeling Caption Diversity in Contrastive Visual Language Pretraining

Samuel Lavoie · Polina Kirichenko · Mark Ibrahim · Mahmoud Assran · Andrew Wilson · Aaron Courville · Nicolas Ballas


Abstract:

There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector, limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of feature proposals that are mixed into a final visual prediction by conditioning on the context derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks. Llip improves zero-shot classification by an average of 2\% across zero-shot classification benchmarks. Specifically, Llip attains a zero-shot top-1 accuracy of 80.9\% on ImageNet with a ViT-L/14, outperforming a similarly sized CLIP by 1.4\% and a larger CLIP pre-trained with a ViT-H by 0.4\%. We also demonstrate an improvement of 3.6\% on zero-shot retrieval on MS-COCO. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to more robust, richer visual representations.
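The text-conditioned mixing described above can be sketched as a single-query cross-attention over the visual feature proposals. This is a minimal illustration, not the authors' implementation; the proposal count, the projection layers, and the use of one query per caption are assumptions made for the sketch.

```python
# Minimal sketch (not the authors' code) of mixing visual feature proposals
# conditioned on a text-derived context, as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedMixer(nn.Module):
    """Mixes K per-image feature proposals into one embedding using the caption as context."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # query from the pooled text representation
        self.k_proj = nn.Linear(dim, dim)  # keys from the visual feature proposals
        self.scale = dim ** -0.5

    def forward(self, visual_proposals: torch.Tensor, text_context: torch.Tensor) -> torch.Tensor:
        # visual_proposals: (batch, K, dim) -- K feature proposals per image
        # text_context:     (batch, dim)    -- pooled text representation
        q = self.q_proj(text_context).unsqueeze(1)                        # (batch, 1, dim)
        k = self.k_proj(visual_proposals)                                 # (batch, K, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (batch, 1, K)
        mixed = (attn @ visual_proposals).squeeze(1)                      # (batch, dim)
        return F.normalize(mixed, dim=-1)                                 # unit-norm visual embedding


# Usage: the mixed visual embedding is then compared against the normalized
# text embedding with a standard contrastive objective, as in CLIP-style training.
mixer = TextConditionedMixer(dim=512)
proposals = torch.randn(4, 32, 512)  # dummy batch of 32 proposals per image
text = torch.randn(4, 512)           # dummy pooled text contexts
image_emb = mixer(proposals, text)   # (4, 512)
```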
