

Poster

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

Yicheng Liu · Jie Wen · Chengliang Liu · Xiaozhao Fang · Zheng Zhang · Zuoyong Li · Yong Xu


Abstract:

Large-scale pre-trained vision-language models (e.g., CLIP) demonstrate powerful zero-shot transfer capabilities in image recognition tasks. Recent works generally use supervised fine-tuning to adapt CLIP to zero-shot multi-label image recognition, but obtaining sufficient multi-label annotated image data for training is exceptionally challenging and does not scale. In this paper, to reduce the reliance on annotated images, we propose a new language-driven framework for zero-shot multi-label recognition that requires no images for training. Based on the aligned CLIP embedding space, our method leverages language data to train a cross-modal classifier and transfers it to the visual modality. However, directly applying the classifier to visual inputs may limit performance due to the modality gap phenomenon. To mitigate the impact of the modality gap, we propose a cross-modal mapping method that maps image embeddings to the language modality while retaining crucial visual information. Experiments on the MS-COCO, VOC2007, and NUS-WIDE datasets show that our method outperforms other zero-shot multi-label recognition methods and even achieves competitive results compared with few-shot methods.
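The abstract does not specify implementation details, but the overall recipe it describes can be illustrated with a minimal sketch: train a multi-label classifier purely on CLIP text embeddings of label-bearing sentences, then apply it to image embeddings after nudging them toward the language modality. Everything below is an assumption for illustration only: the example label set, the caption-derived targets, the linear classifier, and the simple mean-shift "gap correction" are stand-ins, not the authors' actual cross-modal mapping.

```python
# Sketch of a language-driven zero-shot multi-label pipeline on top of CLIP
# (via Hugging Face transformers). Labels, captions, targets, and the
# mean-shift modality-gap correction are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["person", "dog", "bicycle"]                 # example label set
captions = ["a photo of a person walking a dog",      # language-only training data
            "a dog next to a bicycle"]
targets = torch.tensor([[1., 1., 0.],                 # multi-label targets parsed
                        [0., 1., 1.]])                # from the captions themselves

@torch.no_grad()
def embed_text(texts):
    tok = proc(text=texts, return_tensors="pt", padding=True).to(device)
    return nn.functional.normalize(clip.get_text_features(**tok), dim=-1)

# 1) Train a cross-modal (here: linear) classifier on text embeddings only.
classifier = nn.Linear(clip.config.projection_dim, len(labels)).to(device)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
text_emb = embed_text(captions)
for _ in range(200):
    loss = nn.functional.binary_cross_entropy_with_logits(
        classifier(text_emb), targets.to(device))
    opt.zero_grad(); loss.backward(); opt.step()

# 2) At test time, map image embeddings toward the language modality before
#    applying the classifier. A mean-shift correction is used here purely as
#    a placeholder for the paper's cross-modal mapping.
@torch.no_grad()
def predict(images):
    pix = proc(images=images, return_tensors="pt").to(device)
    v = nn.functional.normalize(clip.get_image_features(**pix), dim=-1)
    v = nn.functional.normalize(
        v - v.mean(0, keepdim=True) + text_emb.mean(0, keepdim=True), dim=-1)
    return torch.sigmoid(classifier(v))               # per-label probabilities
```

The key point the sketch conveys is that no image ever enters the training loop: the classifier sees only text embeddings, and the shared CLIP space is what makes reusing it on images plausible once the modality gap is accounted for.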
