

Poster

Molecule-Space: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Zehan Wang · Ziang Zhang · xize cheng · Rongjie Huang · Luping Liu · Zhenhui Ye · Haifeng Huang · Yang Zhao · Tao Jin · Peng Gao · Zhou Zhao


Abstract:

A unified multimodal representation space pre-trained on massive data is the foundation of multimodal understanding and generation. However, the billions of model parameters and the problem of catastrophic forgetting make it challenging and costly to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and augments a pre-trained unified space by integrating knowledge from extra expert spaces via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Building on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Benefiting from this modular design, we further propose a coarse-to-fine customized inference strategy to flexibly adapt the enhanced unified space to different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces. The resulting space significantly outperforms ImageBind on five downstream tasks across nine datasets. Moreover, via customized inference, it even surpasses the source expert spaces on image-text and audio-text tasks.
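As a rough illustration of the modular idea (not the paper's actual algorithm), the sketch below treats each representation space as an encoder producing L2-normalized embeddings and models a "space combination" step as a weighted fusion of two aligned spaces. The encoder stand-ins, the projection matrix, and the mixing weight `alpha` are hypothetical placeholders, not details from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize embeddings onto the unit hypersphere, as is common for
    # contrastively trained multimodal spaces (e.g. CLIP / ImageBind style).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def combine_spaces(emb_unified, emb_expert, projection, alpha=0.5):
    """Hypothetical 'space combination': map expert-space embeddings into the
    unified space with an alignment projection, then blend the two spaces.

    emb_unified : (n, d) embeddings from the pre-trained unified space
    emb_expert  : (n, k) embeddings from an expert space
    projection  : (k, d) matrix aligning the expert space to the unified one
    alpha       : mixing weight between the two spaces (assumed, not from the paper)
    """
    projected = l2_normalize(emb_expert @ projection)
    fused = alpha * l2_normalize(emb_unified) + (1.0 - alpha) * projected
    return l2_normalize(fused)

# Toy usage with random stand-ins for real encoders.
rng = np.random.default_rng(0)
unified = rng.normal(size=(4, 512))   # e.g. unified-space image embeddings
expert = rng.normal(size=(4, 768))    # e.g. an image-text expert encoder
proj = rng.normal(size=(768, 512))    # placeholder for a fitted alignment map
fused = combine_spaces(unified, expert, proj, alpha=0.6)
print(fused.shape)  # (4, 512)
```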
