

Poster

Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Jinxia Yang · Bing Su · Xin Zhao · Ji-Rong Wen


Abstract:

Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multimodal medical datasets, most existing methods have not thoroughly tapped into such extensive supervisory signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert architecture to integrate distinctive visual features from both frontal and lateral views. In addition to the global alignment between whole images and texts, Med-ST establishes modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective realized through forward mapping classification and reverse mapping regression. By perceiving temporal information from simple to complex, Med-ST learns temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks.
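The abstract does not spell out how the bidirectional cycle consistency objective is computed. The sketch below is one plausible reading only, in the spirit of soft nearest-neighbour temporal cycle consistency: each image embedding is mapped forward to the report sequence, cycled back to the image sequence, and penalized if the cycle does not land on its starting time step, with a classification term and a regression term. The function name `cycle_consistency_losses`, the temperature `tau`, and the (T, D) per-study embedding layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_losses(img_seq, txt_seq, tau=0.1):
    """Hypothetical sketch of a cross-modal bidirectional cycle.

    img_seq: (T, D) L2-normalized image embeddings over time
    txt_seq: (T, D) L2-normalized report embeddings over time
    """
    T = img_seq.size(0)

    # Forward mapping: each image embedding picks a soft nearest
    # neighbour in the text sequence.
    sim_it = img_seq @ txt_seq.t() / tau      # (T, T) similarities
    alpha = sim_it.softmax(dim=1)             # soft assignment weights
    txt_hat = alpha @ txt_seq                 # cycled text proxies

    # Cycle back: proxy text embeddings are compared against the
    # original image sequence.
    sim_ti = txt_hat @ img_seq.t() / tau      # (T, T) logits
    target = torch.arange(T)

    # Classification view of the cycle: each row should classify
    # its own starting time index.
    loss_cls = F.cross_entropy(sim_ti, target)

    # Regression view of the cycle: the expected landing index
    # should regress to the starting index (a softer criterion).
    beta = sim_ti.softmax(dim=1)              # (T, T)
    idx = torch.arange(T, dtype=img_seq.dtype)
    landing = beta @ idx                      # expected landing index
    loss_reg = F.mse_loss(landing, idx)

    return loss_cls, loss_reg
```

Under this reading, the classification term gives a coarse "same time step or not" signal while the regression term adds an ordinal notion of how far the cycle drifted, matching the abstract's simple-to-complex progression.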
