PRIOR: Prototype Representation Joint Learning from Medical Images and Reports

Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, Xiaoying Tang
2024-03-11
Abstract: Contrastive-learning-based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to interchange information across modalities in the training phase by reconstructing masked images and reports. For reconstructing long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show that the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses insufficient fine-grained representation in joint pre-training on medical images and reports. Specifically:

1. **Challenges of Cross-Modal Alignment**: Existing methods usually align medical images and reports using global information only and neglect local alignment. This leads to poor performance on tasks that require fine-grained information, such as semantic segmentation and object detection.
2. **Importance of Low-Level Features**: Medical image analysis tasks are often very sensitive to low-level features (such as lesion boundaries), which existing methods tend to overlook.
3. **Handling Complex Text Structures**: Medical reports are usually complex, and individual sentences often describe specific sub-regions of the image. Existing methods struggle to handle this structure effectively.

To address these issues, the paper proposes a prototype representation learning framework (PRIOR) that captures fine-grained features by combining global and local alignment with cross-modal conditional reconstruction. It reports outperforming other state-of-the-art methods on five downstream tasks: supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection.
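
As a rough illustration of what global image-report alignment by contrastive learning can look like, the sketch below computes a symmetric InfoNCE loss over a batch of paired embeddings. It is not taken from the PRIOR code base: the function name `symmetric_info_nce`, the temperature of 0.07, and the embedding shapes are assumptions made for this example, and the paper's local alignment and reconstruction modules are not covered.

```python
import numpy as np


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to come from the
    same image-report pair; all other rows in the batch act as negatives.
    The temperature value is an illustrative assumption.
    """
    # L2-normalize so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (batch, batch).
    logits = img @ txt.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Correctly paired embeddings should score a lower loss than mispaired ones.
    print(symmetric_info_nce(emb, emb))
    print(symmetric_info_nce(emb, np.roll(emb, 1, axis=0)))
```

A loss of this form only pulls whole-image and whole-report embeddings together, which is why the summary above stresses that global alignment alone can miss the fine-grained, region-level correspondences needed for segmentation and detection.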