Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Haifan Gong, Guanqi Chen, Sishuo Liu, Yizhou Yu, Guanbin Li

DOI: https://doi.org/10.48550/arXiv.2105.00136

2021-05-01

Abstract: Due to the severe lack of labeled data, existing methods of medical visual question answering usually rely on transfer learning to obtain effective image feature representation and use cross-modal fusion of visual and linguistic features to achieve question-related answer prediction. These two phases are performed independently and without considering the compatibility and applicability of the pre-trained features for cross-modal fusion. Thus, we reformulate image feature pre-training as a multi-task learning paradigm and witness its extraordinary superiority, forcing it to take into account the applicability of features for the specific image comprehension task. Furthermore, we introduce a cross-modal self-attention (CMSA) module to selectively capture the long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.

Multimedia

What problem does this paper attempt to address?

The paper addresses a weakness in existing Medical Visual Question Answering (MVQA) pipelines. Because labeled data is severely limited, these methods usually rely on transfer learning to obtain image feature representations and then use cross-modal fusion to predict question-related answers. Image feature pre-training and cross-modal fusion are carried out independently, so the pre-trained features are never checked for compatibility with, or applicability to, the downstream image understanding task. This limits both the quality of the feature representations and overall model performance.

To overcome these problems, the authors propose two changes:

1. **Multi-task pre-training**: Reformulate image feature pre-training as a multi-task learning paradigm, so that the applicability of the features to the specific image understanding task is taken into account during pre-training.
2. **Cross-Modal Self-Attention (CMSA) module**: Introduce a CMSA module that selectively captures long-range contextual relevance, fusing visual and linguistic features more effectively.

With these changes, the proposed method outperforms existing state-of-the-art methods on the VQA-RAD dataset.
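As a rough illustration of the cross-modal self-attention idea, the sketch below applies scaled dot-product self-attention to joint visual-linguistic tokens, so that every image region can attend to every other region regardless of distance. It is a minimal sketch under stated assumptions, not the authors' implementation: the fusion layout (concatenating a single question embedding onto each region feature), the projection matrices `w_q`, `w_k`, `w_v`, and all function names are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_modal_self_attention(visual, question, w_q, w_k, w_v):
    """Self-attention over joint (image region + question) tokens.

    visual:   list of region feature vectors, each of length d_v
    question: one question embedding of length d_q (assumed layout)
    w_q/w_k/w_v: projection matrices of shape (d_v + d_q) x d_out
    Returns (attended tokens, attention weights).
    """
    # Assumption: build one joint token per region by appending the question
    # embedding to that region's visual feature.
    tokens = [region + question for region in visual]

    queries = matmul(tokens, w_q)
    keys = matmul(tokens, w_k)
    values = matmul(tokens, w_v)

    # Scaled dot-product scores between every pair of joint tokens.
    scale = math.sqrt(len(queries[0]))
    scores = [[s / scale for s in row] for row in matmul(queries, transpose(keys))]

    # Each row is a distribution over all regions, near or far.
    weights = [softmax(row) for row in scores]
    attended = matmul(weights, values)
    return attended, weights


# Toy usage: two image regions (d_v = 2), one question embedding (d_q = 1),
# identity projections so the effect of the attention weights is easy to see.
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
fused, attn = cross_modal_self_attention(
    [[1.0, 0.0], [0.0, 1.0]], [0.5], identity, identity, identity
)
print(len(fused), len(fused[0]))
```

In a trained model the projections would be learned and the attended tokens would feed an answer classifier; those parts are omitted here because the summary above does not describe them.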

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering

Medical visual question answering with symmetric interaction attention and cross-modal gating

Self-supervised vision-language pretraining for Medical visual question answering

Medical visual question answering via corresponding feature fusion combined with semantic attention

Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Question-guided Feature Pyramid Network for Medical Visual Question Answering

Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering

[A medical visual question answering approach based on co-attention networks]

Parallel multi-head attention and term-weighted question embedding for medical visual question answering

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Cross-Modal Multistep Fusion Network with Co-Attention for Visual Question Answering

Medical visual question answering using joint self-supervised learning

AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering

MAPM: multiscale attention pre-training model for TextVQA

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering

Examine Before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA