Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Haifan Gong, Guanqi Chen, Sishuo Liu, Yizhou Yu, Guanbin Li
DOI: https://doi.org/10.48550/arXiv.2105.00136
2021-05-01
Abstract: Due to the severe lack of labeled data, existing methods of medical visual question answering usually rely on transfer learning to obtain effective image feature representation and use cross-modal fusion of visual and linguistic features to achieve question-related answer prediction. These two phases are performed independently and without considering the compatibility and applicability of the pre-trained features for cross-modal fusion. Thus, we reformulate image feature pre-training as a multi-task learning paradigm and witness its extraordinary superiority, forcing it to take into account the applicability of features for the specific image comprehension task. Furthermore, we introduce a cross-modal self-attention (CMSA) module to selectively capture the long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.
Multimedia
What problem does this paper attempt to address?
The paper addresses medical visual question answering (MVQA), where labeled data is severely limited. Existing methods typically rely on transfer learning to obtain image feature representations and then apply cross-modal fusion of visual and linguistic features to predict the answer to a question. These two stages, image feature pre-training and cross-modal fusion, are carried out independently, so nothing ensures that the pre-trained features are compatible with, or applicable to, the fusion step and the specific image understanding task. This limits both the quality of the feature representations and the performance of the model.

To overcome these problems, the authors propose two changes:

1. **Multi-task pre-training**: Image feature pre-training is reformulated as a multi-task learning paradigm, so that the applicability of the features to the specific image understanding task is taken into account during pre-training.
2. **Cross-Modal Self-Attention (CMSA) module**: A CMSA module selectively captures long-range contextual relevance, fusing visual and linguistic features more effectively.

With these changes, the proposed method outperforms existing state-of-the-art methods on the VQA-RAD dataset.
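To make the idea of cross-modal self-attention more concrete, the following is a minimal sketch in plain Python. It is not the authors' CMSA module, whose internals are not described here. It assumes, for illustration only, that visual and linguistic features are already projected to the same dimension, concatenates them into one token sequence, and applies generic scaled dot-product self-attention with a residual connection, so every token can attend to tokens of both modalities. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and the toy sizes are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_modal_self_attention(visual, text, w_q, w_k, w_v):
    """Generic self-attention over a joint visual + text token sequence.

    visual: n_v feature vectors of dimension d (e.g. image regions)
    text:   n_t feature vectors of dimension d (e.g. question words)
    w_q, w_k, w_v: d x d projection matrices (hypothetical parameters)
    Returns the fused tokens and the attention weights.
    """
    tokens = visual + text                      # joint sequence of n_v + n_t tokens
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))            # pairwise token affinities
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    # Residual connection: keep the original token and add its attended context.
    fused = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    return fused, weights


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4                                           # toy feature dimension
visual = rand_matrix(3, d)                      # 3 toy visual tokens
text = rand_matrix(2, d)                        # 2 toy question tokens
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
fused, weights = cross_modal_self_attention(visual, text, w_q, w_k, w_v)
print(len(fused), len(fused[0]))
```

Each row of `weights` spans both visual and text positions, which is what lets a question token draw context from distant image regions and vice versa. A real model would learn the projections and typically use a deep learning framework rather than nested lists.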