Caption-Aware Medical VQA Via Semantic Focusing and Progressive Cross-Modality Comprehension

Fuze Cong, Shibiao Xu, Li Guo, Yinbing Tian
DOI: https://doi.org/10.1145/3503161.3548122
2022-01-01
Abstract: Medical Visual Question Answering (VQA) is a domain-specific task that requires substantial prior knowledge of medicine. However, deep learning techniques suffer from limited supervision because well-annotated, large-scale medical VQA datasets are scarce. To mitigate this data limitation, image captioning can be introduced to learn summary information about the image, which benefits question answering. To this end, we propose a caption-aware VQA method that reads summary information about image content and clinical diagnoses from a large number of medical images and answers medical questions with richer multimodal features. The proposed method consists of two novel components that emphasize semantic locations and semantic content, respectively. First, to extract and leverage the semantic locations implied in image captioning, a similarity analysis summarizes the attention maps generated during image captioning according to their relevance and guides the visual model to focus on semantically rich regions. Second, to incorporate the semantic content of the generated captions, we propose a Progressive Compact Bilinear Interactions structure that achieves cross-modality comprehension over the image, question, and caption features by performing bilinear attention progressively. Qualitative and quantitative experiments on several medical datasets show that the proposed approach outperforms state-of-the-art methods.
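The abstract does not specify how the similarity analysis is computed, so the following is only an illustrative sketch of one way caption attention maps could be summarized by relevance. It assumes each caption token yields a flattened attention map, scores each map by its mean cosine similarity to the other maps, and combines the maps with softmax weights. The function names `cosine` and `summarize_attention`, the cosine measure, and the softmax weighting are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def summarize_attention(maps):
    """Combine per-token attention maps into one summary map.

    maps: list of T flattened attention maps (lists of floats, equal length).
    Returns (summary, weights). Hypothetical scheme: a map that agrees with
    the other maps is treated as more relevant and receives a larger weight.
    """
    t = len(maps)
    if t == 0:
        raise ValueError("need at least one attention map")

    # Relevance of map i = mean cosine similarity to every other map.
    if t == 1:
        relevance = [1.0]
    else:
        relevance = [
            sum(cosine(maps[i], maps[j]) for j in range(t) if j != i) / (t - 1)
            for i in range(t)
        ]

    # Softmax over relevance scores (shifted by the max for numerical stability).
    peak = max(relevance)
    exps = [math.exp(r - peak) for r in relevance]
    total_exp = sum(exps)
    weights = [e / total_exp for e in exps]

    # Weighted sum of the maps, renormalized so the summary sums to 1.
    n = len(maps[0])
    summary = [sum(weights[i] * maps[i][k] for i in range(t)) for k in range(n)]
    total = sum(summary)
    if total > 0:
        summary = [s / total for s in summary]
    return summary, weights


# Toy example: two maps agree on region 0, one outlier attends to region 3.
maps = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
summary, weights = summarize_attention(maps)
```

In this toy example the outlier map receives the smallest weight, so the summary map peaks on the region the other two maps agree on. A summary map of this kind could then serve as a spatial prior for the visual encoder, although how the paper applies that guidance is not stated in the abstract.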