TAM at VQA-Med 2021 - A Hybrid Model with Feature Extraction and Fusion for Medical Visual Question Answering.

Yong Li, Zhenguo Yang, Tianyong Hao
2021-01-01
Abstract: This paper briefly describes our model for the ImageCLEF Medical Visual Question Answering Task 2021 (ImageCLEF VQA-Med task 2021). Our method is based on a universal VQA framework and consists of an image feature extraction module, a question feature extraction module, and a feature fusion module. We employ a modified ResNet-34 as the backbone of the image feature extractor, which effectively extracts pixel-level features and enhances model performance in a deep network. For question feature extraction, we first use word embeddings to map question tokens to high-dimensional vectors, and then feed them to a long short-term memory (LSTM) network to extract high-level question features. In addition, we leverage Multi-modal Factorized Bilinear Pooling (MFB) with a co-attention mechanism to fuse these features and predict the final answers. Our model achieves an accuracy score of 0.222 and a BLEU score of 0.255, ranking eighth among all participating teams in the VQA-Med task.
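The fusion step named in the abstract can be illustrated with a minimal, framework-free sketch of MFB as it is generally formulated: both feature vectors are projected into a shared space, multiplied element-wise, sum-pooled over windows of size k, and then power- and L2-normalized. This is not the authors' implementation. The dimensions, the random weights, and the helper names (`linear`, `mfb_fuse`, `random_matrix`) are assumptions for illustration only, and the co-attention mechanism, the ResNet-34 extractor, and the LSTM are omitted.

```python
import math
import random


def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mfb_fuse(image_feat, question_feat, U, V, k):
    """Fuse two feature vectors with factorized bilinear pooling.

    U has shape (k * o, len(image_feat)) and V has shape
    (k * o, len(question_feat)); the result has length o.
    """
    # Project both modalities into the same (k * o)-dimensional space.
    proj_img = linear(U, image_feat)
    proj_q = linear(V, question_feat)
    # The element-wise product models interactions between the modalities.
    joint = [a * b for a, b in zip(proj_img, proj_q)]
    # Sum pooling over non-overlapping windows of size k.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Power normalization: signed square root of each entry.
    powered = [math.copysign(math.sqrt(abs(z)), z) for z in pooled]
    # L2 normalization (skipped if the vector is all zeros).
    norm = math.sqrt(sum(z * z for z in powered))
    return [z / norm for z in powered] if norm > 0 else powered


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): 8-d image feature, 6-d question feature, k = 3, o = 4.
rng = random.Random(0)
m, n, k, o = 8, 6, 3, 4
image_feat = [rng.uniform(-1.0, 1.0) for _ in range(m)]
question_feat = [rng.uniform(-1.0, 1.0) for _ in range(n)]
U = random_matrix(k * o, m, rng)
V = random_matrix(k * o, n, rng)

fused = mfb_fuse(image_feat, question_feat, U, V, k)
print(len(fused))
```

In a trained model, U and V would be learned parameters, and the fused vector would be passed to a classifier over candidate answers; a deep learning framework would replace the list arithmetic used here.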