Multimodal Local Perception Bilinear Pooling for Visual Question Answering

Mingrui Lao, Yanming Guo, Hui Wang, Xin Zhang
DOI: https://doi.org/10.1109/access.2018.2873570
IF: 3.9
2018-01-01
IEEE Access
Abstract: Visual question answering is a challenging multimodal task that has received increasing attention in recent years. A key problem in visual question answering is how to fuse the visual and textual features extracted from the image and the question, so that information from both modalities can be used together to deliver the correct answer. Bilinear pooling is a powerful fusion approach because it exhaustively models the interaction between every pair of elements of the two modalities, but its large number of parameters limits its practical application. In this paper, we aim to retain the advantages of bilinear pooling for feature interaction and propose a novel multimodal feature fusion approach named multimodal local perception bilinear (MLPB) pooling, which retains the second-order interactions between visual and textual features with a limited number of learnable parameters. Specifically, MLPB uses a local perception mechanism that transforms the bilinear pooling between two high-dimensional raw features into bilinear pooling over multiple low-dimensional part features. To further reduce the computational cost, we propose sharing the learnable parameters across the local bilinear poolings. In this way, MLPB can achieve the complex interactions of bilinear pooling without consuming excessive computational resources. Extensive experiments show that the proposed method achieves performance competitive with or better than the state of the art.
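The idea of local bilinear pooling with shared parameters can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the features are split or how parts are paired, so the sketch assumes each feature vector is cut into equal contiguous parts and that every visual part is paired with every textual part. The function names, the dimensions, and the all-pairs scheme are hypothetical choices made only for illustration.

```python
import numpy as np


def full_bilinear_param_count(dv, dq, out_dim):
    """Parameters of a full bilinear map: one (dv x dq) matrix per output unit."""
    return dv * dq * out_dim


def local_bilinear_pool(v, q, W, num_parts):
    """Local bilinear pooling with one shared weight tensor (illustrative only).

    v: visual feature, shape (dv,)
    q: textual feature, shape (dq,)
    W: shared weights, shape (dv // num_parts, dq // num_parts, o)
    Assumption: features are split into equal contiguous parts, and every
    visual part interacts with every textual part through the same W.
    """
    v_parts = v.reshape(num_parts, -1)  # (K, dv / K)
    q_parts = q.reshape(num_parts, -1)  # (K, dq / K)
    # z[i, j, o] = sum over a, b of v_parts[i, a] * W[a, b, o] * q_parts[j, b]
    z = np.einsum("ia,abo,jb->ijo", v_parts, W, q_parts)
    return z.reshape(-1)  # fused feature, shape (K * K * o,)


# Toy dimensions, chosen only to keep the arithmetic easy to follow.
dv, dq, num_parts, o = 8, 6, 2, 3
rng = np.random.default_rng(0)
v = rng.standard_normal(dv)
q = rng.standard_normal(dq)
W = rng.standard_normal((dv // num_parts, dq // num_parts, o))

fused = local_bilinear_pool(v, q, W, num_parts)
shared_params = W.size
full_params = full_bilinear_param_count(dv, dq, fused.size)
print(fused.shape, shared_params, full_params)
```

With these toy sizes, the shared tensor holds 4 × 3 × 3 = 36 weights, whereas a full bilinear map producing the same 12-dimensional output would need 8 × 6 × 12 = 576. The gap widens quickly at realistic feature sizes, which is the kind of saving the abstract attributes to local perception and parameter sharing.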