Dual-decoder transformer network for answer grounding in visual question answering

Liangjun Zhu, Li Peng, Weinan Zhou, Jielong Yang
DOI: https://doi.org/10.1016/j.patrec.2023.04.003
IF: 4.757
2023-04-20
Pattern Recognition Letters
Abstract: Visual Question Answering (VQA) has made stunning advances by exploiting the Transformer architecture and large-scale visual-linguistic pretraining. State-of-the-art methods generally require large amounts of data and devices to predict textual answers, and they fail to provide visual evidence for those answers. To mitigate these limitations, we propose a novel dual-decoder Transformer network (DDTN) that predicts the language answer and the corresponding vision instance. Specifically, the linguistic features are first embedded by a Long Short-Term Memory (LSTM) block and a Transformer encoder, which are shared between the two Transformer decoders. Then, we introduce an object detector to obtain vision region features and grid features, which reduces the size and cost of DDTN. These visual features are combined with the linguistic features and fed into the two decoders respectively. Moreover, we design an instance query to guide the fused visual-linguistic features toward outputting the instance mask or bounding box. Finally, the classification layers aggregate the results from the decoders and predict the answer together with the corresponding instance coordinates. Without bells and whistles, DDTN achieves state-of-the-art performance on the VizWizGround and GQA datasets and is even competitive with pretrained models. The code will be released at https://github.com/zlj63501/DDTN .
computer science, artificial intelligence
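The data flow described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name DualDecoderSketch, all layer sizes, the pairing of region features with the answer decoder and grid features with the grounding decoder, the mean pooling, and the sigmoid box head are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class DualDecoderSketch(nn.Module):
    """Toy dual-decoder layout. Every size and wiring choice is an assumption."""

    def __init__(self, vocab_size=1000, num_answers=50, visual_dim=256,
                 d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # Linguistic branch: embedding, LSTM block, then a Transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)

        # Detector features are assumed precomputed; only project them here.
        self.region_proj = nn.Linear(visual_dim, d_model)
        self.grid_proj = nn.Linear(visual_dim, d_model)

        # Two decoders that share the same linguistic memory.
        def make_decoder():
            layer = nn.TransformerDecoderLayer(
                d_model, nhead, dim_feedforward=4 * d_model, batch_first=True)
            return nn.TransformerDecoder(layer, num_layers)

        self.answer_decoder = make_decoder()
        self.ground_decoder = make_decoder()

        # Learned instance query that reads out one grounded instance.
        self.instance_query = nn.Parameter(torch.randn(1, 1, d_model))

        self.answer_head = nn.Linear(d_model, num_answers)
        self.box_head = nn.Linear(d_model, 4)  # assumed (cx, cy, w, h) in [0, 1]

    def forward(self, question_ids, region_feats, grid_feats):
        # Shared linguistic features.
        text = self.embed(question_ids)
        text, _ = self.lstm(text)
        text = self.text_encoder(text)

        # Answer branch: region tokens attend to the linguistic memory.
        regions = self.region_proj(region_feats)
        fused = self.answer_decoder(tgt=regions, memory=text)
        answer_logits = self.answer_head(fused.mean(dim=1))

        # Grounding branch: the instance query attends to grid + text tokens.
        grid = self.grid_proj(grid_feats)
        memory = torch.cat([grid, text], dim=1)
        query = self.instance_query.expand(grid.size(0), -1, -1)
        instance = self.ground_decoder(tgt=query, memory=memory)
        box = self.box_head(instance.squeeze(1)).sigmoid()
        return answer_logits, box


# Usage with random stand-in inputs (batch of 2, 12 question tokens,
# 36 region features, 49 grid features).
model = DualDecoderSketch().eval()
question_ids = torch.randint(0, 1000, (2, 12))
region_feats = torch.randn(2, 36, 256)
grid_feats = torch.randn(2, 49, 256)
with torch.no_grad():
    answer_logits, box = model(question_ids, region_feats, grid_feats)
print(answer_logits.shape, box.shape)
```

The sketch returns one answer distribution and one normalized box per question. A mask output, which the abstract also mentions, would need a different head and is left out here.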