Abstract: Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, with increasing research focused on enhancing VQA accuracy through advanced models such as Transformers. Despite this growing interest, limited work has examined the comparative effectiveness of textual encoders in VQA, particularly considering model complexity and computational efficiency. In this work, we conduct a comprehensive comparison between complex textual models that leverage long-range dependencies and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not invariably the optimal approach for the VQA-v2 dataset. Motivated by this insight, we propose ConvGRU, a model that incorporates convolutional layers to improve text feature representation without substantially increasing model complexity. Tested on the VQA-v2 dataset, ConvGRU demonstrates a modest yet consistent improvement over baselines for question types such as Number and Count, which highlights the potential of lightweight architectures for VQA tasks, especially when computational resources are limited.

What problem does this paper attempt to address?

The main question this paper addresses is whether complex text encoders are the optimal choice for Visual Question Answering (VQA), specifically on the original VQA-v2 dataset. The author compares complex models (such as Transformer encoders and attention-based models) with simple models (such as RNNs and CNNs) on VQA-v2 and finds that complex text encoders are not always superior to simple ones, especially in resource-limited situations. Motivated by this, the author proposes the ConvGRU model, which improves text feature representation by introducing convolutional layers without significantly increasing model complexity.

### Main Contributions

1. **Proposing the ConvGRU Model**: Convolutional layers are added in front of a Gated Recurrent Unit (GRU) so that local text features can improve VQA accuracy without substantially increasing model complexity.
2. **Extensive Comparison of Text Encoders**: A comprehensive comparison of multiple VQA text encoders shows that complex models (such as Transformer encoders) may be less effective than simple models (such as ConvGRU) in some cases.
3. **Experimental Verification**: Extensive experiments on the VQA-v2 dataset demonstrate a consistent improvement from ConvGRU on specific question types (such as number and counting questions).

### Research Background

Visual Question Answering (VQA) combines computer vision and natural language processing, with the goal of answering a question about a given image. Although advanced models such as Transformers have achieved remarkable success in many tasks, their performance on the VQA task is not always optimal, especially in resource-limited situations.
Therefore, exploring simple and effective text encoding methods is crucial for optimizing the performance of the VQA task.

### Method Overview

The author proposes a new text feature extraction framework, ConvGRU. It captures local text features through convolutional layers and extracts sequential semantics through GRU units. The specific steps are as follows:

1. **Word Embedding**: Embed each word in the question into a high-dimensional space to generate a matrix \( E\in\mathbb{R}^{B\times L\times d} \), where \( B \) is the batch size, \( L \) is the sequence length, and \( d \) is the embedding dimension.
2. **Convolution Operation**: Use convolution kernels of different sizes (such as 2 and 3) to extract bi-gram and tri-gram features, with different padding strategies (asymmetric head padding and symmetric padding).
3. **Feature Fusion**: Concatenate the output of the convolutional layers with the original feature matrix \( M \) to form an enhanced feature set \( X \).
4. **GRU Processing**: Process the enhanced feature set \( X \) through GRU units: compute the update gate \( z_i \), the reset gate \( r_i \), and the new memory unit \( \tilde{h}_i \), and finally obtain the updated hidden state \( h_i \).

### Experimental Results

The experimental results show that complex text encoders (such as Transformer encoders and multi-head self-attention mechanisms) are not always superior to simple models in the VQA task. Instead, local text features play a crucial role in determining model accuracy. The ConvGRU model shows a modest yet consistent improvement over baselines on specific question types (such as number and counting questions), which is especially relevant in resource-limited situations.

### Conclusion

This study experimentally verifies that in the VQA task, complex text encoders are not always the optimal choice.
The ConvGRU model provides a lightweight and effective solution by introducing convolutional layers to capture local text features, and it is especially suitable for resource-constrained scenarios. This provides new directions and ideas for future VQA research.
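
To make the four Method Overview steps concrete, here is a minimal pure-Python sketch of the pipeline for a single question (the batch dimension \( B \) is dropped). It is not the author's implementation. Several details are assumptions for illustration only: the function names (`pad_tokens`, `conv1d`, `gru_step`, `conv_gru_encode`) are hypothetical, kernel size 2 is paired with asymmetric head padding and kernel size 3 with symmetric padding, each convolution has \( d \) output channels, biases are omitted, weights are random, and the GRU uses one common form of the hidden-state update.

```python
import math
import random


def pad_tokens(tokens, kernel_size, mode):
    """Zero-pad a list of d-dim token vectors so a stride-1 conv keeps length L."""
    zero = [0.0] * len(tokens[0])
    total = kernel_size - 1
    if mode == "head":            # asymmetric: all padding before the first token
        left = total
    elif mode == "symmetric":     # equal padding on both sides (odd kernels only)
        if total % 2:
            raise ValueError("symmetric padding needs an odd kernel size")
        left = total // 2
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    right = total - left
    return [zero] * left + list(tokens) + [zero] * right


def conv1d(tokens, weights, mode):
    """Stride-1 1D convolution over the token axis.

    weights[c][k][j] is the weight of output channel c for window offset k
    and input dimension j, so the kernel size is len(weights[0]).
    """
    kernel_size = len(weights[0])
    padded = pad_tokens(tokens, kernel_size, mode)
    out = []
    for i in range(len(tokens)):
        window = padded[i:i + kernel_size]
        out.append([
            sum(w * x for w_row, x_row in zip(channel, window)
                for w, x in zip(w_row, x_row))
            for channel in weights
        ])
    return out


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def gru_step(x, h_prev, p):
    """One GRU update (biases omitted); p maps names to weight matrices."""
    z = sigmoid([a + b for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))])
    r = sigmoid([a + b for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))])
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    # Assumed convention; some libraries swap the roles of z and (1 - z).
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def conv_gru_encode(embedded, conv2_w, conv3_w, gru_params, hidden_dim):
    """Encode one question: n-gram convs -> concat with embeddings -> GRU."""
    bigram = conv1d(embedded, conv2_w, "head")
    trigram = conv1d(embedded, conv3_w, "symmetric")
    # Feature fusion: concatenate per token along the feature axis.
    fused = [e + b + t for e, b, t in zip(embedded, bigram, trigram)]
    h = [0.0] * hidden_dim
    for x in fused:
        h = gru_step(x, h, gru_params)
    return fused, h


rng = random.Random(0)
L, d, hidden_dim = 5, 4, 6          # illustrative sizes only
embedded = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(L)]
conv2_w = [rand_matrix(2, d, rng) for _ in range(d)]   # d channels, kernel 2
conv3_w = [rand_matrix(3, d, rng) for _ in range(d)]   # d channels, kernel 3
gru_params = {
    name: rand_matrix(hidden_dim, 3 * d if name[0] == "W" else hidden_dim, rng)
    for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")
}
fused, question_vec = conv_gru_encode(embedded, conv2_w, conv3_w,
                                      gru_params, hidden_dim)
print(len(fused), len(fused[0]), len(question_vec))  # prints: 5 12 6
```

Both padding modes keep the sequence length at \( L \), which is what allows the bi-gram and tri-gram features to be concatenated token by token with the embeddings. The fused feature width is then \( 3d \) under the channel assumption above. A practical version would use a tensor library and trained weights rather than Python lists.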

Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
