Beyond Words: ESC‐Net Revolutionizes VQA by Elevating Visual Features and Defying Language Priors

Souvik Chowdhury, Badal Soni
DOI: https://doi.org/10.1111/coin.70010
2024-12-05
Computational Intelligence
Abstract: Language prior is a pressing problem in the visual question answering (VQA) domain, in which a model answers by favoring the most frequent related answer. Several methods have been adopted to mitigate the language prior issue, for example, the ensemble approach, the balanced data approach, the modified evaluation strategy, and the modified training framework. In this article, we propose a VQA model, "Ensemble of Spatial and Channel Attention Network (ESC‐Net)," to overcome the language bias problem by improving the visual features. We use regional and global image features along with an ensemble of combined channel and spatial attention mechanisms to improve the visual features. The model is a simpler and more effective solution to language bias than existing methods. Extensive experiments show a remarkable performance improvement of 18% on the VQA‐CP v2 dataset compared with current state‐of‐the‐art (SOTA) models.
computer science, artificial intelligence
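The abstract does not specify how ESC‐Net combines its attention mechanisms, so the following is only an illustrative sketch of the general idea of channel attention followed by spatial attention, with several such branches averaged into an ensemble. All function names, the fixed `scale` parameters (standing in for learned layers), and the averaging rule are assumptions made for illustration and are not taken from the paper. Plain Python lists are used to keep the sketch dependency-free.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feats, scale=1.0):
    """Gate each channel by a sigmoid of its global average.

    feats is a C x H x W nested list. `scale` is a hypothetical stand-in
    for the learned layers a real model would use.
    """
    gated = []
    for channel in feats:
        values = [v for row in channel for v in row]
        gate = sigmoid(scale * sum(values) / len(values))
        gated.append([[v * gate for v in row] for row in channel])
    return gated


def spatial_attention(feats, scale=1.0):
    """Gate each spatial location by a sigmoid of its mean across channels."""
    n_channels = len(feats)
    height, width = len(feats[0]), len(feats[0][0])
    mask = [
        [
            sigmoid(scale * sum(feats[c][i][j] for c in range(n_channels)) / n_channels)
            for j in range(width)
        ]
        for i in range(height)
    ]
    return [
        [[feats[c][i][j] * mask[i][j] for j in range(width)] for i in range(height)]
        for c in range(n_channels)
    ]


def ensemble_attention(feats, scales=(0.5, 1.0, 2.0)):
    """Average several channel-then-spatial branches (assumed ensemble rule)."""
    branches = [spatial_attention(channel_attention(feats, s), s) for s in scales]
    n = len(branches)
    return [
        [
            [sum(b[c][i][j] for b in branches) / n for j in range(len(feats[0][0]))]
            for i in range(len(feats[0]))
        ]
        for c in range(len(feats))
    ]


# Toy feature map: 2 channels, 2 x 2 spatial grid.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
refined = ensemble_attention(features)
```

Because every gate lies strictly between 0 and 1, each refined value is a damped copy of the input value, and the output keeps the input's C x H x W shape. A trained model would learn the gating parameters rather than fix them as done here.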