Overcoming language priors with self-contrastive learning for visual question answering
Hong Yan,Lijun Liu,Xupeng Feng,Qingsong Huang
DOI: https://doi.org/10.1007/s11042-022-14167-2
IF: 2.577
2022-11-12
Multimedia Tools and Applications
Abstract: Although remarkable success has been achieved in recent years on the Visual Question Answering (VQA) task, most existing models are heavily driven by surface linguistic correlations in the training set and ignore the image content. Several recent methods introduce auxiliary tasks (visual annotation, counterfactual samples, etc.) to overcome language priors and enhance image dependence. However, the inherent prior problem, namely whether the original model is driven by priors memorized from the training data, remains unresolved. We therefore propose a novel self-contrastive learning method that contrasts the answers predicted from question-relevant regions with those predicted from question-irrelevant regions, addressing this problem without introducing auxiliary tasks. Concretely, when the question attends to the question-relevant regions and to the question-irrelevant regions, different answer spaces are generated and contrasted, which prevents the model from being driven by surface language priors. The question is thus forced to rely on relevant image regions to predict the correct answer. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our method. In particular, built on top of the LMH model, our method achieves state-of-the-art performance of 59.00% on the most commonly used benchmark, VQA-CP v2, without auxiliary tasks, an improvement of 6.51%.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
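
The contrast described in the abstract, between answers predicted from question-relevant regions and answers predicted from question-irrelevant regions, can be illustrated with a small sketch. This is not the authors' implementation: the top-k split by attention weight, the linear answer classifier, the exact loss form, and all names (`self_contrastive_loss`, `pool`, `answer_probs`, `top_k`) are assumptions made purely for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pool(features, weights, indices):
    """Attention-weighted average of the selected region features."""
    norm = sum(weights[i] for i in indices)
    dim = len(features[0])
    return [sum(weights[i] * features[i][d] for i in indices) / norm
            for d in range(dim)]

def answer_probs(pooled, classifier):
    """Hypothetical linear answer classifier: one weight row per answer."""
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in classifier]
    return softmax(logits)

def self_contrastive_loss(features, attention, classifier, target, top_k=1):
    """Assumed loss: reward the correct answer on relevant regions and
    penalize it on irrelevant regions. Attention weights must be positive."""
    if not 0 < top_k < len(attention):
        raise ValueError("top_k must leave at least one region in each branch")
    # Assumption: the top_k most attended regions count as question-relevant.
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    relevant, irrelevant = order[:top_k], order[top_k:]
    p_rel = answer_probs(pool(features, attention, relevant), classifier)
    p_irr = answer_probs(pool(features, attention, irrelevant), classifier)
    eps = 1e-12  # guards against log(0)
    loss = -math.log(p_rel[target] + eps) - math.log(1.0 - p_irr[target] + eps)
    return loss, p_rel, p_irr

# Toy example: two regions with 2-d features, two candidate answers.
features = [[1.0, 0.0], [0.0, 1.0]]
attention = [0.9, 0.1]
classifier = [[2.0, 0.0], [0.0, 2.0]]
loss, p_rel, p_irr = self_contrastive_loss(features, attention, classifier, target=0)
```

In this toy setup the relevant branch assigns a higher probability to the target answer than the irrelevant branch does, so the loss is small; choosing the other answer as the target makes both terms large. The point of the sketch is only that the two branches yield different answer distributions that a training signal can contrast.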