Abstract: With the application of deep learning methods to image processing, image-related intelligent interaction technology has developed rapidly. Vision and language are the two core faculties through which human intelligence understands the real world, and they are also basic components of artificial intelligence; a great deal of research has been carried out in each field. With the continued adoption of deep learning in computer vision and natural language processing, visual question answering (VQA), which spans the two disciplines, has become a research hotspot in recent years. VQA for intelligent interaction collects image information by asking questions about the content of an image, with the ultimate aim of enriching image understanding. As an emerging research direction, VQA still faces substantial challenges that remain to be studied and explored. Through a comprehensive comparison and analysis of existing VQA models and methods, this paper summarizes the shortcomings and development directions of current research, and analyzes how several VQA models process the image input and the question input, together with their working principles and the public datasets commonly used with them. It concludes that extending structured knowledge bases and applying mature technologies such as text question answering and natural language processing to VQA problems are the future development directions for VQA models.

On the Cognition of Visual Question Answering Models and Human Intelligence: A Comparative Study

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Visual Question Answering as Reading Comprehension

Visual Question Answering Method Based on Counterfactual Thinking

Visual Question Answering via Combining Inferential Attention and Semantic Space Mapping

Achieving Human Parity on Visual Question Answering

Visual Question Answering for Intelligent Interaction

Perceptual Visual Reasoning with Knowledge Propagation

Visual Question Answering by Pattern Matching and Reasoning

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

VQA: Visual Question Answering

AI-VQA

An effective spatial relational reasoning networks for visual question answering

Knowing Where to Look? Analysis on Attention of Visual Question Answering System

Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool

Multimodal Cross-guided Attention Networks for Visual Question Answering

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering