Abstract: Visual Question Answering (VQA) is a significant research area that combines computer vision and natural language processing techniques. Among all question types, counting questions such as "How many?" are considered the most challenging, and VQA models still have difficulty counting the objects present in natural images. Basic VQA techniques either classify answers from a fixed-length representation of both the question and the image, or sum fractional counts estimated from each image segment. The soft attention used in these methods is identified as a primary source of these issues. To circumvent this problem, this paper implements a new visual question-answering system for the counting scenario. First, standard benchmark datasets for visual question answering are gathered. These datasets contain both images and questions, so feature extraction is applied to both. For the questions, text pre-processing is first performed through punctuation removal, stemming, and stop-word removal, and word2vec features are then extracted. Similarly, deep features of the images are extracted from the pooling layer of a Deep Convolutional Neural Network (DCNN). The two feature sets are combined and passed to an optimal feature selection procedure that retains the most significant features, those carrying unique information. The selection of optimal features is handled by the Optimized Deep Neural-Long Short-Term Memory (DN-LSTM). The Parameter Improved-Elephant Herding Optimization (PI-EHO) requires less time and lower computational complexity, can be applied to engineering optimization problems in general, and can also tackle multilevel thresholding problems.
These advantages of PI-EHO over conventional optimization algorithms motivate the choice of EHO in the designed method. Finally, answer generation is performed by a hybrid deep learning model that combines Long Short-Term Memory (LSTM) and a Deep Neural Network (DNN), whose architecture is improved by the proposed EHO. The designed method is evaluated on different datasets and yields promising results compared with existing methods.
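
The question pre-processing stage described in the abstract (punctuation removal, stop-word removal, stemming) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the suffix list, and the function names `strip_punctuation`, `crude_stem`, and `preprocess_question` are hypothetical, and the suffix-stripping stemmer is a simple stand-in for a real stemming algorithm. The resulting tokens would then be mapped to word2vec vectors, which is not shown here.

```python
import string

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
# "how" and "many" are kept on purpose, since they signal a counting question.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "there", "this", "that"}

# Hypothetical suffix list for a crude stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "es", "s")


def strip_punctuation(text):
    """Remove every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_question(question):
    """Lowercase, remove punctuation, drop stop words, then stem each token."""
    tokens = strip_punctuation(question.lower()).split()
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [crude_stem(token) for token in kept]


if __name__ == "__main__":
    print(preprocess_question("How many dogs are sitting on the benches?"))
    # prints ['how', 'many', 'dog', 'sitt', 'bench']
```

The output shows why a crude stemmer is only a placeholder: "sitting" becomes "sitt", whereas a proper stemmer would give "sit".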

Design and development of counting-based visual question answering model using heuristic-based feature selection with deep learning

Simple and Effective Visual Question Answering in a Single Modality

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Question-Driven Multiple Attention(DQMA) Model for Visual Question Answer

Question-guided Feature Pyramid Network for Medical Visual Question Answering

An Improved Attention and Hybrid Optimization Technique for Visual Question Answering

The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

Visual Question Answering Model Based on Visual Relationship Detection

Deep Attention Neural Tensor Network For Visual Question Answering

Multi-source Multi-level Attention Networks for Visual Question Answering

Research and implementation of visual question and answer system based on deep learning

Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering

Visual Question Answering using Deep Learning: A Survey and Performance Analysis

A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

Dual self-attention with co-attention networks for visual question answering

A lightweight Transformer-based visual question answering network with Weight-Sharing Hybrid Attention

Task-driven Visual Saliency and Attention-based Visual Question Answering

Answer Again: Improving VQA With Cascaded-Answering Model

Overcoming the Limitations of Learning-Based VQA for Counting Questions with Zero-Shot Learning

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering

AI-VQA