FashionVQA: A Domain-Specific Visual Question Answering System

Min Wang, Ata Mahjoubfar, Anupama Joshi
DOI: https://doi.org/10.48550/arXiv.2208.11253
2022-08-24
Abstract: Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural language; this is particularly true for systems specialized in visually-dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to the successful training of our VQA model is the automatic creation of a visual question-answering dataset with 168 million samples from item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that considers the difficulty of the question-answer pairs to emphasize challenging concepts. Contrary to the recent trends in using several datasets for pretraining the visual question answering models, we focused on keeping the dataset fixed while training various models from scratch to isolate the improvements from model architecture changes. We see that using the same transformer for encoding the question and decoding the answer, as in language models, achieves maximum accuracy, showing that visual language models (VLMs) make the best visual question answering systems for our dataset. The accuracy of the best model surpasses the human expert level, even when answering human-generated questions that are not confined to the template formats. Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language. The training of such domain-expert models, e.g., our fashion VLM model, cannot rely solely on the large-scale general-purpose datasets collected from the web.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to infer the visual attributes of fashion items automatically, consistently, and accurately by means of visual question answering. It proposes FashionVQA, a domain-specific visual question-answering system trained to answer complex natural-language questions about apparel in fashion photoshoot images. Such a system can reduce the time and cost of manually labeling fashion item attributes while improving the accuracy and consistency of the labels. The main contributions of the paper are:

1. **Constructing a large-scale domain-specific dataset**: The authors build a fashion VQA dataset of 168 million question-answer-image triples from the item attributes of 207 thousand images. The samples are generated automatically by filling in diverse question templates, using a strategy that considers the difficulty of each question-answer pair to emphasize challenging concepts.
2. **Multimodal fusion model**: A cross-modal fusion model maps visual and text representations to the same latent feature space and uses a classifier module for answer prediction. The model can understand different visual attributes in the input image and their structural relationships. The abstract adds that, among the architectures trained from scratch on the fixed dataset, using the same transformer to encode the question and decode the answer achieves the highest accuracy.
3. **Performance evaluation**: Benchmark tests show that the accuracy of the best model surpasses the human expert level, even on human-generated questions that are not confined to the template formats.

In conclusion, the paper aims to improve visual question answering in the fashion domain by combining a large-scale domain-specific dataset with a multimodal fusion model, enabling automated, high-precision labeling of fashion item attributes.
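The template-filling step in the first contribution can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the template strings, the attribute names, the `generate_samples` and `sample_by_difficulty` helpers, and the per-answer difficulty weights are all hypothetical, since the text above says only that templates are filled from item attributes and that difficulty is taken into account.

```python
import random

# Hypothetical question templates keyed by attribute name.
# The paper's actual templates are not reproduced here.
TEMPLATES = {
    "color": [
        "What color is the {category}?",
        "Which color is the {category} in the photo?",
    ],
    "material": [
        "What material is the {category} made of?",
        "Which fabric is the {category}?",
    ],
}


def generate_samples(image_id, item, templates=TEMPLATES):
    """Return (image_id, question, answer) triples for one annotated item."""
    samples = []
    for attribute, value in item["attributes"].items():
        for template in templates.get(attribute, []):
            question = template.format(category=item["category"])
            samples.append((image_id, question, value))
    return samples


def sample_by_difficulty(samples, difficulty, k, seed=0):
    """Draw k triples with replacement, weighting each by its answer's difficulty.

    `difficulty` maps an answer to a non-negative weight (an assumed scheme);
    answers missing from the map get weight 1.0.
    """
    rng = random.Random(seed)
    weights = [difficulty.get(answer, 1.0) for _, _, answer in samples]
    return rng.choices(samples, weights=weights, k=k)


# Example: one image with one annotated item.
item = {"category": "dress", "attributes": {"color": "red", "material": "silk"}}
samples = generate_samples("img_001", item)

# Emphasize the (assumed) harder concept by giving it a larger weight.
batch = sample_by_difficulty(samples, {"red": 0.5, "silk": 2.0}, k=8)
```

Because every attribute of every item is crossed with every matching template, the number of triples grows multiplicatively, which is one plausible reading of how 207 thousand images can yield 168 million samples.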