FashionVQA: A Domain-Specific Visual Question Answering System

Min Wang, Ata Mahjoubfar, Anupama Joshi

DOI: https://doi.org/10.48550/arXiv.2208.11253

2022-08-24

Abstract: Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural language; this is particularly true for systems specialized in visually-dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to the successful training of our VQA model is the automatic creation of a visual question-answering dataset with 168 million samples from item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that considers the difficulty of the question-answer pairs to emphasize challenging concepts. Contrary to the recent trends in using several datasets for pretraining the visual question answering models, we focused on keeping the dataset fixed while training various models from scratch to isolate the improvements from model architecture changes. We see that using the same transformer for encoding the question and decoding the answer, as in language models, achieves maximum accuracy, showing that visual language models (VLMs) make the best visual question answering systems for our dataset. The accuracy of the best model surpasses the human expert level, even when answering human-generated questions that are not confined to the template formats. Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language. The training of such domain-expert models, e.g., our fashion VLM model, cannot rely solely on the large-scale general-purpose datasets collected from the web.
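The template-based, difficulty-aware sample generation mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the attribute schema in `ITEMS`, the two templates, the `DIFFICULTY` weights, and the function name `generate_samples` are hypothetical and are not taken from the paper, which does not specify its generation procedure at this level of detail.

```python
import random

# Hypothetical item attributes; the real dataset's schema is not given here.
ITEMS = {
    "img_001": {"category": "dress", "color": "red", "sleeve length": "sleeveless"},
    "img_002": {"category": "jacket", "color": "black", "sleeve length": "long"},
}

# Hypothetical question templates with {attribute} and {category} slots.
TEMPLATES = [
    "What is the {attribute} of the {category}?",
    "Which {attribute} does the {category} have?",
]

# Assumed per-attribute difficulty weights (higher means sampled more often).
DIFFICULTY = {"color": 1.0, "sleeve length": 3.0}


def generate_samples(items, templates, difficulty, n, seed=0):
    """Return n (image_id, question, answer) triples, favouring hard attributes."""
    rng = random.Random(seed)
    candidates, weights = [], []
    for image_id, attrs in items.items():
        for attribute, value in attrs.items():
            if attribute == "category":
                continue  # the category fills the template, it is not asked about
            candidates.append((image_id, attrs["category"], attribute, value))
            weights.append(difficulty.get(attribute, 1.0))

    samples = []
    # Weighted sampling with replacement emphasises the harder attributes.
    for image_id, category, attribute, value in rng.choices(candidates, weights=weights, k=n):
        template = rng.choice(templates)
        question = template.format(attribute=attribute, category=category)
        samples.append((image_id, question, value))
    return samples


if __name__ == "__main__":
    for triple in generate_samples(ITEMS, TEMPLATES, DIFFICULTY, n=5):
        print(triple)
```

Because each attribute of each image can be combined with many templates, the number of question-answer pairs grows much faster than the number of images, which is consistent with the ratio of samples to images reported above.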

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses how to infer the visual attributes of fashion items automatically, consistently, and accurately by means of a visual question-answering system for the fashion domain. Specifically, it proposes a domain-specific visual question-answering system named FashionVQA, which is trained to answer complex natural-language questions about clothing in fashion photos. This can reduce the time and cost of manually labeling fashion item attributes while improving the accuracy and consistency of the labels.

The main contributions of the paper include:

1. **Constructing a large-scale domain-specific dataset**: The authors construct a fashion VQA dataset containing 207,000 images and 168 million question-answer-image triples. The dataset is generated by automatically filling in question templates, taking into account the difficulty of the questions and emphasizing challenging concepts.
2. **Multimodal fusion model**: A cross-modal fusion model is proposed, which maps visual and text representations to the same latent feature space and uses a classifier module for answer prediction. This model can understand different visual attributes in the input image and their structural relationships.
3. **Performance evaluation**: Benchmark tests show that the model exceeds human expert-level performance when answering human-generated questions, even though these questions are not restricted to the template formats.

In conclusion, this paper aims to improve visual question answering in the fashion domain by constructing a large-scale domain-specific dataset and a multimodal fusion model, thereby achieving automated, high-precision labeling of fashion item attributes.
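The cross-modal fusion idea in contribution 2 (project image and question representations into a shared latent space, fuse them, then classify over candidate answers) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the element-wise product as fusion operator, the `tanh` projections, the random untrained weights, the feature sizes, and all names (`predict_answer`, `W_img`, `W_txt`, `W_cls`) are hypothetical and do not describe the authors' actual architecture.

```python
import math
import random


def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def predict_answer(image_feat, question_feat, params, answers):
    """Fuse the two modalities in a shared latent space and classify."""
    # Project each modality into the same latent dimension.
    v = [math.tanh(x) for x in linear(image_feat, params["W_img"])]
    q = [math.tanh(x) for x in linear(question_feat, params["W_txt"])]
    # Element-wise product as the fusion operator (an assumed choice).
    fused = [a * b for a, b in zip(v, q)]
    # Classifier head: one logit per candidate answer.
    probs = softmax(linear(fused, params["W_cls"]))
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Toy setup: 6-d image features, 4-d question features, 5-d latent space.
rng = random.Random(0)
ANSWERS = ["red", "black", "sleeveless", "long"]
PARAMS = {
    "W_img": random_matrix(5, 6, rng),
    "W_txt": random_matrix(5, 4, rng),
    "W_cls": random_matrix(len(ANSWERS), 5, rng),
}
image_feat = [rng.random() for _ in range(6)]
question_feat = [rng.random() for _ in range(4)]
answer, probs = predict_answer(image_feat, question_feat, PARAMS, ANSWERS)
```

With untrained weights the predicted answer is arbitrary; the sketch only shows the data flow from two modality-specific feature vectors to a probability distribution over a fixed answer vocabulary.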

Simple and Effective Visual Question Answering in a Single Modality

VQA: Visual Question Answering

Visual Question Answering by Pattern Matching and Reasoning

A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

Visual Question Answering using Deep Learning: A Survey and Performance Analysis

Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail

Right this way: Can VLMs Guide Us to See More to Answer Questions?

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Visual question answering: A survey of methods and datasets

Visual Question Answering As Reading Comprehension

Vision–Language Model for Visual Question Answering in Medical Imagery

The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

Visual Question Answering for Intelligent Interaction

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Visual Question Answering Model Based on Visual Relationship Detection