Abstract: Generating natural, diverse, and meaningful questions from images is an essential task for multimodal assistants, as it confirms whether they have properly understood the objects and scenes in the images. Research in visual question answering (VQA) and visual question generation (VQG) is a significant step in this direction, but it does not capture the questions that a visually-abled person would ask a multimodal assistant. Recently published datasets such as KB-VQA, FVQA, and OK-VQA try to collect questions that require external knowledge, which makes them appropriate for multimodal assistants. However, they still contain many obvious and common-sense questions that humans would not usually ask a digital assistant. In this paper, we provide a new benchmark dataset that contains questions written by human annotators with a multimodal digital assistant in mind. Because large-scale annotation of several hundred thousand images is expensive and time-consuming, we also present an effective way of automatically generating questions for unseen images. Our approach generates diverse and meaningful questions by considering both image content and image metadata (e.g., location, associated keywords). We evaluate the approach using standard metrics such as BLEU, METEOR, ROUGE, and CIDEr to show the relevance of the generated questions to human-provided questions. We also measure the diversity of the generated questions using the generative strength and inventiveness metrics. We report new state-of-the-art results on both public datasets and our own.

A Question Type Driven Framework to Diversify Visual Question Generation

Diverse Visual Question Generation based on Multiple Objects Selection

A Reinforcement Learning Framework for Natural Question Generation Using Bi-discriminators

Question-type Driven Question Generation

Exploring Diverse Methods in Visual Question Answering

Visual Question Generation Under Multi-granularity Cross-Modal Interaction

Learning to Generate Visual Questions with Noisy Supervision

Information Maximizing Visual Question Generation

Diversifying Question Generation over Knowledge Base via External Natural Questions

Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference

Generating Natural Questions from Images for Multimodal Assistants

A one-to-many conditional generative adversarial network framework for multiple image-to-image translations

Learning to Generate Question by Asking Question: A Primal-Dual Approach with Uncommon Word Generation

Debiased Visual Question Answering via the perspective of question types

Deep Bayesian Network for Visual Question Generation

QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems

Goal-Oriented Visual Question Generation Via Intermediate Rewards

Image Content Generation with Causal Reasoning

Diversify Question Generation with Retrieval-Augmented Style Transfer

VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization

MultiQG-TI: Towards Question Generation from Multi-modal Sources