Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.” In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
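
As a purely illustrative sketch, the carriage example from the abstract can be written as predicate(subject, object) triples and queried in two steps. The `Relationship` and `SceneGraph` names and their methods are hypothetical, chosen for this example; they are not the dataset's schema or any published API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One pairwise relationship, read as predicate(subject, obj)."""
    predicate: str
    subject: str
    obj: str


@dataclass
class SceneGraph:
    """Hypothetical container for the objects and relationships of one image."""
    objects: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def add_relationship(self, predicate, subject, obj):
        # Register both endpoints as objects, then store the triple.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(predicate, subject, obj))

    def objects_of(self, predicate, subject):
        # All objects o such that predicate(subject, o) holds.
        return [r.obj for r in self.relationships
                if r.predicate == predicate and r.subject == subject]

    def subjects_of(self, predicate, obj):
        # All subjects s such that predicate(s, obj) holds.
        return [r.subject for r in self.relationships
                if r.predicate == predicate and r.obj == obj]


# The two relationships named in the abstract.
graph = SceneGraph()
graph.add_relationship("riding", "man", "carriage")
graph.add_relationship("pulling", "horse", "carriage")

# "What vehicle is the person riding?" as two lookups:
vehicle = graph.objects_of("riding", "man")[0]    # what the man rides
pullers = graph.subjects_of("pulling", vehicle)   # what pulls that vehicle
print(f"The person is riding a {vehicle} pulled by: {', '.join(pullers)}")
```

Answering the question takes both triples: the first lookup finds the vehicle, and the second finds what pulls it, which is the detail that turns "carriage" into "horse-drawn carriage".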

Neural Naturalist: Generating Fine-Grained Image Comparisons

Bird song comparison using deep learning trained from avian perceptual judgments

From Captions to Visual Concepts and Back

NATURAL LANGUAGE DESCRIPTION OF REMOTE SENSING IMAGES BASED ON DEEP LEARNING

Part-based Fine-Grained Bird Image Retrieval Respecting Species Correlation

Learning Deep Representations of Fine-Grained Visual Descriptions

Benchmarking Representation Learning for Natural World Image Collections

Visual DNA: Representing and Comparing Images using Distributions of Neuron Activations

A high-throughput approach for the efficient prediction of perceived similarity of natural objects

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Natural Language Descriptions of Deep Visual Features

Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Leveraging Habitat Information for Fine-grained Bird Identification

Multilingual Image Description with Neural Sequence Models

Towards Generating and Evaluating Iconographic Image Captions of Artworks

A Fine-Grained Recognition Neural Network with High-Order Feature Maps via Graph-Based Embedding for Natural Bird Diversity Conservation

CD-GAN: Commonsense-Driven Generative Adversarial Network with Hierarchical Refinement for Text-to-Image Synthesis

Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee

Feathers dataset for Fine-Grained Visual Categorization

Can Giraffes Become Birds? An Evaluation of Image-to-image Translation for Data Generation

Neuraltalk+: neural image captioning with visual assistance capabilities