Neural Naturalist: Generating Fine-Grained Image Comparisons

Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, Serge Belongie
DOI: https://doi.org/10.48550/arXiv.1909.04101
2019-11-14
Abstract: We introduce the new Birds-to-Words dataset of 41k sentences describing fine-grained differences between photographs of birds. The language collected is highly detailed, while remaining understandable to the everyday observer (e.g., "heart-shaped face," "squat body"). Paragraph-length descriptions naturally adapt to varying levels of taxonomic and visual distance (drawn from a novel stratified sampling approach) with the appropriate level of detail. We propose a new model called Neural Naturalist that uses a joint image encoding and comparative module to generate comparative language, and evaluate the results with humans who must use the descriptions to distinguish real images. Our results indicate promising potential for neural models to explain differences in visual embedding space using natural language, as well as a concrete path for machine learning to aid citizen scientists in their effort to preserve biodiversity.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to explain, in natural language, the differences between two images in fine-grained visual domains. Specifically, the authors focus on generating natural-language text that describes subtle differences between photographs of birds. The descriptions need to be detailed yet understandable to non-experts. This targets a key problem in species identification: how non-experts can distinguish between species that look similar. By introducing a new dataset (Birds-to-Words) and a new model (Neural Naturalist), the authors aim to improve the performance of machine-learning models on this task and thereby assist citizen scientists in their biodiversity conservation work.