Image2Text: A multimodal caption generator

Chang Liu, Fuchun Sun, Changhu Wang, Yong Rui
2016-01-01
Abstract: In this work, we showcase the Image2Text system, a real-time captioning system that can generate human-level natural language descriptions for any input image. We formulate image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Unlike most existing work, in which the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects that serves as the source sequence of the RNN model. Based on this captioning framework, we develop a user-friendly system that automatically generates human-level captions for users. The system also enables users to detect salient objects in an image and to retrieve similar images and their corresponding descriptions from a database.
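The idea of treating detected objects as the source sequence can be illustrated with a minimal sketch. It only shows how detector output might be turned into a token-id sequence for a sequence-to-sequence encoder; it is not the authors' implementation. The `Detection` type, the `detections_to_source_sequence` helper, the toy vocabulary, the confidence threshold, and the choice to order objects by descending confidence are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Detection:
    """One detected object (hypothetical detector output)."""
    label: str    # object class name
    score: float  # detector confidence in [0, 1]


def detections_to_source_sequence(
    detections: List[Detection],
    vocab: Dict[str, int],
    min_score: float = 0.5,   # assumed threshold, not from the paper
    max_len: int = 10,        # assumed cap on source length
    unk_token: str = "<unk>",
    eos_token: str = "<eos>",
) -> List[int]:
    """Map detections to token ids usable as an encoder input sequence."""
    # Drop low-confidence detections.
    kept = [d for d in detections if d.score >= min_score]
    # Assumed ordering: most confident object first.
    kept.sort(key=lambda d: d.score, reverse=True)
    kept = kept[:max_len]
    # Labels missing from the vocabulary fall back to the unknown token.
    ids = [vocab.get(d.label, vocab[unk_token]) for d in kept]
    # Terminate the source sequence, as in machine translation.
    ids.append(vocab[eos_token])
    return ids


# Toy vocabulary and detections, invented for this example.
vocab = {"<unk>": 0, "<eos>": 1, "dog": 2, "frisbee": 3, "grass": 4}
detections = [
    Detection("frisbee", 0.81),
    Detection("dog", 0.97),
    Detection("tree", 0.62),   # not in the vocabulary
    Detection("grass", 0.30),  # below the threshold
]

source_ids = detections_to_source_sequence(detections, vocab)
print(source_ids)  # [2, 3, 0, 1]
```

In a full system, a sequence like `source_ids` would be fed to an RNN encoder, and a decoder RNN would emit the caption word by word, which is the analogy to machine translation that the abstract draws.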