Supplementary Material: Natural Language Object Retrieval
Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell
2016-01-01
Abstract: In this document, we visualize some results on the ReferIt dataset [1] using our SCRC model, showing that it can correctly retrieve an object by exploiting its description in context. We also evaluate our model on the Flickr30K Entities dataset [2], and show that our model can be applied to both “object” and “stuff” regions and can generate descriptions for given image regions.

1. Retrieval on object descriptions in context

In practice, people usually describe an object based on both the object itself and its context, that is, other objects and the whole scene. To distinguish a specific object from others in a scene, especially when there are multiple objects of the same category, a description needs to contain not only the category name but also other discriminative information such as locations or attributes. Figure 1 shows an example: one cannot refer to a person using only the category name “person”, since there are three people in the scene, and must instead use a description based on the environment as the query. Our SCRC model can handle such context-based descriptions by incorporating spatial configurations and scene-level context into the recurrent network. Figure 2 shows some retrieval examples for multiple objects within the same image on the ReferIt [1] dataset, where objects are described in context.

2. Object retrieval evaluation on Flickr30K Entities dataset

We also train and evaluate our method on the Flickr30K Entities dataset [2] for natural language object retrieval. This dataset contains 31,783 images and 275,775 annotated bounding boxes. Its object-level annotations are derived from existing scene-level captions in Flickr30K [3]. We train our model on the referential expressions in the Flickr30K dataset using the same top-100 EdgeBox [4] proposals as in [2]. On this dataset, our SCRC model achieves higher recall than the Canonical Correlation Analysis (CCA) method in [2], as shown in Table 1.

Method     R@1     R@10
CCA [2]    25.3%   59.7%
SCRC       27.8%   62.9%
Oracle     76.9%   76.9%

Table 1. Performance of our method compared with the Canonical Correlation Analysis (CCA) baseline on 100 EdgeBox proposals in the Flickr30K Entities dataset. Oracle corresponds to the highest possible recall on all 100 proposals for any retrieval method.

3. Object vs. stuff

The ReferIt dataset contains annotations on both “object” regions and “stuff” regions. In computer vision, the term object usually refers to entities with a closed boundary and a well-defined shape, such as “car”, “person” and “laptop”. In contrast, stuff refers to entities without a regular shape, such as “grass”, “road” and “sky”. Given an input image and a natural language query, our SCRC model is not only capable of retrieving “object” regions, but can also be applied to “stuff” regions. Figure 3 shows some examples of stuff retrieval on the ReferIt dataset.

4. Generating descriptions for objects

Although our SCRC model is designed for natural language object retrieval, it can also be applied to another task: generating descriptions for the objects in an image. Given an image $I_{im}$ and the bounding box of an object, a text description $S_{des}$ can be generated for the object as $S_{des} = \arg\max_S p(S \mid I_{box}, I_{im}, x_{spatial})$ using beam search, where $I_{box}$ is the local image region of the object and $x_{spatial}$ is its spatial configuration. Figure 4 shows some object descriptions generated by our SCRC model on the ReferIt dataset.

[Figure 1 panels: a scene with three people; query='man far right'; query='left guy'; query='cyclist']

Figure 1. An example image in the ReferIt dataset where objects are described based on other objects in the scene. When referring to one of the three “people” in the image, expressions based on both the object and the context are used to make the description discriminative. Our model can handle such object descriptions in context by incorporating this information into the recurrent neural network.
In the images above, yellow boxes are the ground truth and green boxes are results correctly retrieved by our model, using the highest-scoring candidate from 100 EdgeBox proposals.
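
To make the R@1, R@10 and Oracle columns of Table 1 in Section 2 concrete, the following is a minimal Python sketch of how such recall numbers could be computed from scored proposals. It assumes that a proposal counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5; that threshold is a common convention and is an assumption here, not something this document states. The names `iou`, `hit_at_k`, `recall_at_k`, `oracle_recall` and the toy `queries` list are hypothetical and used only for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Query = Tuple[Sequence[Box], Sequence[float], Box]  # (proposals, scores, ground truth)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def hit_at_k(proposals: Sequence[Box], scores: Sequence[float],
             gt: Box, k: int, thresh: float = 0.5) -> bool:
    """True if any of the k highest-scoring proposals matches the ground truth."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    return any(iou(proposals[i], gt) >= thresh for i in order[:k])


def recall_at_k(queries: List[Query], k: int, thresh: float = 0.5) -> float:
    """Fraction of queries with a correct box among the top-k proposals."""
    hits = sum(hit_at_k(p, s, g, k, thresh) for p, s, g in queries)
    return hits / len(queries)


def oracle_recall(queries: List[Query], thresh: float = 0.5) -> float:
    """Upper bound: fraction of queries where ANY proposal matches, ignoring scores."""
    hits = sum(any(iou(b, g) >= thresh for b in p) for p, _, g in queries)
    return hits / len(queries)


# Toy data (assumed, not from the paper): two proposals per query.
boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.1]
queries: List[Query] = [
    (boxes, scores, (0, 0, 10, 10)),     # top-scoring proposal is correct
    (boxes, scores, (20, 20, 30, 30)),   # correct proposal is ranked second
    (boxes, scores, (50, 50, 60, 60)),   # no proposal covers the ground truth
]
```

Because the oracle ignores the scores, its recall is the same for every k, which is consistent with the Oracle row in Table 1 having identical R@1 and R@10 values.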
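
The description generation step in Section 4 searches for the word sequence S that maximizes p(S | I_box, I_im, x_spatial) using beam search. The sketch below illustrates that search in Python under stated assumptions: the conditioning on the image region, the whole image and the spatial configuration is hidden inside a hypothetical `next_word_probs` callable, and a small hand-written table `TOY` stands in for the recurrent network. The beam size, maximum length and `<eos>` token are illustrative choices that this document does not specify.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # assumed end-of-sequence token

# Hypothetical interface: maps a word prefix to a next-word distribution.
# In the real model this would also depend on I_box, I_im and x_spatial.
NextWordFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_word_probs: NextWordFn, beam_size: int = 3,
                max_len: int = 10) -> Tuple[List[str], float]:
    """Return (words, log-probability) of the best sequence found."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, List[str]]] = []
        for logp, words in beams:
            for word, p in next_word_probs(tuple(words)).items():
                if p <= 0.0:
                    continue
                cand = (logp + math.log(p), words + [word])
                # Sequences ending in EOS are complete and leave the beam.
                (finished if word == EOS else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
        # Extending a sequence can only lower its log-probability, so stop
        # once a finished sequence is at least as good as the best open one.
        if finished and max(f[0] for f in finished) >= beams[0][0]:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return [w for w in best[1] if w != EOS], best[0]


# Toy next-word table (assumed), standing in for the trained network.
TOY: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"the": 0.6, "man": 0.4},
    ("the",): {"man": 0.5, "guy": 0.5},
    ("the", "man"): {EOS: 1.0},
    ("the", "guy"): {EOS: 1.0},
    ("man",): {"far": 0.9, EOS: 0.1},
    ("man", "far"): {"right": 1.0},
    ("man", "far", "right"): {EOS: 1.0},
}


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    return TOY[prefix]


words, logp = beam_search(toy_model, beam_size=3)
greedy_words, greedy_logp = beam_search(toy_model, beam_size=1)
```

In this toy table, a beam of size 1 (greedy decoding) commits to the locally most likely first word and ends with a lower-probability sequence than a beam of size 3, which is the usual motivation for keeping several hypotheses. Beam search remains an approximation of the argmax; it is not guaranteed to find the globally best sequence.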