Semantic R-CNN for Natural Language Object Detection.

Shuxiong Ye,Zheng Qin,Kaiping Xu,Kai Huang,Guolong Wang
DOI: https://doi.org/10.1007/978-3-319-77383-4_10
2017-01-01
Abstract:In this paper, we present a simple and effective framework for natural language object detection, to localize a target within an image based on description of the target. The method, called semantic R-CNN, extends RPN (Region Proposal Network) [1] by adding LSTM [20] module for processing natural language query text. LSTM [20] module take encoded query text and image descriptors as input and output the probability of the query text conditioned on visual features of candidate box and whole image. Those candidate boxes are generated by RPN and their local features are extracted by ROI pooling. RPN can be initialized from pre-trained Faster R-CNN model [1], transfers object visual knowledge from traditional object detection domain to our task. Experimental results demonstrate that our method significantly outperform previous baseline SCRC (Spatial Context Recurrent ConvNet) [7] model on Referit dataset [8], moreover, our model is simple to train similar to Faster R-CNN.
What problem does this paper attempt to address?