Learning Semantic Feature Map for Visual Content Recognition

Rui-Wei Zhao,Zuxuan Wu,Jianguo Li,Yu-Gang Jiang
DOI: https://doi.org/10.1145/3123266.3123379
2017-01-01
Abstract:The spatial relationship among objects provide rich clues to object contexts for visual recognition. In this paper, we propose to learn Semantic Feature Map (SFM) by deep neural networks to model the spatial object contexts for better understanding of image and video contents. Specifically, we first extract high-level semantic object features on input image with convolutional neural networks for every object proposals, and organize them to the designed SFM so that spatial information among objects are preserved. To fully exploit the spatial relationship among objects, we employ either Fully Convolutional Networks (FCN) or Long-Short Term Memory (LSTM) on top of SFM for final recognition. For better training, we also introduce a multi-task learning framework to train the model in an end-to-end manner. It is composed of an overall image classification loss as well as a grid labeling loss, which predicts the objects label at each SFM grid. Extensive experiments are conducted to verify the effectiveness of the proposed approach. For image classification, very promising results are obtained on Pascal VOC 2007/2012 and MS-COCO benchmarks. We also directly transfer the SFM learned on image domain to the video classification task. The results on CCV benchmark demonstrate the robustness and generalization capability of the proposed approach.
What problem does this paper attempt to address?