Abstract: Scene classification has established itself as a challenging research problem. Compared to images of individual objects, scene images can be much more semantically complex and abstract. The two differ mainly in the granularity of recognition. Yet object image recognition remains a key pillar of scene recognition, because the knowledge learned from object images can be used to recognize scenes accurately. Existing scene recognition methods take only the category label of the scene into consideration. However, we find that contextual information containing detailed local descriptions is also beneficial, as it allows the scene recognition model to be more discriminative. In this paper, we aim to improve scene recognition using the attribute and category label information encoded in objects. Based on the complementarity of attribute and category labels, we propose a Multi-task Attribute-Scene Recognition (MASR) network, which learns a category embedding and at the same time predicts scene attributes. Attribute acquisition and object annotation are tedious and time-consuming tasks. We tackle this problem by proposing a partially supervised annotation strategy in which human intervention is significantly reduced. The strategy provides a much more cost-effective solution for real-world scenarios and requires considerably less annotation effort. Moreover, we re-weight the attribute predictions according to the level of importance indicated by the object detection scores. Using the proposed method, we efficiently annotate attribute labels for four large-scale datasets and systematically investigate how scene and attribute recognition benefit from each other. The experimental results demonstrate that MASR learns a more discriminative representation and achieves competitive recognition performance compared to state-of-the-art methods.
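The multi-task idea in the abstract, a scene category objective trained jointly with attribute prediction, where attribute terms are re-weighted by object detection scores, can be illustrated with a minimal sketch. This is not the MASR implementation: the abstract does not give the loss formulation, so the weighted binary cross-entropy, the normalization by the sum of detection scores, the balancing factor `lam`, and all function names below are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scene_loss(scene_logits, scene_label):
    """Cross-entropy over scene categories (the category task)."""
    probs = softmax(scene_logits)
    return -math.log(probs[scene_label])


def attribute_loss(attr_logits, attr_labels, det_scores):
    """Binary cross-entropy per attribute, weighted by a detection score.

    ASSUMPTION: each attribute is tied to one detected object, and its
    loss term is scaled by that object's detection score, then
    normalized by the sum of the scores. The paper may weight differently.
    """
    eps = 1e-12
    total = 0.0
    for z, y, w in zip(attr_logits, attr_labels, det_scores):
        p = sigmoid(z)
        bce = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        total += w * bce
    return total / max(sum(det_scores), eps)


def multitask_loss(scene_logits, scene_label,
                   attr_logits, attr_labels, det_scores, lam=0.5):
    """Joint objective: scene loss plus lam times attribute loss.

    `lam` is a hypothetical balancing hyperparameter.
    """
    return (scene_loss(scene_logits, scene_label)
            + lam * attribute_loss(attr_logits, attr_labels, det_scores))


if __name__ == "__main__":
    loss = multitask_loss(
        scene_logits=[2.0, 0.5, -1.0], scene_label=0,
        attr_logits=[1.5, -0.3], attr_labels=[1, 0],
        det_scores=[0.9, 0.4],
    )
    print(f"joint loss: {loss:.4f}")
```

Under this assumed weighting, a wrong attribute prediction attached to a confidently detected object contributes more to the loss than the same error attached to a low-confidence detection, which is one plausible reading of "level of importance indicated by the object detection scores".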

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Learning Object Spatial Relationship from Demonstration

Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints

A Novel Multi-View Object Recognition in Complex Background

Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Learning 3D object-centric representation through prediction

MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding

Scene Recognition with Objectness, Attribute and Category Learning

Simultaneous Recognition and Modeling for Learning 3-D Object Models From Everyday Scenes

Unsupervised Multi-View CNN for Salient View Selection and 3D Interest Point Detection

Unsupervised Multi-view CNN for Salient View Selection of 3D Objects and Scenes

Object-Centric Multiple Object Tracking

Improving Viewpoint-Independent Object-Centric Representations through Active Viewpoint Selection

Unsupervised Discovery of Object-Centric Neural Fields

Compositional scene modeling with global object-centric representations

Multiview Scene Graph

Learning Disentangled Representation for Multi-View 3D Object Recognition

LoCUS: Learning Multiscale 3D-consistent Features from Posed Images

UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving