Abstract: Few-shot image recognition aims to classify categories from only one or a few training samples. Previous work on few-shot learning has mainly focused on simple images, such as object or character images. These methods usually use a convolutional neural network (CNN) to learn global image representations from training tasks, which are then adapted to novel tasks. However, many real-world images are more abstract and complex; scene images, for example, consist of many object entities with flexible spatial relations among them. In such cases, global features generalize poorly because object relations vary widely across scenes, which hinders adaptation to novel scenes. This paper proposes a composite object relation modeling method for few-shot scene recognition. Because objects commonly co-occur across different scenes, the method captures the spatial structure of scene images to improve adaptability to novel scenes. The objects in the same image usually play different roles in different few-shot scene recognition tasks, so we propose a task-aware region selection module (TRSM) to further select among the detected regions for each task. Beyond detecting object regions, we focus mainly on exploiting the relations between objects, which are more consistent with the scenes and can be used to tell different scenes apart. Objects and relations are used to construct a graph for each image, which is then modeled with a graph convolutional neural network. The graph modeling is jointly optimized with few-shot recognition, so that the few-shot learning loss also adjusts the graph-based representations. The proposed graph-based representations can be plugged into different types of few-shot architectures, such as metric-based and meta-learning methods. Experimental results on few-shot scene recognition show the effectiveness of the proposed method.
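
To make the graph step concrete, the following is a minimal sketch of building a graph over detected object regions and passing it through one graph convolution layer. It is not the paper's implementation: the nearest-neighbor edge rule, the symmetric normalization (the common Kipf and Welling form), the mean pooling, and all function names (`build_adjacency`, `gcn_layer`, `scene_embedding`) are assumptions chosen for illustration, and the region features are assumed to come from some external detector.

```python
import numpy as np

def build_adjacency(boxes, k=2):
    """Link each region to its k nearest regions by box-center distance.

    Hypothetical edge rule; boxes are (x1, y1, x2, y2) rows.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = np.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2], axis=1
    )
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a region is never its own neighbor
    n = len(boxes)
    adj = np.zeros((n, n))
    k = min(k, n - 1)
    if k > 0:
        nearest = np.argsort(dist, axis=1)[:, :k]
        rows = np.repeat(np.arange(n), k)
        adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected

def gcn_layer(features, adj, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)

def scene_embedding(features, boxes, weight, k=2):
    """Mean-pool the node outputs into one graph-based scene vector (assumed pooling)."""
    adj = build_adjacency(boxes, k)
    return gcn_layer(features, adj, weight).mean(axis=0)
```

In a metric-based few-shot setup, vectors like the one returned by `scene_embedding` could be compared between support and query images, with `weight` trained by the few-shot loss; that wiring is left out of the sketch.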

Compositional scene modeling with global object-centric representations

Learning Global Object-Centric Representations via Disentangled Slot Attention

Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition

Generative Modeling of Infinite Occluded Objects for Compositional Scene Representation

Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training

Unsupervised Object-Centric Learning from Multiple Unspecified Viewpoints

OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning

Compositional Scene Representation Learning via Reconstruction: A Survey

Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints

Superpixel Segmentation Based Structural Scene Recognition.

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation

Learning to Infer Unseen Attribute-Object Compositions

Panoptic Compositional Feature Field for Editable Scene Rendering with Network-Inferred Labels Via Metric Learning

Towards Scene Understanding with Detailed 3D Object Representations

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

Scene Recognition with Objectness, Attribute and Category Learning

Composite Object Relation Modeling for Few-Shot Scene Recognition

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

COAT: Measuring Object Compositionality in Emergent Representations.

Hierarchical space tiling for scene modeling