Abstract: 3D dense captioning, an emerging vision-language task, aims to identify and locate each object in a set of point clouds and to generate a distinctive natural language sentence describing each located object. However, existing methods mainly focus on mining inter-object relationships while ignoring contextual information, especially the non-object details and background environment within the point clouds, which leads to low-quality descriptions, such as inaccurate relative position information. In this paper, we make the first attempt to use point cloud clustering features as contextual information that supplies the non-object details and background environment of the point clouds, and we incorporate them into the 3D dense captioning task. We propose two separate modules, Global Context Modeling (GCM) and Local Context Modeling (LCM), which model the context of the point clouds in a coarse-to-fine manner. Specifically, the GCM module captures the inter-object relationships among all objects together with global contextual information to obtain more complete information about the whole scene. The LCM module exploits the influence of the target object's neighboring objects and local contextual information to enrich the object representations. With these global and local contextual modeling strategies, our proposed model can effectively characterize the object representations and contextual information and thereby generate comprehensive and detailed descriptions of the located objects. Extensive experiments on the ScanRefer and Nr3D datasets demonstrate that our proposed method achieves state-of-the-art results on the 3D dense captioning task and verify the effectiveness of the proposed contextual modeling of point clouds.
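The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how GCM and LCM are implemented, so everything below is an assumption made for illustration: plain scaled dot-product attention stands in for both modules, neighbors are chosen by Euclidean distance between centers, and the names `contextualize`, `attend`, `nearest`, and the parameter `k` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys):
    """Scaled dot-product attention of one query over `keys` (keys double as values)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in keys])
    return [sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(len(query))]

def nearest(center, centers, k):
    """Indices of the (at most) k entries of `centers` closest to `center`."""
    order = sorted(range(len(centers)),
                   key=lambda i: math.dist(center, centers[i]))
    return order[:k]

def contextualize(obj_feats, obj_centers, ctx_feats, ctx_centers, k=2):
    """Hypothetical coarse-to-fine context mixing (not the paper's architecture)."""
    # Coarse (global) step: each object attends over all objects
    # and all context-cluster features in the scene.
    pool = obj_feats + ctx_feats
    global_feats = [attend(f, pool) for f in obj_feats]

    # Fine (local) step: each object attends over itself, its k nearest
    # neighboring objects, and its k nearest context clusters.
    enriched = []
    for i, feat in enumerate(global_feats):
        others = [j for j in range(len(obj_feats)) if j != i]
        near_obj = nearest(obj_centers[i], [obj_centers[j] for j in others], k)
        near_ctx = nearest(obj_centers[i], ctx_centers, k)
        local_pool = ([feat]
                      + [global_feats[others[j]] for j in near_obj]
                      + [ctx_feats[j] for j in near_ctx])
        local = attend(feat, local_pool)
        # Residual combination of the global and local views.
        enriched.append([a + b for a, b in zip(feat, local)])
    return enriched
```

In this sketch, `ctx_feats` plays the role of the clustering features that carry non-object and background information; a captioning decoder would then consume the enriched object vectors. Running the global step before the local one mirrors the coarse-to-fine ordering described above, but the choice of attention and residual addition is purely illustrative.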

Modeling Local and Global Contexts for Image Captioning

CSTNet: Enhancing Global-to-Local Interactions for Image Captioning

Towards local visual modeling for image captioning

Local-global Visual Interaction Attention for Image Captioning

Context and Attribute Grounded Dense Captioning

Hierarchical decoding with latent context for image captioning

Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

Image Caption with Global-Local Attention

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Exploring Visual Relationship for Image Captioning

Contextual Modeling for 3D Dense Captioning on Point Clouds

GLCM: Global–Local Captioning Model for Remote Sensing Image Captioning

Intra-Image Region Context for Image Captioning

Improving Image Captioning via Enhancing Dual-Side Context Awareness

Local-to-Global Semantic Supervised Learning for Image Captioning

Region-Aware Image Captioning Via Interaction Learning

LG-MLFormer: Local and Global MLP for Image Captioning

Context-Aware Transformer for image captioning

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Contextual and Selective Attention Networks for Image Captioning