Abstract: 3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes. To this end, we introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR point cloud and a set of RGB images captured by the panoramic camera rig. The expected output is a set of object boxes with captions. To tackle this task, we propose the TOD3Cap network, which leverages the BEV representation to generate object box proposals and integrates Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. We also introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes. Notably, our TOD3Cap network can effectively localize and caption 3D objects in outdoor scenes, outperforming baseline methods by a significant margin (+9.6 CIDEr@0.5IoU). Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap.

What problem does this paper attempt to address?

The problem this paper addresses is **3D dense captioning in outdoor scenes**. Specifically, the paper takes LiDAR point clouds and panoramic RGB images as input and produces bounding boxes for the objects in an outdoor scene, together with a natural-language description of each.

### Main Problems and Challenges

1. **Domain Gap**:
   - **Dynamic vs. Static**: Outdoor scenes are usually dynamic and require the detection and tracking of objects that change over time.
   - **Sparse LiDAR Point Clouds**: LiDAR point clouds in outdoor scenes are usually sparser, which makes shape understanding harder.
   - **Fixed Camera Viewpoint**: Outdoor scenes are usually captured by a fixed 6-camera rig, resulting in more severe self-occlusion.
   - **Larger Area**: Outdoor scenes usually cover a larger area.
2. **Data Scarcity**:
   - There is a lack of outdoor datasets with comprehensive bounding box-caption pair annotations, which further hinders adapting existing indoor methods.

### Solutions

To solve these problems, the authors propose a new task, **outdoor 3D dense captioning**, and develop the following for it:

1. **TOD3Cap Network**:
   - Generate object bounding box proposals from a BEV (Bird's-Eye-View) representation, and combine Relation Q-Former with LLaMA-Adapter to generate rich captions.
   - Take multi-modal input (LiDAR point clouds and panoramic RGB images) and generate object proposals through a query mechanism.
   - Use the adapter to convert visual features into prompts for the language model, which then generates the dense captions.
2. **TOD3Cap Dataset**:
   - Construct a large-scale multi-modal dataset that extends the nuScenes dataset, containing 2.3 million descriptions of 64.3 thousand outdoor objects from 850 scenes.
   - Annotate each object in four aspects: appearance, motion, environment, and relationship.
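The four-aspect annotation described above can be pictured as a simple record per object. The following is a minimal sketch only: the class name `ObjectCaption`, its field names, the box layout, and the example sentences are all hypothetical and are not taken from the TOD3Cap release, whose actual file format may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ObjectCaption:
    """Hypothetical box-caption pair; names and layout are illustrative only."""

    # Assumed box layout: (x, y, z, length, width, height, yaw).
    box: Tuple[float, float, float, float, float, float, float]
    appearance: str    # what the object looks like
    motion: str        # how the object is moving
    environment: str   # where the object is in the scene
    relationship: str  # how the object relates to other objects

    def full_caption(self) -> str:
        """Join the non-empty aspect sentences into one dense caption."""
        parts = [self.appearance, self.motion, self.environment, self.relationship]
        return " ".join(p.strip() for p in parts if p.strip())


# Made-up example values, purely for illustration.
example = ObjectCaption(
    box=(12.0, -3.5, 0.9, 4.6, 1.9, 1.7, 0.0),
    appearance="A white sedan.",
    motion="It is moving forward.",
    environment="It is on the road ahead.",
    relationship="It is to the left of a parked truck.",
)
print(example.full_caption())
```

Keeping the four aspects as separate fields, rather than storing one concatenated string, would make it straightforward to evaluate or ablate each aspect on its own.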
### Experimental Results

The experimental results show that the TOD3Cap network outperforms baseline methods by a significant margin, most notably by +9.6 CIDEr at an IoU threshold of 0.5 (C@0.5).

### Summary

The main contributions of the paper include:

- Introducing the outdoor 3D dense captioning task and proposing the TOD3Cap network to solve it.
- Providing the TOD3Cap dataset, which the authors describe as the largest dataset to date for outdoor 3D dense captioning.
- Demonstrating experimentally that the TOD3Cap network significantly outperforms existing methods on the outdoor 3D dense captioning task.

Through these contributions, the paper provides a foundation for future research on outdoor 3D scene understanding.
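For readers unfamiliar with the C@0.5 notation in the experimental results: 3D dense captioning work commonly reports a caption metric m (such as CIDEr) as m@kIoU, where an object's caption score counts only if its predicted box overlaps the matched ground-truth box with IoU of at least k, and the sum is averaged over all ground-truth objects. The sketch below assumes that convention; whether TOD3Cap's evaluation code matches it in every detail is an assumption, and the function name `metric_at_iou` and the toy numbers are hypothetical.

```python
from typing import Sequence


def metric_at_iou(scores: Sequence[float], ious: Sequence[float], k: float) -> float:
    """Average caption score over ground-truth objects, gated by box IoU.

    scores[i] is the caption metric (e.g. CIDEr) for ground-truth object i,
    ious[i] is the IoU between its matched predicted box and the ground truth.
    Objects whose IoU falls below k contribute zero.
    """
    if len(scores) != len(ious):
        raise ValueError("scores and ious must have the same length")
    if not scores:
        return 0.0
    kept = sum(s for s, iou in zip(scores, ious) if iou >= k)
    return kept / len(scores)


# Toy values, not results from the paper.
toy_scores = [1.0, 0.5, 2.0]
toy_ious = [0.60, 0.30, 0.55]
at_025 = metric_at_iou(toy_scores, toy_ious, 0.25)  # all three objects count
at_050 = metric_at_iou(toy_scores, toy_ious, 0.50)  # the second object is gated out
```

Because the denominator is the number of ground-truth objects rather than the number of objects that pass the threshold, the score at a stricter IoU threshold can never exceed the score at a looser one.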

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes
