TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Bu Jin,Yupeng Zheng,Pengfei Li,Weize Li,Yuhang Zheng,Sujie Hu,Xinyu Liu,Jinwei Zhu,Zhijie Yan,Haiyang Sun,Kun Zhan,Peng Jia,Xiaoxiao Long,Yilun Chen,Hao Zhao
2024-06-06
Abstract: 3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes. To this end, we introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR point cloud and a set of RGB images captured by the panoramic camera rig. The expected output is a set of object boxes with captions. To tackle this task, we propose the TOD3Cap network, which leverages the BEV representation to generate object box proposals and integrates Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. We also introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes. Notably, our TOD3Cap network can effectively localize and caption 3D objects in outdoor scenes, which outperforms baseline methods by a significant margin (+9.6 CIDEr@0.5IoU). Code, data, and models are publicly available at https://github.com/jxbbb/TOD3Cap.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses **the challenges of achieving 3D dense captioning in outdoor scenes**. Specifically, it focuses on how to use LiDAR point clouds and panoramic RGB images as input to generate bounding boxes for all objects in an outdoor scene, together with their natural-language descriptions.

### Main Problems and Challenges

1. **Domain Gap**:
   - **Dynamic vs. Static**: Outdoor scenes are usually dynamic and require detecting and tracking objects that change over time.
   - **Sparse LiDAR Point Clouds**: LiDAR point clouds in outdoor scenes are usually sparser, which makes shape understanding harder.
   - **Fixed Camera Viewpoint**: Outdoor scenes are usually captured with a fixed six-camera rig, which leads to more severe self-occlusion.
   - **Larger Area**: Outdoor scenes usually cover a larger area.
2. **Data Scarcity**:
   - There is a lack of datasets with comprehensive box-caption pair annotations specifically for outdoor scenes, which makes it difficult to directly adapt existing indoor methods.

### Solutions

To address these problems, the authors propose a new task, **outdoor 3D dense captioning**, and develop the following for it:

1. **TOD3Cap Network**:
   - Generates object bounding box proposals from a BEV (Bird's-Eye-View) representation and combines a Relation Q-Former with LLaMA-Adapter to generate rich captions.
   - Uses multi-modal input (LiDAR point clouds and panoramic RGB images) and generates object proposals through a query mechanism.
   - Uses the adapter to convert visual features into prompts for the language model, which then generates the dense captions.
2. **TOD3Cap Dataset**:
   - A large-scale multi-modal dataset that extends nuScenes, containing 2.3 million descriptions of 64.3 thousand outdoor objects from 850 scenes.
   - Each object is annotated along four aspects: appearance, motion, environment, and relationship.
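To make the task's input and output concrete, the following is a minimal sketch of how one scene sample and its box-caption pairs might be represented. All class and field names here (`Box3D`, `ObjectCaption`, `SceneSample`, and so on) are hypothetical and chosen for illustration; they are not taken from the TOD3Cap code or dataset format. Only the four caption aspects come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Box3D:
    """Hypothetical 3D box: centre, size, and heading angle."""
    center: Tuple[float, float, float]  # x, y, z (assumed metres)
    size: Tuple[float, float, float]    # length, width, height
    yaw: float = 0.0                    # rotation about the vertical axis


@dataclass
class ObjectCaption:
    """One description split into the four annotated aspects."""
    appearance: str
    motion: str
    environment: str
    relationship: str

    def as_sentence(self) -> str:
        # Join the non-empty aspects into a single caption string.
        parts = [self.appearance, self.motion, self.environment, self.relationship]
        return " ".join(p.strip() for p in parts if p.strip())


@dataclass
class BoxCaptionPair:
    """Expected output unit of the task: a box with its caption."""
    box: Box3D
    caption: ObjectCaption


@dataclass
class SceneSample:
    """Task input (LiDAR points, camera images) plus output pairs."""
    lidar_points: List[Tuple[float, float, float]]
    camera_image_paths: List[str]  # one path per camera in the rig
    pairs: List[BoxCaptionPair] = field(default_factory=list)


# Usage with made-up values, for illustration only.
caption = ObjectCaption(
    appearance="A white sedan.",
    motion="It is moving forward.",
    environment="",
    relationship="It is behind a truck.",
)
sample = SceneSample(
    lidar_points=[(0.0, 0.0, 0.0)],
    camera_image_paths=["cam_front.jpg"],
)
sample.pairs.append(BoxCaptionPair(Box3D((5.0, 1.0, 0.8), (4.5, 1.8, 1.5)), caption))
print(sample.pairs[0].caption.as_sentence())
```

Keeping the four aspects as separate fields, rather than a single string, is one possible design choice; it would allow evaluating or filtering captions per aspect.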
### Experimental Results

The experimental results show that the TOD3Cap network significantly outperforms the baseline methods on multiple evaluation metrics. The figure reported in the abstract is a gain of 9.6 CIDEr at an IoU threshold of 0.5 (C@0.5).

### Summary

The main contributions of the paper are:

- Introducing the outdoor 3D dense captioning task and proposing the TOD3Cap network to solve it.
- Providing the TOD3Cap dataset, which is, to the authors' knowledge, the largest dataset for outdoor 3D dense captioning.
- Showing experimentally that the TOD3Cap network significantly outperforms baseline methods on the outdoor 3D dense captioning task.

Together, these contributions provide a foundation for future research on outdoor 3D scene understanding.
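For readers unfamiliar with the C@0.5 notation used in the experimental results, the sketch below shows the IoU-gated caption metric as it is commonly defined in indoor 3D dense captioning work: a per-object caption score counts only if the predicted box overlaps the ground-truth box with IoU of at least k, and the gated scores are averaged. It is an assumption that TOD3Cap follows the same convention. The sketch also simplifies boxes to axis-aligned ones (outdoor boxes normally carry a heading angle), and the per-object scores passed in are placeholders rather than real CIDEr values.

```python
from typing import Sequence, Tuple

# Axis-aligned box as (xmin, ymin, zmin, xmax, ymax, zmax); a simplification.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def metric_at_k_iou(
    caption_scores: Sequence[float], ious: Sequence[float], k: float
) -> float:
    """Average caption score, zeroing objects whose box IoU is below k."""
    if len(caption_scores) != len(ious):
        raise ValueError("caption_scores and ious must have the same length")
    if not caption_scores:
        return 0.0
    gated = [s if iou >= k else 0.0 for s, iou in zip(caption_scores, ious)]
    return sum(gated) / len(gated)


# Two unit cubes offset by 0.5 along x overlap with IoU = 0.5 / 1.5.
a = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
b = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)
print(iou_3d_axis_aligned(a, b))

# Placeholder scores: the second object is dropped because its IoU < 0.5.
print(metric_at_k_iou([1.0, 0.5, 2.0], [0.6, 0.4, 0.9], k=0.5))
```

Under this definition, a model is rewarded only when it both localizes an object and describes it well, which is why the metric is reported at a specific IoU threshold.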