Abstract: Accurate perception of the dynamic environment is a fundamental task for autonomous driving and robot systems. This paper introduces Let Occ Flow, the first self-supervised work for joint 3D occupancy and occupancy flow prediction using only camera inputs, eliminating the need for 3D annotations. Utilizing TPV for unified scene representation and deformable attention layers for feature aggregation, our approach incorporates a novel attention-based temporal fusion module to capture dynamic object dependencies, followed by a 3D refine module for fine-grained volumetric representation. Besides, our method extends differentiable rendering to 3D volumetric flow fields, leveraging zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization. Extensive experiments on nuScenes and KITTI datasets demonstrate the competitive performance of our approach over prior state-of-the-art methods. Our project page is available at https://eliliu2233.github.io/letoccflow/
### Problems the paper attempts to solve
The paper addresses accurate perception of dynamic environments in autonomous driving and robotic systems. Specifically, it proposes **Let Occ Flow**, which it presents as the first self-supervised method that jointly predicts 3D occupancy and occupancy flow from camera inputs alone, without 3D annotations. The method uses a Tri-Perspective View (TPV) as a unified scene representation and aggregates features through deformable attention layers. It adds an attention-based temporal fusion module to capture dependencies among dynamic objects and a 3D refinement module to produce a fine-grained volumetric representation. To optimize scene geometry and object motion, it extends differentiable rendering to 3D volumetric flow fields and uses zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization.
### Main contributions
1. **Propose Let Occ Flow**: The first self-supervised method for joint prediction of 3D occupancy and occupancy flow, achieved by integrating 2D optical flow cues.
2. **Design a new attention-based temporal fusion module**: Effectively capture the long-distance dependencies of dynamic objects, and further propose a flow-oriented optimization strategy to alleviate training instability and sample imbalance.
3. **Extensive experimental verification**: Experiments on the nuScenes and KITTI datasets demonstrate competitive performance compared with existing state-of-the-art methods.
### Method overview
1. **Problem definition**: The goal is to predict 3D occupancy \(O_t\) and occupancy flow \(F_t\) at time step \(t\) from a temporal sequence of multi-view camera inputs.
2. **2D-to-3D encoder**: Use TPV to construct a unified representation from multi-view image inputs, and integrate multi-view, multi-scale image features through deformable cross-attention layers and feed-forward networks.
3. **Temporal fusion module**: Enhance the scene representation through ego-motion alignment and a backward-forward attention module (BFAM), capturing temporal information about geometry and object motion.
4. **3D refinement module**: Further aggregate spatial features through a residual 3D convolutional network, and generate a high-resolution refined feature volume through 3D deconvolution upsampling.
5. **Rendering-based optimization**: Optimize 3D occupancy and occupancy flow through differentiable rendering, using a reprojection photometric loss, optical flow cues, and optional LiDAR ray supervision.
6. **Flow-oriented optimization**: Introduce a two-stage optimization strategy and a dynamic decoupling scheme to alleviate the instability and sample imbalance that arise in the joint optimization of geometry and motion.
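
The rendering-based optimization in step 5 can be illustrated with a minimal sketch. It assumes NeRF-style density-based alpha compositing along a single ray, which may differ from the rendering formulation the paper actually uses; the function name `render_ray` and its arguments are hypothetical and chosen only for illustration. The point is that the same per-sample weights that render depth can also render a 3D flow vector, which is how a volumetric flow field can be compared against 2D cues.

```python
import numpy as np


def render_ray(sigma, t_vals, flow):
    """Alpha-composite depth and 3D flow along one ray (illustrative only).

    sigma:  (N,) non-negative densities at the ray samples (assumed proxy for occupancy)
    t_vals: (N,) increasing sample distances along the ray, N >= 2
    flow:   (N, 3) per-sample 3D flow vectors
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    t_vals = np.asarray(t_vals, dtype=np.float64)
    flow = np.asarray(flow, dtype=np.float64)

    # Interval length per sample; the last interval reuses the previous spacing.
    deltas = np.append(np.diff(t_vals), t_vals[-1] - t_vals[-2])

    # Opacity of each interval.
    alpha = 1.0 - np.exp(-sigma * deltas)

    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))

    # Compositing weights shared by every rendered quantity.
    weights = trans * alpha

    depth = np.sum(weights * t_vals)
    flow_3d = np.sum(weights[:, None] * flow, axis=0)
    return depth, flow_3d, weights


# A ray that hits a nearly opaque moving surface at the third sample.
depth, flow_3d, weights = render_ray(
    sigma=[0.0, 0.0, 1e3, 0.0],
    t_vals=[1.0, 2.0, 3.0, 4.0],
    flow=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.5, 0.0, -0.2], [9.0, 9.0, 9.0]],
)
```

In this example the rendered depth and flow are dominated by the third sample, and the sample behind the surface contributes nothing because the transmittance has dropped to zero. In a training setup, the rendered 3D flow would still need to be projected into the image plane with the camera parameters before being compared with an optical flow cue; that step is omitted here.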
### Experimental results
- **Self-supervised 3D occupancy prediction**: Results on the SemanticKITTI dataset show that the method achieves state-of-the-art performance in 3D occupancy prediction and depth estimation without LiDAR supervision.
- **Self-supervised occupancy and occupancy flow prediction**: Experiments on the KITTI-MOT and nuScenes datasets show that the method significantly outperforms existing rendering-based methods in occupancy and occupancy flow prediction.
- **Ablation study**: A systematic ablation analysis verifies the effectiveness of the temporal fusion module, the optimization strategy, and static flow supervision.
### Discussion
Although temporal inputs make better use of historical information, the model cannot fully handle occlusion because of the limitations of rendering-based supervision. Future research can explore long-term occupancy flow modeling and expanding the visible range through temporal supervision. In addition, the accuracy of occupancy flow prediction depends on the quality of the optical flow cues, so future work can focus on improving the quality of flow supervision. Finally, the current occupancy flow prediction does not explicitly enforce instance consistency, and future research can explore integrating instance awareness into occupancy flow prediction.