Abstract: Accurate perception of the dynamic environment is a fundamental task for autonomous driving and robotic systems. This paper introduces Let Occ Flow, the first self-supervised work for joint 3D occupancy and occupancy flow prediction using only camera inputs, eliminating the need for 3D annotations. Utilizing TPV for unified scene representation and deformable attention layers for feature aggregation, our approach incorporates a novel attention-based temporal fusion module to capture dynamic object dependencies, followed by a 3D refine module for fine-grained volumetric representation. In addition, our method extends differentiable rendering to 3D volumetric flow fields, leveraging zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization. Extensive experiments on the nuScenes and KITTI datasets demonstrate the competitive performance of our approach over prior state-of-the-art methods. Our project page is available at https://eliliu2233.github.io/letoccflow/

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses accurate perception of dynamic environments in autonomous driving and robotic systems. It proposes **Let Occ Flow**, described as the first self-supervised method that uses only camera inputs to jointly predict 3D occupancy and occupancy flow, without 3D annotations. The method uses Tri-Perspective View (TPV) for unified scene representation and aggregates features through deformable attention layers. It also introduces an attention-based temporal fusion module to capture the dependencies of dynamic objects, and produces a fine-grained volumetric representation through a 3D refinement module. To optimize scene geometry and object motion, it extends differentiable rendering to 3D flow fields and uses zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization.

### Main contributions

1. **Let Occ Flow**: The first self-supervised method for joint prediction of 3D occupancy and occupancy flow, achieved by integrating 2D optical flow cues.
2. **A new attention-based temporal fusion module**: It captures the long-distance dependencies of dynamic objects. The paper further proposes a flow-oriented optimization strategy to alleviate training instability and sample imbalance.
3. **Extensive experimental verification**: Experiments on the nuScenes and KITTI datasets demonstrate competitive performance compared with existing state-of-the-art methods.

### Method overview

1. **Problem definition**: The goal is to predict 3D occupancy \(O_t\) and occupancy flow \(F_t\) at time step \(t\) from a temporal sequence of multi-view camera inputs.
2. **2D-to-3D encoder**: Use TPV to construct a unified representation from multi-view image inputs, and integrate multi-view, multi-scale image features through deformable cross-attention layers and feed-forward networks.
3. **Temporal fusion module**: Enhance the scene representation through ego-motion alignment and a backward-forward attention module (BFAM), capturing temporal information about geometry and object motion.
4. **3D refinement module**: Further aggregate spatial features through a residual 3D convolutional network, and generate a high-resolution refined feature volume through 3D deconvolution upsampling.
5. **Rendering-based optimization**: Optimize 3D occupancy and occupancy flow through differentiable rendering, using a reprojection photometric loss, optical flow cues, and optional LiDAR ray supervision.
6. **Flow-oriented optimization**: Introduce a two-stage optimization strategy and a dynamic decoupling scheme to alleviate the instability and sample imbalance that arise when geometry and motion are optimized jointly.

### Experimental results

- **Self-supervised 3D occupancy prediction**: Results on the SemanticKITTI dataset show that the method achieves state-of-the-art performance in 3D occupancy prediction and depth estimation without LiDAR supervision.
- **Self-supervised occupancy and occupancy flow prediction**: Experiments on the KITTI-MOT and nuScenes datasets show that the method significantly outperforms existing rendering-based methods in occupancy and occupancy flow prediction.
- **Ablation study**: A systematic ablation analysis verifies the effectiveness of the temporal fusion module, the optimization strategy, and static flow supervision.
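
The rendering-based optimization step in the method overview can be illustrated with a generic volume-rendering sketch. This is not the paper's implementation: it assumes the standard NeRF-style alpha-compositing formulation with per-sample densities, and the function names (`render_weights`, `render_depth`, `render_flow`) are hypothetical. It only shows how the compositing weights that render depth along a camera ray could also composite per-sample 3D flow vectors into a single flow per ray.

```python
import math

def render_weights(sigmas, deltas):
    """Compositing weights w_i = T_i * alpha_i for the samples along one ray.

    sigmas: non-negative densities at each sample (assumed representation).
    deltas: distances between consecutive samples.
    """
    weights = []
    transmittance = 1.0  # fraction of the ray that reaches the current sample
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def render_depth(weights, depths):
    """Expected depth along the ray: sum_i w_i * t_i."""
    return sum(w * t for w, t in zip(weights, depths))

def render_flow(weights, flows):
    """Expected 3D flow along the ray: sum_i w_i * f_i, per component."""
    return tuple(sum(w * f[k] for w, f in zip(weights, flows)) for k in range(3))

# Toy ray: empty space for two samples, then an opaque surface at depth 3.
weights = render_weights([0.0, 0.0, 50.0, 50.0], [1.0, 1.0, 1.0, 1.0])
depth = render_depth(weights, [1.0, 2.0, 3.0, 4.0])
flow = render_flow(weights, [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                             (1.0, 0.5, 0.0), (9.0, 9.0, 9.0)])
```

In this toy ray almost all of the weight falls on the first opaque sample, so the rendered depth and flow are dominated by that sample, and samples behind the surface contribute almost nothing.
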
### Discussion

Although temporal inputs make better use of historical information, the model cannot fully handle occlusion because of the limitations of rendering-based supervision. Future research could explore long-term occupancy flow modeling and use temporal supervision to expand the visible range. In addition, the accuracy of occupancy flow prediction depends on the quality of the optical flow cues, so future work could focus on improving the quality of flow supervision. Finally, the current occupancy flow prediction does not explicitly enforce instance consistency; future research could explore integrating instance awareness into occupancy flow prediction.
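
To make the dependence on optical flow cues concrete, the following sketch shows one way a 2D flow cue could supervise a 3D flow vector. It is an illustration under stated assumptions, not the paper's loss: it assumes a pinhole camera, points and flows expressed in the same camera frame (ego-motion already compensated), and a per-pixel dynamic mask such as one derived from 2D segmentation. All function names and the masking policy are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def induced_pixel_flow(point, flow3d, intrinsics):
    """2D displacement caused by moving `point` by `flow3d`."""
    moved = tuple(p + f for p, f in zip(point, flow3d))
    u0, v0 = project(point, *intrinsics)
    u1, v1 = project(moved, *intrinsics)
    return (u1 - u0, v1 - v0)

def flow_l1_loss(points, flows3d, flow_cues, dynamic_mask, intrinsics):
    """Mean L1 error between induced 2D flow and the cue, on masked pixels only."""
    total, count = 0.0, 0
    for point, flow3d, cue, is_dynamic in zip(points, flows3d, flow_cues, dynamic_mask):
        if not is_dynamic:
            continue  # assumed policy: supervise only pixels marked dynamic
        du, dv = induced_pixel_flow(point, flow3d, intrinsics)
        total += abs(du - cue[0]) + abs(dv - cue[1])
        count += 1
    return total / count if count else 0.0

intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (made-up values)
loss = flow_l1_loss(
    points=[(0.0, 0.0, 10.0), (1.0, 0.0, 5.0)],
    flows3d=[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    flow_cues=[(12.0, 1.0), (3.0, 3.0)],
    dynamic_mask=[True, False],
    intrinsics=intrinsics,
)
```

Because the loss compares the projected 3D motion directly with the cue, any error in the cue is passed on to the supervision signal, which is why the quality of the flow cues matters.
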

Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction
