Abstract: Understanding the motion states of the surrounding environment is critical for safe autonomous driving. These motion states can be accurately derived from scene flow, which captures the three-dimensional motion field of points. Existing LiDAR scene flow methods extract spatial features from each point cloud and then fuse them channel-wise, resulting in the implicit extraction of spatio-temporal features. Furthermore, they utilize 2D Bird's Eye View and process only two frames, missing crucial spatial information along the Z-axis and the broader temporal context, leading to suboptimal performance. To address these limitations, we propose Flow4D, which temporally fuses multiple point clouds after the 3D intra-voxel feature encoder, enabling more explicit extraction of spatio-temporal features through a 4D voxel network. However, while using 4D convolution improves performance, it significantly increases the computational load. For further efficiency, we introduce the Spatio-Temporal Decomposition Block (STDB), which combines 3D and 1D convolutions instead of using heavy 4D convolution. In addition, Flow4D further improves performance by using five frames to take advantage of richer temporal information. As a result, the proposed method achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time, and won 1st place in the 2024 Argoverse 2 Scene Flow Challenge. The code is available at https://github.com/dgist-cvlab/Flow4D.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the performance and efficiency problems of LiDAR scene flow estimation in autonomous driving. Specifically, existing LiDAR scene flow methods have the following limitations:

1. **Loss of spatial information**: Existing methods usually use a 2D Bird's Eye View (BEV) representation, which loses spatial information along the Z-axis.
2. **Insufficient temporal information**: Most methods process only two frames and fail to use the richer historical temporal information.
3. **Inadequate spatio-temporal feature extraction**: Traditional methods first extract spatial features from each point cloud and then obtain temporal correlation through channel fusion. This approach cannot explicitly extract spatio-temporal features.

To solve these problems, the paper proposes Flow4D, a LiDAR scene flow estimation framework based on 4D voxel networks. The main improvements of Flow4D include:

- **4D voxel representation**: Adding a time dimension to 3D voxels forms a 4D voxel representation from which spatio-temporal features can be explicitly extracted.
- **Spatio-Temporal Decomposition Block (STDB)**: To reduce the computational burden of 4D convolution, the 4D convolution is decomposed into a 3D spatial convolution and a 1D temporal convolution.
- **Multi-frame fusion**: Five consecutive frames are used to capture richer spatio-temporal information, thereby improving the accuracy of scene flow estimation.

These improvements enable Flow4D to achieve 45.9% higher performance than existing methods on the Argoverse 2 dataset while maintaining high computational efficiency in real-time operation.
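To give a rough sense of why decomposing a 4D convolution helps, the sketch below counts kernel weights for a single dense 4D convolution versus a 3D spatial convolution followed by a 1D temporal convolution. The kernel size of 3 along every axis and the 16 input and output channels are assumptions chosen for illustration, and `conv_weight_count` is a hypothetical helper. The actual layer layout of STDB is not described in this summary, so this is only a back-of-the-envelope comparison.

```python
def conv_weight_count(kernel_shape, in_channels, out_channels):
    """Number of weights in a dense convolution kernel (bias ignored)."""
    taps = 1
    for k in kernel_shape:
        taps *= k
    return taps * in_channels * out_channels


# Assumed for illustration: kernel size 3 on every axis, 16 channels in and out.
C = 16

full_4d = conv_weight_count((3, 3, 3, 3), C, C)   # one 4D convolution (x, y, z, t)
spatial_3d = conv_weight_count((3, 3, 3), C, C)   # 3D spatial convolution (x, y, z)
temporal_1d = conv_weight_count((3,), C, C)       # 1D temporal convolution (t)
decomposed = spatial_3d + temporal_1d

ratio = full_4d / decomposed

print("4D conv weights:      ", full_4d)
print("3D + 1D conv weights: ", decomposed)
print("reduction factor:     ", ratio)
```

Under these assumptions the decomposed pair needs well under half the weights of the full 4D kernel. The gap widens with larger kernels, because the 4D tap count grows as the product of all four kernel sizes while the decomposed count grows as a sum of two smaller products.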
### Formula summary

- **Scene flow vector decomposition**:
  \[ F_{t,t+1} = F_{t,t+1}^{\text{ego}} + F_{t,t+1}^{\text{motion}} \]
  where \(F_{t,t+1}^{\text{ego}}\) represents the ego-vehicle motion and \(F_{t,t+1}^{\text{motion}}\) represents the motion vectors of each point.
- **Voxelized feature extraction**:
  - Initial point feature \(f_{\tau}^{p} \in \mathbb{R}^{N_{\tau} \times 16}\)
  - Initial voxel feature \(f_{\tau}^{v} \in \mathbb{R}^{W \times L \times H \times 16}\)
- **4D voxel feature**:
  \[ f^{4D} \in \mathbb{R}^{W \times L \times H \times 5 \times 16} \]

These formulas and methods work together to enable Flow4D to achieve significant performance improvement in the scene flow estimation task.
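The decomposition and the tensor shapes above can be illustrated with a small standard-library sketch. It assumes that the ego-motion component of a point's flow is the displacement produced by a rigid transform, \(F^{\text{ego}} = (Rp + t) - p\). That form is a common convention and is not stated in this summary. All helper names and numeric values (the transform, the point, the total flow, and the grid size 512 × 512 × 32) are hypothetical.

```python
def apply_rigid(rotation, translation, point):
    """Apply p' = R p + t to a 3D point stored as a plain list."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def ego_flow(rotation, translation, point):
    """Assumed form of the ego component: F_ego = (R p + t) - p."""
    moved = apply_rigid(rotation, translation, point)
    return [m - p for m, p in zip(moved, point)]


def motion_residual(total_flow, ego):
    """F_motion = F - F_ego, rearranged from F = F_ego + F_motion."""
    return [f - e for f, e in zip(total_flow, ego)]


def stacked_shape(voxel_shape, num_frames):
    """Shape after stacking per-frame (W, L, H, C) grids along a new time axis."""
    w, l, h, c = voxel_shape
    return (w, l, h, num_frames, c)


# Hypothetical values: no rotation, static points appear to shift 1 m backward in x.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-1.0, 0.0, 0.0]
p = [10.0, 2.0, 0.5]
total = [-0.5, 0.0, 0.0]  # hypothetical total flow of this point

f_ego = ego_flow(R, t, p)
f_motion = motion_residual(total, f_ego)
shape_4d = stacked_shape((512, 512, 32, 16), num_frames=5)

print("ego flow:   ", f_ego)
print("motion flow:", f_motion)
print("4D shape:   ", shape_4d)
```

In this toy case the point's own motion is the part of the total flow that the ego transform does not explain, and the five per-frame voxel grids combine into a single tensor whose fourth axis is time, matching the \(W \times L \times H \times 5 \times 16\) shape listed above.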

Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation

Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity

DeepLiDARFlow: A Deep Learning Architecture For Scene Flow Estimation Using Monocular Camera and Sparse LiDAR

LiDAR-Flow: Dense Scene Flow Estimation from Sparse LiDAR and Stereo Images

D^3FlowSLAM: Self-Supervised Dynamic SLAM with Flow Motion Decomposition and DINO Guidance

DeFlow: Decoder of Scene Flow Network in Autonomous Driving

ICP-Flow: LiDAR Scene Flow Estimation with ICP

STARFlow: Spatial Temporal Feature Re-embedding with Attentive Learning for Real-world Scene Flow

SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving

DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds

ZeroFlow: Scalable Scene Flow via Distillation

FeatFlow: Learning Geometric Features for 3D Motion Estimation

Let-It-Flow: Simultaneous Optimization of 3D Flow and Object Clustering

3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling

Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion

SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving

FlowFusion: Dynamic Dense RGB-D SLAM Based on Optical Flow

3D Scene Flow Estimation on Pseudo-LiDAR: Bridging the Gap on Estimating Point Motion

Hierarchical Attention Learning of Scene Flow in 3D Point Clouds

CamLiFlow: Bidirectional Camera-LiDAR Fusion for Joint Optical Flow and Scene Flow Estimation

Self-Supervised Scene Flow Estimation with Point-Voxel Fusion and Surface Representation