RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

Zhexiong Wan, Yuxin Mao, Jing Zhang, Yuchao Dai
DOI: https://doi.org/10.48550/arXiv.2309.15082
2023-09-27
Abstract: Recently, the RGB images and point clouds fusion methods have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly-dynamic scenes. By contrast, the event camera can asynchronously capture the intensity changes with a very high temporal resolution, providing complementary dynamic information of the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset is available at https://npucvr.github.io/RPEFlow.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the joint estimation of 2D optical flow and 3D scene flow in complex dynamic scenes. Conventional RGB cameras and LiDAR sensors acquire data frame by frame at fixed, low sampling rates, which limits their performance in highly dynamic scenes. Event cameras, in contrast, asynchronously capture brightness changes with very high temporal resolution and so provide complementary dynamic information about the observed scene. The paper therefore proposes RPEFlow, a multi-stage multimodal fusion model that combines RGB images, point clouds, and event data to improve the accuracy of 2D optical flow and 3D scene flow estimation.

### Main Contributions

1. **Introduction of Event Cameras**: Combining event cameras with RGB cameras and LiDAR sensors for joint estimation of 2D optical flow and 3D scene flow in complex dynamic scenes, which the paper frames as a new and practical problem.
2. **Multimodal Attention Fusion Module**: Proposing an implicit multimodal attention fusion module that explores the internal correlations among the three modalities through a cross-attention mechanism.
3. **Explicit Mutual Information Regularization**: Introducing an explicit mutual information regularization term that models the complementary information among the three modalities for effective multimodal feature learning.
4. **Large-Scale Synthetic Dataset**: Contributing a new large-scale synthetic dataset whose simulated scenes obey gravity and collision detection, and which includes a wider variety of moving objects and rich annotations.

### Method Overview

1. **Multimodal Attention Fusion (MAF)**:
   - In the 2D branch, project point cloud features onto the image plane and fuse the image features with the auxiliary features (events and projected point clouds).
   - In the 3D branch, project image and event features into 3D space and fuse them with point cloud features.
   - Use a cross-attention mechanism to explore the correlations among different modalities.
2. **Mutual Information Regularization (MIR)**:
   - Explicitly model cross-modal dependencies by minimizing mutual information, which discourages redundant features across modalities. The regularization term is computed using a variational upper bound and Gaussian latent codes.
3. **Pyramid Multi-Stage Fusion Framework**:
   - Estimate 2D optical flow and 3D scene flow from coarse to fine through multi-stage feature fusion.
   - Perform multimodal attention fusion and mutual information regularization at each fusion stage to fully utilize the complementary information provided by event data.

### Experimental Results

- **Synthetic Data**: Experiments on the FlyingThings3D and EKubric datasets show that RPEFlow significantly outperforms existing methods in estimating 2D optical flow and 3D scene flow.
- **Real Data**: Experiments on the DSEC dataset further validate RPEFlow in real-world scenarios, especially in highly dynamic scenes and in regions with fine moving details or motion blur.

### Conclusion

By adding event cameras to RGB images and point cloud data, RPEFlow achieves more accurate 2D optical flow and 3D scene flow estimation in complex dynamic scenes. The experimental results support the effectiveness of both multimodal attention fusion and mutual information regularization, and indicate the method's potential for practical applications.
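
The cross-attention fusion step named in the Method Overview can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `cross_attention_fuse`, the feature shapes, the residual connection, and the random projection matrices are all assumptions chosen for illustration. It only shows the general pattern in which features of a branch's primary modality act as queries and features from the auxiliary modalities (assumed to be already projected into the same domain) act as keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fuse(primary, auxiliary, w_q, w_k, w_v):
    """Fuse auxiliary-modality features into primary-modality features.

    Hypothetical layout (an assumption, not taken from the paper):
      primary:   (N, C) features of the branch's own modality
      auxiliary: (M, C) features of the other modalities, already
                 projected into the primary branch's domain
      w_q, w_k:  (C, D) query/key projections
      w_v:       (C, C) value projection
    Returns the fused (N, C) features and the (N, M) attention map.
    """
    q = primary @ w_q                        # queries from the primary modality
    k = auxiliary @ w_k                      # keys from the auxiliary modalities
    v = auxiliary @ w_v                      # values from the auxiliary modalities
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarity
    attn = softmax(scores, axis=-1)          # each row sums to 1
    fused = primary + attn @ v               # residual fusion (assumed design)
    return fused, attn


# Toy example: 6 image-feature tokens attend over 10 auxiliary tokens
# (standing in for event features and projected point cloud features).
rng = np.random.default_rng(0)
N, M, C, D = 6, 10, 8, 4
image_feat = rng.standard_normal((N, C))
aux_feat = rng.standard_normal((M, C))
w_q = rng.standard_normal((C, D)) * 0.1
w_k = rng.standard_normal((C, D)) * 0.1
w_v = rng.standard_normal((C, C)) * 0.1

fused, attn = cross_attention_fuse(image_feat, aux_feat, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch the 3D branch would follow the same pattern with the roles swapped: point cloud features as the primary input and image and event features, lifted into 3D, as the auxiliary input. In a trained model the projection matrices would be learned rather than sampled at random.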