Abstract: Recently, methods that fuse RGB images and point clouds have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly dynamic scenes. By contrast, the event camera can asynchronously capture intensity changes with a very high temporal resolution, providing complementary dynamic information about the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for the 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of the three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset are available at <a class="link-external link-https" href="https://npucvr.github.io/RPEFlow" rel="external noopener nofollow">https://npucvr.github.io/RPEFlow</a>.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the problem of jointly estimating 2D optical flow and 3D scene flow in complex dynamic scenes. Specifically, traditional RGB cameras and LiDAR sensors, due to their fixed frame rate data acquisition mechanisms, are limited in performance when dealing with highly dynamic scenes. In contrast, event cameras can asynchronously capture brightness changes with very high temporal resolution, providing dynamic information about the observed scene. Therefore, the paper proposes a multimodal fusion model, RPEFlow, which combines RGB images, point clouds, and event data to improve the accuracy of 2D optical flow and 3D scene flow estimation.

### Main Contributions

1. **Introduction of Event Cameras**: Combining event cameras with RGB cameras and LiDAR sensors for joint estimation of 2D optical flow and 3D scene flow in complex dynamic scenes, constituting a new and practical problem.
2. **Multimodal Attention Fusion Module**: Proposing an implicit multimodal attention fusion module that explores the internal correlations among the three modalities through a cross-attention mechanism.
3. **Explicit Mutual Information Regularization**: Introducing an explicit mutual information regularization term to maximize the complementary information among the three modalities, achieving effective multimodal feature learning.
4. **Large-Scale Synthetic Dataset**: Contributing a new large-scale synthetic dataset that includes simulated data conforming to gravity models and collision detection, as well as a wider variety of moving objects and rich annotations.

### Method Overview

1. **Multimodal Attention Fusion (MAF)**:
   - In the 2D branch, project point cloud features onto the image plane and fuse them with auxiliary features (events and point clouds).
   - In the 3D branch, project image and event features into 3D space and fuse them with point cloud features.
   - Use a cross-attention mechanism to explore the correlations among different modalities.
2. **Mutual Information Regularization (MIR)**:
   - Explicitly model cross-modal dependencies by minimizing mutual information, using a variational upper bound and Gaussian latent codes to compute the mutual information regularization term.
3. **Pyramid Multi-Stage Fusion Framework**:
   - Estimate 2D optical flow and 3D scene flow from coarse to fine through multi-stage feature fusion.
   - Perform multimodal attention fusion and mutual information regularization at each fusion stage to fully utilize the complementary information provided by event data.

### Experimental Results

- **Synthetic Data**: Experiments on the FlyingThings3D and EKubric datasets show that RPEFlow significantly outperforms existing methods in estimating 2D optical flow and 3D scene flow.
- **Real Data**: Experiments on the DSEC dataset further validate the superior performance of RPEFlow in real-world scenarios, especially in highly dynamic and detailed moving or motion-blurred regions.

### Conclusion

By introducing event cameras and combining RGB images and point cloud data, RPEFlow achieves more accurate 2D optical flow and 3D scene flow estimation in complex dynamic scenes. Experimental results validate the effectiveness of multimodal attention fusion and mutual information regularization, demonstrating the potential of this method in practical applications.
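The cross-attention fusion mentioned in the Method Overview can be illustrated with a minimal NumPy sketch. This is not the RPEFlow implementation: it shows generic single-head scaled dot-product cross-attention, where features of a primary modality (for example, image features) query features of an auxiliary modality (for example, event features) and the attended result is added back as a residual. The function name `cross_attention_fuse`, the projection matrices `w_q`, `w_k`, `w_v`, the feature sizes, and the residual connection are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(primary, auxiliary, w_q, w_k, w_v):
    """Fuse auxiliary-modality features into primary-modality features.

    primary:   (N, C) features of the modality being refined (hypothetical layout)
    auxiliary: (M, C) features of the modality providing extra cues
    w_q, w_k, w_v: (C, C) projection matrices (random here, learned in practice)
    Returns the fused (N, C) features and the (N, M) attention weights.
    """
    q = primary @ w_q                          # queries come from the primary modality
    k = auxiliary @ w_k                        # keys come from the auxiliary modality
    v = auxiliary @ w_v                        # values come from the auxiliary modality
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarity
    attn = softmax(scores, axis=-1)            # each row sums to 1 over auxiliary positions
    fused = primary + attn @ v                 # residual fusion (an assumption of this sketch)
    return fused, attn


# Toy example: 16 positions with 8-channel features per modality.
rng = np.random.default_rng(0)
C = 8
image_feats = rng.standard_normal((16, C))
event_feats = rng.standard_normal((16, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

fused, attn = cross_attention_fuse(image_feats, event_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch the primary and auxiliary features need not have the same number of positions, which is one reason cross-attention is a convenient way to combine dense image features with sparser point cloud or event features.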

RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation
