Abstract: Existing event stream-based pattern recognition models usually represent the event stream as point clouds, voxels, images, etc., and design various deep neural networks to learn their features. Although these models achieve considerable results in simple cases, their performance may be limited by monotonous modality expressions and by sub-optimal fusion and readout mechanisms. In this paper, we propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++. It simultaneously models two common event representations, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be learned separately by a Transformer and a Graph Neural Network (GNN). We believe that the features of each representation still contain both effective and redundant components, and that directly fusing them without differentiation may yield a sub-optimal solution. Thus, we divide each feature into three levels: we retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features are then fed into the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of the features used as final representations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on multiple widely used event stream-based classification datasets. Specifically, we achieve new state-of-the-art performance of $90.51\%$ on the Bullying10k dataset, which exceeds the second-best result by $2.21\%$. The source code of this paper has been released at \url{https://github.com/Event-AHU/EFV_event_classification/tree/EFVpp}.
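
The retain/blend/exchange idea can be illustrated with a minimal sketch. The abstract does not specify how feature quality is scored, how the three levels are delimited, or how blending is computed, so everything below is an assumption made for illustration: the function names `split_levels` and `enhance` are hypothetical, quality scores are given as plain floats, the levels are equal thirds of the ranking, blending is a simple average, and tokens of the two branches are assumed to be aligned by index. It is not the released EFV++ implementation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def split_levels(scores: Sequence[float]) -> Tuple[List[int], List[int], List[int]]:
    """Rank token indices by score (descending) and split them into
    high / medium / low groups. Equal thirds are an assumption."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    third = len(order) // 3
    high = order[:third]
    medium = order[third:len(order) - third]
    low = order[len(order) - third:]
    return high, medium, low


def enhance(own: Sequence[Vector], other: Sequence[Vector],
            scores: Sequence[float]) -> List[Vector]:
    """Enhance one branch's tokens using the other branch (hypothetical helper).

    - high-quality tokens are retained unchanged,
    - medium-quality tokens are blended (averaged) with the other branch,
    - low-quality tokens are exchanged for the other branch's tokens.
    """
    high, medium, low = split_levels(scores)
    out = [list(tok) for tok in own]  # retain: start from a copy of own tokens
    for i in medium:  # blend: element-wise average (assumed blending rule)
        out[i] = [(a + b) / 2.0 for a, b in zip(own[i], other[i])]
    for i in low:  # exchange: take the counterpart token
        out[i] = list(other[i])
    return out


if __name__ == "__main__":
    image_tokens = [[1.0], [2.0], [3.0]]
    voxel_tokens = [[10.0], [20.0], [30.0]]
    image_scores = [0.9, 0.5, 0.1]  # hypothetical quality scores
    print(enhance(image_tokens, voxel_tokens, image_scores))
```

Applying `enhance` once per branch (image tokens against voxel tokens, and vice versa) gives the two enhanced feature sets that, in the framework described above, would then be passed on to the fusion stage.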

MVF-Net: A Multi-view Fusion Network for Event-based Object Classification

Multi-View Adaptive Fusion Network for 3D Object Detection

VMV-GCN: Volumetric Multi-View Based Graph CNN for Event Stream Classification

AMVFNet: Attentive Multi-View Fusion Network for 3D Object Detection

Multi-view Instance Attention Fusion Network for Classification

Multi-View Hierarchical Fusion Network for 3D Object Retrieval and Classification

FE-Fusion-VPR: Attention-based Multi-Scale Network Architecture for Visual Place Recognition by Fusing Frames and Events

Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification

Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition

An attention fusion network for event-based vehicle object detection

MVFuseNet: Improving End-to-End Object Detection and Motion Forecasting through Multi-View Fusion of LiDAR Data

Event-centric multi-modal fusion method for dense video captioning

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification

Multi-View Vision Fusion Network: Can 2D Pre-Trained Model Boost 3D Point Cloud Data-Scarce Learning?

VIDF-Net: A Voxel-Image Dynamic Fusion Method for 3D Object Detection

MVX-Net: Multimodal VoxelNet for 3D Object Detection

Frustum FusionNet: Amodal 3D Object Detection with Multi-Modal Feature Fusion

MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection

PVNet: A Joint Convolutional Network of Point Cloud and Multi-View for 3D Shape Recognition

MANet: Multimodal Attention Network based Point-View fusion for 3D Shape Recognition

MAFusion: Multiscale Attention Network for Infrared and Visible Image Fusion