Abstract: Human activity recognition (HAR) enables real-time monitoring of human movement, posture, and activity level, and can provide valuable information for health management. With the continuous advancement of Internet of Things (IoT) technology, wearable sensors and smartphones equipped with various types of sensors are now widely used to collect multimodal data for HAR. However, current fusion methods for multimodal HAR fall short in capturing inter-modality correlations, which prevents full exploitation of the complementary information between modalities and lowers recognition accuracy. We thus propose a novel multiscale cross-modal interactive fusion network (MCIFN), which captures correlations between the various modalities and obtains an effective fused representation for HAR. Specifically, we employ a multiscale parallel convolution module to extract features from each modality at multiple scales. Then, an interactive fusion strategy based on the cross-modal attention mechanism is introduced to adjust and enhance each modality based on its correlations with the other modalities. To resolve the information redundancy caused by the interactive fusion strategy, we use a hybrid attention module to focus on the important information in the fused representation. Extensive experiments conducted on three publicly available datasets and one private dataset demonstrate that our proposed network outperforms previous baseline networks for HAR. In addition, our proposed fusion strategy improves accuracy by 1.87% to 9.96% compared with existing strategies. These findings imply that the proposed network can realize comprehensive multimodal fusion and effectively enhance HAR accuracy, potentially contributing to advancements in individual health management and personalized healthcare interventions.
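The interactive fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' MCIFN implementation. It is a minimal, dependency-free illustration of generic scaled dot-product cross-modal attention, under several assumptions that do not come from the abstract: two hypothetical modalities (named `acc` and `gyro` here), identity query/key/value projections in place of learned weights, a residual connection, equal numbers of time steps in both modalities, and fusion by feature-wise concatenation. The function names are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(target, source):
    """Enhance `target` features using their correlation with `source`.

    target: list of feature vectors (one per time step), each of length d
    source: list of feature vectors from the other modality, each of length d
    Assumption: queries come from `target`, keys and values from `source`,
    with no learned projections (a real network would learn them).
    """
    d = len(target[0])
    scale = math.sqrt(d)
    enhanced = []
    for q in target:
        # Correlation of this target step with every source step.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in source]
        weights = softmax(scores)
        # Weighted sum of source features.
        attended = [sum(w * v[j] for w, v in zip(weights, source)) for j in range(d)]
        # Residual connection (an assumption): keep the original feature.
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
    return enhanced


def interactive_fusion(mod_a, mod_b):
    """Let each modality attend to the other, then concatenate per time step.

    Assumes both modalities have the same number of time steps.
    """
    a_enh = cross_modal_attention(mod_a, mod_b)
    b_enh = cross_modal_attention(mod_b, mod_a)
    return [a + b for a, b in zip(a_enh, b_enh)]


# Hypothetical toy features: 3 time steps, 2 features per modality.
acc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gyro = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
fused = interactive_fusion(acc, gyro)
```

In this sketch each time step of one modality is re-weighted by how strongly it correlates with the time steps of the other, which is the general idea behind cross-modal attention. The multiscale convolution and hybrid attention modules mentioned in the abstract are not modeled here.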

Multi-Cue Information Fusion For Two-Layer Activity Recognition

Hierarchical Complex Activity Representation and Recognition Using Topic Model and Classifier Level Fusion

Sample Intercorrelation-Based Multidomain Fusion Network for Aquatic Human Activity Recognition Using Millimeter-Wave Radar

Activity Recognition Exploiting Classifier Level Fusion of Acceleration and Physiological Signals

Human Activity Recognition by Multi-Modal Context Fusion

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Multilevel Depth and Image Fusion for Human Activity Detection

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

Deep Fusion of Multiple Semantic Cues for Complex Event Recognition

Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition

Human Interaction Recognition Method Based on Parallel Multi-Feature Fusion Network

Multi-Sensor Data Fusion for Accurate Human Activity Recognition with Deep Learning

Multi-resolution Fusion Convolutional Network for Open Set Human Activity Recognition

A Multiscale Cross-Modal Interactive Fusion Network for Human Activity Recognition Using Wearable Sensors and Smartphones

Multi-modality Fusion Network for Action Recognition

Human-centric multimodal fusion network for robust action recognition

Human Behavior Recognition Based on CNN-LSTM Hybrid and Multi-Sensing Feature Information Fusion

Confusion Mixup Regularized Multimodal Fusion Network for Continual Egocentric Activity Recognition

Multi-cue Combination Network for Action-Based Video Classification

M&M: Recognizing Multiple Co-evolving Activities from Multi-source Videos

Multi-Sensor Data Fusion and CNN-LSTM Model for Human Activity Recognition System