Abstract: Human activity recognition (HAR) will be an essential function of various emerging applications. However, HAR typically encounters challenges related to modality limitations and label scarcity, leading to an application gap between current solutions and real-world requirements. In this work, we propose MESEN, a multimodal-empowered unimodal sensing framework, to utilize unlabeled multimodal data available during the HAR model design phase for unimodal HAR enhancement during the deployment phase. From a study on the impact of supervised multimodal fusion on unimodal feature extraction, MESEN is designed to feature a multi-task mechanism during the multimodal-aided pre-training stage. With the proposed mechanism integrating cross-modal feature contrastive learning and multimodal pseudo-classification aligning, MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality. Subsequently, MESEN can adapt to downstream unimodal HAR with only a few labeled samples. Extensive experiments on eight public multimodal datasets demonstrate that MESEN achieves significant performance improvements over state-of-the-art baselines in enhancing unimodal HAR by exploiting multimodal data.

What problem does this paper attempt to address?

This paper addresses two challenges that real-world human activity recognition (HAR) applications face: modality limitations and label scarcity. Together they leave a gap between current solutions and practical needs. Specifically, the paper asks how unlabeled multimodal data can be used to improve unimodal HAR when only a few labels are available. Annotation is costly and time-consuming, so in practice only a small number of labeled samples are available, while unlabeled data is relatively easy to obtain. In addition, although multimodal research is becoming increasingly prominent, unimodal HAR is still the most typical form of application. The paper therefore proposes MESEN, a framework that uses unlabeled multimodal data to design unimodal HAR models with only a few labels, with the aim of a general performance improvement.

MESEN addresses these problems in the following ways:

1. **Multi-task mechanism**: In the multimodal-aided pre-training stage, MESEN integrates cross-modal feature contrastive learning and multimodal pseudo-classification aligning to extract effective unimodal features from unlabeled multimodal data.
2. **Cross-modal feature contrastive learning**: This objective emphasizes the similarity between paired multimodal features to capture inter-modal correlation, and it preserves the differences between modalities by not considering intra-modal differences.
3. **Multimodal pseudo-classification aligning**: This objective exploits multimodal correlation in the representation space of the classification stage, and it uses the pseudo-classification task as a hint for the downstream recognition task to further improve the model's generalization ability.

Together, these components let MESEN exploit unlabeled multimodal data and improve unimodal HAR performance when only a few labeled samples are available.
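
To make the cross-modal contrastive idea more concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired features from two modalities. It is not the authors' implementation: the function names, the `temperature` parameter, the use of cosine similarity, and the choice to draw negatives only from the other modality are all assumptions made for illustration. The summary above only states that paired multimodal features are pulled together and that intra-modal differences are not considered.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """Hypothetical symmetric InfoNCE-style loss between two modalities.

    feats_a[i] and feats_b[i] are assumed to be features of the same
    time window seen by two sensors (a positive pair). Every other
    cross-modal pairing in the batch acts as a negative. Features from
    the same modality are never compared with each other.
    """
    if not feats_a or len(feats_a) != len(feats_b):
        raise ValueError("expected two non-empty batches of equal size")

    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)

    def one_direction(src, dst):
        total = 0.0
        for i in range(n):
            # Cosine similarity of sample i to every sample of the other modality.
            logits = [dot(src[i], dst[j]) / temperature for j in range(n)]
            # Numerically stable log-sum-exp for the softmax denominator.
            peak = max(logits)
            log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
            # Cross-entropy with the paired sample (index i) as the target.
            total += log_denom - logits[i]
        return total / n

    # Average the a->b and b->a directions so neither modality is privileged.
    return 0.5 * (one_direction(a, b) + one_direction(b, a))


if __name__ == "__main__":
    # Toy features standing in for, e.g., accelerometer and gyroscope encoders.
    accel = [[1.0, 0.0], [0.0, 1.0]]
    gyro_aligned = [[0.9, 0.1], [0.1, 0.9]]
    gyro_shuffled = [[0.1, 0.9], [0.9, 0.1]]
    print("aligned pairs:  ", cross_modal_contrastive_loss(accel, gyro_aligned))
    print("shuffled pairs: ", cross_modal_contrastive_loss(accel, gyro_shuffled))
```

In this sketch the loss is small when each sample is most similar to its own cross-modal partner and grows when the pairing is scrambled. In a real pre-training loop the feature lists would come from per-modality encoders, and this term would be combined with the pseudo-classification objective, whose exact form is not specified in the summary above.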

MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels

Wearable Sensor Based Multimodal Human Activity Recognition Exploiting the Diversity of Classifier Ensemble

MultiSense: Cross Labelling and Learning Human Activities Using Multimodal Sensing Data

Privacy-preserving Activity Recognition Using Multimodal Sensors in Smart Office

Poster: Cross Labelling and Learning Unknown Activities Among Multimodal Sensing Data

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

Functional classification of P22 Amber mutants

Multimodal emotion recognition based on audio and text by using hybrid attention networks

MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition

MECOM: A Meta-Completion Network for Fine-Grained Recognition With Incomplete Multi-Modalities

MuMu: Cooperative Multitask Learning-Based Guided Multimodal Fusion

An Effective Multimodal Representation and Fusion Method for Multimodal Intent Recognition

A Novel Deep Multifeature Extraction Framework Based on Attention Mechanism Using Wearable Sensor Data for Human Activity Recognition

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation

Modality aware contrastive learning for multimodal human activity recognition

A Multi-dimensional Parallel Convolutional Connected Network Based on Multi-source and Multi-modal Sensor Data for Human Activity Recognition

A Multidimensional Parallel Convolutional Connected Network Based on Multisource and Multimodal Sensor Data for Human Activity Recognition

DanHAR: Dual Attention Network for multimodal human activity recognition using wearable sensors

ATFA: Adversarial Time–Frequency Attention network for sensor-based multimodal human activity recognition

Context-aware mutual learning for semi-supervised human activity recognition using wearable sensors