MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels

Lilin Xu, Chaojie Gu, Rui Tan, Shibo He, Jiming Chen
2024-04-02
Abstract: Human activity recognition (HAR) will be an essential function of various emerging applications. However, HAR typically encounters challenges related to modality limitations and label scarcity, leading to an application gap between current solutions and real-world requirements. In this work, we propose MESEN, a multimodal-empowered unimodal sensing framework, to utilize unlabeled multimodal data available during the HAR model design phase for unimodal HAR enhancement during the deployment phase. From a study on the impact of supervised multimodal fusion on unimodal feature extraction, MESEN is designed to feature a multi-task mechanism during the multimodal-aided pre-training stage. With the proposed mechanism integrating cross-modal feature contrastive learning and multimodal pseudo-classification aligning, MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality. Subsequently, MESEN can adapt to downstream unimodal HAR with only a few labeled samples. Extensive experiments on eight public multimodal datasets demonstrate that MESEN achieves significant performance improvements over state-of-the-art baselines in enhancing unimodal HAR by exploiting multimodal data.
Machine Learning
What problem does this paper attempt to address?
The paper addresses two obstacles that real-world human activity recognition (HAR) applications face: modality limitations and label scarcity. Together they create a gap between current solutions and actual deployment needs. Annotation is costly and time-consuming, so only a few labeled samples are usually available, while unlabeled data is comparatively easy to collect. Moreover, although multimodal research is increasingly prominent, unimodal HAR remains the most typical form of deployment. The paper therefore proposes MESEN, a framework that uses unlabeled multimodal data at design time to build a unimodal HAR model that needs only a few labels and performs better across modalities.

MESEN addresses these problems in three ways:

1. **Multi-task mechanism**: In the multimodal-aided pre-training stage, MESEN combines cross-modal feature contrastive learning with multimodal pseudo-classification alignment to extract effective unimodal features from unlabeled multimodal data.
2. **Cross-modal feature contrastive learning**: This objective emphasizes the similarity between paired features from different modalities to capture inter-modal correlation, and it preserves the differences between modalities by leaving intra-modal comparisons out of the objective.
3. **Multimodal pseudo-classification alignment**: This objective exploits multimodal correlation in the representation space of the classification stage. The pseudo-classification task acts as a hint for the downstream recognition task and further improves the model's generalization.

Through these methods, MESEN makes effective use of unlabeled multimodal data and improves unimodal HAR performance when only a few labeled samples are available.
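The two pre-training objectives can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the symmetric InfoNCE-style contrastive loss, the symmetric KL divergence used as a stand-in for pseudo-classification alignment, the `temperature` and `align_weight` parameters, and all function names are assumptions chosen for illustration. The sketch only shows the general shape of a multi-task loss in which paired cross-modal features are pulled together, intra-modal pairs are never compared, and the pseudo-class predictions of the two modalities are encouraged to agree.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_contrastive_loss(feat_a, feat_b, temperature=0.1):
    """Symmetric InfoNCE-style loss over cross-modal pairs only (assumed form).

    Row i of feat_a and row i of feat_b come from the same sample (positive
    pair). All other cross-modal pairs act as negatives. Similarities within
    one modality are never computed, so intra-modal differences are ignored.
    """
    a = l2_normalize(feat_a)
    b = l2_normalize(feat_b)
    logits = a @ b.T / temperature          # shape (N, N), cross-modal only
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def pseudo_class_alignment_loss(logits_a, logits_b):
    """Symmetric KL between the two modalities' pseudo-class distributions.

    This is a hypothetical stand-in for the alignment objective; the paper's
    actual formulation may differ.
    """
    log_p = log_softmax(logits_a)
    log_q = log_softmax(logits_b)
    p, q = np.exp(log_p), np.exp(log_q)
    kl_pq = (p * (log_p - log_q)).sum(axis=1)
    kl_qp = (q * (log_q - log_p)).sum(axis=1)
    return 0.5 * (kl_pq + kl_qp).mean()


def multi_task_loss(feat_a, feat_b, logits_a, logits_b, align_weight=1.0):
    """Weighted sum of the two objectives (the weighting is an assumption)."""
    contrastive = cross_modal_contrastive_loss(feat_a, feat_b)
    alignment = pseudo_class_alignment_loss(logits_a, logits_b)
    return contrastive + align_weight * alignment


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical batch: 8 paired samples, 16-dim features, 5 pseudo-classes.
    imu_feat = rng.normal(size=(8, 16))
    skeleton_feat = imu_feat + 0.1 * rng.normal(size=(8, 16))
    imu_logits = rng.normal(size=(8, 5))
    skeleton_logits = imu_logits + 0.1 * rng.normal(size=(8, 5))
    print(multi_task_loss(imu_feat, skeleton_feat, imu_logits, skeleton_logits))
```

In a real pipeline the features and logits would come from per-modality encoders and pseudo-classification heads trained by gradient descent in an automatic-differentiation framework. After pre-training, a single modality's encoder would be kept and fine-tuned on the few labeled samples.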