Abstract: Human Activity Recognition (HAR) is a longstanding problem in AI with applications in a broad range of areas, including healthcare, sports and fitness, security, and more. The performance of HAR in real-world settings depends strongly on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, particularly in conjunction with foundation models, can today distinguish complex activities fairly reliably. Recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is harder to acquire. To alleviate the need for labeled data, we introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) in this work, which can be used with the proposed pre-training method MuJo (Multimodal Joint Feature Space Learning) to enhance HAR performance across various modalities. FiMAD was created using YouTube fitness videos and contains parallel video, language, pose, and simulated IMU sensor data. MuJo uses this dataset to learn a joint feature space for these modalities. We show that classifiers pre-trained on FiMAD can increase performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on MM-Fit, we achieve a macro F1-score of up to 0.855 when fine-tuning on only 2% of the training data and 0.942 when using the full training set for classification tasks. We compare our approach to other self-supervised ones and show that, unlike them, ours consistently improves on the baseline network performance and provides better data efficiency.

Modaldrop: Modality-Aware Regularization for Temporal-Spectral Fusion in Human Activity Recognition

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

Hierarchical Multi-View Aggregation Network for Sensor-Based Human Activity Recognition

X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition

Adaptive Multimodal Fusion Framework for Activity Monitoring of People with Mobility Disability

ATFA: Adversarial Time–Frequency Attention network for sensor-based multimodal human activity recognition

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Skeleton Focused Human Activity Recognition in RGB Video

Multi-channel Time Series Decomposition Network For Generalizable Sensor-Based Activity Recognition

Multiscale knowledge distillation with attention based fusion for robust human activity recognition

A Novel Two Stream Decision Level Fusion of Vision and Inertial Sensors Data for Automatic Multimodal Human Activity Recognition System

Multi-Sensor Data Fusion for Accurate Human Activity Recognition with Deep Learning

MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

DanHAR: Dual Attention Network for multimodal human activity recognition using wearable sensors

Modality Consistency-Guided Contrastive Learning for Wearable-Based Human Activity Recognition

A Semisupervised Recurrent Convolutional Attention Model for Human Activity Recognition

A Multimodal Fusion Approach for Human Activity Recognition

Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition