Vi2ACT: Video-enhanced Cross-modal Co-learning with Representation Conditional Discriminator for Few-shot Human Activity Recognition

Kang Xia, Wenzhong Li, Yimiao Shao, Sanglu Lu
DOI: https://doi.org/10.1145/3664647.3681664
2024-01-01
Abstract: Human Activity Recognition (HAR) is an emerging research field that has attracted widespread academic attention because of its practical applications in areas such as healthcare, environmental monitoring, and sports training. Because annotating sensor data is costly, many unsupervised and semi-supervised methods have been applied to HAR to alleviate the problem of limited data. In this paper, we propose Vi2ACT, a novel video-enhanced cross-modal collaborative learning method for few-shot HAR. We introduce a new data augmentation approach that uses a text-to-video generation model to generate class-related videos. A large quantity of video semantic representations is then obtained by fine-tuning the video encoder for cross-modal co-learning. Furthermore, to effectively align video semantic representations with time series representations, we enhance HAR at the representation level using conditional Generative Adversarial Nets (cGAN). We design a novel Representation Conditional Discriminator that is trained to distinguish, as accurately as possible, samples originating from video representations from those generated by the time series encoder. We conduct extensive experiments on four commonly used HAR datasets. The experimental results demonstrate that our method outperforms the baseline models in all few-shot scenarios.
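The representation-level adversarial alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the discriminator architecture, the conditioning signal, or the loss. The sketch assumes the discriminator is conditioned on the activity class label, uses the standard cGAN binary cross-entropy objective, and replaces the network with a single linear layer. All names (`LinearConditionalDiscriminator`, `discriminator_loss`, `encoder_adversarial_loss`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def one_hot(labels, num_classes):
    # Integer class labels -> one-hot rows.
    return np.eye(num_classes)[labels]


class LinearConditionalDiscriminator:
    """Hypothetical stand-in: one linear layer over [representation ; one-hot label].

    Assumption: the condition is the class label; the paper may condition differently.
    """

    def __init__(self, rep_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.num_classes = num_classes
        self.w = rng.normal(scale=0.1, size=rep_dim + num_classes)
        self.b = 0.0

    def __call__(self, reps, labels):
        # Returns, per sample, the estimated probability that the
        # representation came from the video encoder.
        x = np.concatenate([reps, one_hot(labels, self.num_classes)], axis=1)
        return sigmoid(x @ self.w + self.b)


def discriminator_loss(d, video_reps, ts_reps, labels, eps=1e-8):
    # Standard cGAN discriminator loss: video representations are treated
    # as "real", time series encoder outputs as "generated".
    real = d(video_reps, labels)
    fake = d(ts_reps, labels)
    return -np.mean(np.log(real + eps)) - np.mean(np.log(1.0 - fake + eps))


def encoder_adversarial_loss(d, ts_reps, labels, eps=1e-8):
    # Non-saturating generator loss: the time series encoder is rewarded
    # when its representations are judged to be video representations.
    fake = d(ts_reps, labels)
    return -np.mean(np.log(fake + eps))
```

In a training loop under these assumptions, the two losses would be minimized alternately: the discriminator on `discriminator_loss`, and the time series encoder on `encoder_adversarial_loss` alongside its classification loss, which pulls time series representations toward the video representations of the same class.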