PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision

Arnav M. Das, Chi Ian Tang, Fahim Kawsar, Mohammad Malekzadeh
2024-11-23
Abstract: Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions. For video or text modalities, the "pretrain and adapt" approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and (2) open-source pretrained models that generalize across datasets are rarely publicly available. In this paper, we aim to address the first issue by proposing PRIMUS, a method for PRetraining IMU encoderS. We conduct a systematic and unified evaluation of various self-supervised and multimodal learning pretraining objectives. Our findings indicate that using PRIMUS, which combines self-supervision, multimodal supervision, and nearest-neighbor supervision, can significantly enhance downstream performance. With fewer than 500 labeled samples per class, PRIMUS effectively enhances downstream performance by up to 15% in held-out test data, compared to the state-of-the-art multimodal training method. To benefit the broader community, our code and pre-trained IMU encoders will be made publicly available at http://github.com/nokia-bell-labs upon publication.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty of applying the "pretrain and adapt" approach to inertial measurement unit (IMU) data when labeled data is scarce. It identifies two obstacles:

1. **Poorly understood pre-training methods**: Pre-training for IMU data has received relatively little study, particularly regarding how to make effective use of large amounts of unlabeled or weakly labeled data.
2. **Lack of general pre-trained models**: Open-source pre-trained IMU models that generalize across datasets are rare.

In response, PRIMUS proposes a multi-objective pre-training strategy that combines self-supervised learning (SS), multimodal supervision (MM), and nearest-neighbor supervision (NN). In the formulas below, \(B\) is a batch of \(n\) samples, \(m_{i}\), \(v_{i}\), and \(t_{i}\) are the IMU, video, and text inputs of sample \(i\), \(I\), \(V\), and \(T\) are the corresponding encoders, \(h\) is a perturbation applied to the IMU signal, and \(\tau\) is a temperature. Each term is a contrastive loss, written here in the standard negative log-likelihood form.

- **Self-supervised learning (SS)**: Makes the IMU encoder invariant to noise, so that it remains robust to small changes in sensor position or type.

  \[ L_{SS}(B)=-\sum_{i = 1}^{n}\log\frac{\exp\left(\frac{I(m_{i})\cdot I(h(m_{i}))}{\tau}\right)}{\sum_{k = 1}^{n}\exp\left(\frac{I(m_{i})\cdot I(h(m_{k}))}{\tau}\right)} \]

- **Multimodal supervision (MM)**: Pushes the IMU representation towards alignment with the text and video representations, so that the IMU encoder can learn the rich semantic information available in those modalities.

  \[ L_{m2v}(B)=-\sum_{i = 1}^{n}\log\frac{\exp\left(\frac{I(m_{i})\cdot V(v_{i})}{\tau}\right)}{\sum_{j = 1}^{n}\exp\left(\frac{I(m_{i})\cdot V(v_{j})}{\tau}\right)} \]

  \[ L_{m2t}(B)=-\sum_{i = 1}^{n}\log\frac{\exp\left(\frac{I(m_{i})\cdot T(t_{i})}{\tau}\right)}{\sum_{j = 1}^{n}\exp\left(\frac{I(m_{i})\cdot T(t_{j})}{\tau}\right)} \]

  \[ L_{MM}(B)=L_{m2v}(B)+L_{m2t}(B) \]

- **Nearest-neighbor supervision (NN)**: Increases the diversity of supervision through a nearest-neighbor retrieval mechanism, using natural similarity in the data for more flexible contrastive learning. Here \(\eta(i)\) is the index of the nearest neighbor retrieved for sample \(i\), and \(z^{\text{mod}}_{\eta(i)}\) is that neighbor's embedding in modality \(\text{mod}\).

  \[ L_{NN}(B)=-\sum_{\text{mod}\in\{m, v, t\}}\sum_{i = 1}^{n}\log\frac{\exp\left(\frac{I(m_{i})\cdot z^{\text{mod}}_{\eta(i)}}{\tau}\right)}{\sum_{j = 1}^{n}\exp\left(\frac{I(m_{i})\cdot z^{\text{mod}}_{\eta(j)}}{\tau}\right)} \]

The final multi-objective loss is a weighted sum of the three terms:

\[ L(B)=\alpha L_{SS}(B)+\beta L_{MM}(B)+\gamma L_{NN}(B) \]

With this objective, PRIMUS can significantly improve downstream performance when only a small amount of labeled data is available, particularly in few-shot settings.
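The three loss terms share one contrastive template, so they can be sketched with a single helper. The following is a minimal pure-Python illustration, not the authors' implementation. It assumes the standard InfoNCE form (negative log of the softmax ratio), embeddings that have already been computed and normalized, and a nearest-neighbor lookup by dot product in a queue of stored IMU embeddings. The names `info_nce`, `nearest_neighbor`, `primus_loss`, and `queues`, and the default weights and temperature, are hypothetical choices made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, tau=0.1):
    """Sum over i of -log( exp(q_i.k_i/tau) / sum_j exp(q_i.k_j/tau) ).

    The positive for query i is keys[i]; all other keys act as negatives.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total


def nearest_neighbor(query, queue):
    """Index of the queue entry most similar to `query` (dot product)."""
    return max(range(len(queue)), key=lambda j: dot(query, queue[j]))


def primus_loss(imu, imu_aug, video, text, queues,
                alpha=1.0, beta=1.0, gamma=1.0, tau=0.1):
    """Weighted sum alpha*L_SS + beta*L_MM + gamma*L_NN for one batch.

    imu, imu_aug, video, text: lists of embedding vectors, aligned by index.
    queues: dict with keys "m", "v", "t" holding stored embeddings per
            modality, aligned by index (an assumption of this sketch).
    """
    # SS: each IMU embedding should match its own perturbed view.
    l_ss = info_nce(imu, imu_aug, tau)
    # MM: each IMU embedding should match its paired video and text.
    l_mm = info_nce(imu, video, tau) + info_nce(imu, text, tau)
    # NN: retrieve one neighbor index per sample from the IMU queue
    # (assumed retrieval rule), then use that neighbor's embedding in
    # every modality as the positive.
    eta = [nearest_neighbor(z, queues["m"]) for z in imu]
    l_nn = sum(
        info_nce(imu, [queues[mod][j] for j in eta], tau)
        for mod in ("m", "v", "t")
    )
    return alpha * l_ss + beta * l_mm + gamma * l_nn


if __name__ == "__main__":
    # Toy batch of two orthonormal embeddings shared across modalities.
    e = [[1.0, 0.0], [0.0, 1.0]]
    print(primus_loss(e, e, e, e, {"m": e, "v": e, "t": e}, tau=1.0))
```

In a real training loop the embeddings would come from trainable encoders and the loss would be computed with an automatic-differentiation framework; the list-based version here only makes the structure of the objective explicit.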