Abstract: Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions. For video or text modalities, the "pretrain and adapt" approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and (2) open-source pretrained models that generalize across datasets are rarely publicly available. In this paper, we aim to address the first issue by proposing PRIMUS, a method for PRetraining IMU encoderS. We conduct a systematic and unified evaluation of various self-supervised and multimodal learning pretraining objectives. Our findings indicate that using PRIMUS, which combines self-supervision, multimodal supervision, and nearest-neighbor supervision, can significantly enhance downstream performance. With fewer than 500 labeled samples per class, PRIMUS effectively enhances downstream performance by up to 15% in held-out test data, compared to the state-of-the-art multimodal training method. To benefit the broader community, our code and pre-trained IMU encoders will be made publicly available at http://github.com/nokia-bell-labs upon publication.

What problem does this paper attempt to address?

The paper addresses the difficulty of applying the "pretrain and adapt" approach to inertial measurement unit (IMU) data when labeled data is scarce. Specifically, it targets two gaps by proposing a method named PRIMUS:

1. **Lack of pre-training methods**: Pre-training methods for IMU data are little studied, in particular how to make effective use of large amounts of unlabeled or weakly labeled data.
2. **Lack of general pre-trained models**: Open-source pre-trained IMU models that generalize across datasets are rare.

To address these gaps, PRIMUS proposes a multi-objective pre-training strategy that combines self-supervised learning (SS), multimodal supervision (MM), and nearest-neighbor supervision (NN). Each term plays a distinct role:

- **Self-supervised learning (SS)**: Makes the IMU encoder invariant to noise, so that it remains robust to small changes in sensor position or type.

  \[ L_{SS}(B)=\sum_{i = 1}^{n}\frac{\exp\left(\frac{I(m_{i})\cdot I(h(m_{i}))}{\tau}\right)}{\sum_{k = 1}^{n}\exp\left(\frac{I(m_{i})\cdot I(h(m_{k}))}{\tau}\right)} \]

- **Multimodal supervision (MM)**: Pushes the IMU representation towards alignment with the text and video representations, so that the IMU encoder can learn the rich semantic information carried by the other modalities.

  \[ L_{m2v}(B)=\sum_{i = 1}^{n}\frac{\exp\left(\frac{I(m_{i})\cdot V(v_{i})}{\tau}\right)}{\sum_{j = 1}^{n}\exp\left(\frac{I(m_{i})\cdot V(v_{j})}{\tau}\right)} \]

  \[ L_{m2t}(B)=\sum_{i = 1}^{n}\frac{\exp\left(\frac{I(m_{i})\cdot T(t_{i})}{\tau}\right)}{\sum_{j = 1}^{n}\exp\left(\frac{I(m_{i})\cdot T(t_{j})}{\tau}\right)} \]

  where \(L_{MM}(B)=L_{m2v}(B)+L_{m2t}(B)\)

- **Nearest-neighbor supervision (NN)**: Increases the diversity of supervision through a nearest-neighbor retrieval mechanism, using natural similarity in the data for more flexible contrastive learning.

  \[ L_{NN}(B)=\sum_{\text{mod}\in\{m, v, t\}}\sum_{i = 1}^{n}\frac{\exp\left(\frac{I(m_{i})\cdot z^{\text{mod}}_{\eta(i)}}{\tau}\right)}{\sum_{j = 1}^{n}\exp\left(\frac{I(m_{i})\cdot z^{\text{mod}}_{\eta(j)}}{\tau}\right)} \]

Throughout, \(B\) is a batch of \(n\) samples; \(m_{i}\), \(v_{i}\), and \(t_{i}\) are the IMU, video, and text data of the \(i\)-th sample; \(I\), \(V\), and \(T\) are the IMU, video, and text encoders; \(h\) is a perturbation applied to the IMU signal; \(\tau\) is a temperature; and \(z^{\text{mod}}_{\eta(i)}\) is the embedding, in modality \(\text{mod}\), of the nearest neighbor retrieved for sample \(i\).

The final multi-objective loss function is:

\[ L(B)=\alpha L_{SS}(B)+\beta L_{MM}(B)+\gamma L_{NN}(B) \]

where \(\alpha\), \(\beta\), and \(\gamma\) weight the three terms. In this way, PRIMUS can significantly improve downstream performance with only a small amount of labeled data, especially in few-shot settings.
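The three loss terms above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the formulas above print only the softmax ratio, while the sketch applies the negative log of that ratio (the usual InfoNCE form); embeddings are L2-normalized before the dot product; and the neighbor index \(\eta(i)\) is found by cosine similarity between the IMU embedding and a hypothetical `bank` of cached embeddings stored per modality and aligned by row. The function names, the `bank` layout, and the default weights are all illustrative.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (assumption: cosine-similarity logits)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(anchors, targets, tau=0.1):
    """Sum over i of -log softmax(anchors[i] . targets[j] / tau) at j = i.

    The -log is an assumption (standard InfoNCE); row i of `targets` is the
    positive for row i of `anchors`, all other rows act as negatives.
    """
    logits = l2_normalize(anchors) @ l2_normalize(targets).T / tau  # (n, n)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.sum(np.diag(log_prob))


def nearest_neighbors(queries, bank_rows):
    """Index of the most cosine-similar bank row for each query (eta(i))."""
    sims = l2_normalize(queries) @ l2_normalize(bank_rows).T
    return sims.argmax(axis=1)


def primus_style_loss(z_imu, z_imu_aug, z_video, z_text, bank,
                      alpha=1.0, beta=1.0, gamma=1.0, tau=0.1):
    """alpha * L_SS + beta * L_MM + gamma * L_NN for one batch of embeddings.

    `bank` is a hypothetical dict with keys "imu", "video", "text", each an
    (N, d) array of cached embeddings whose rows refer to the same samples.
    """
    # SS: IMU embedding vs. embedding of the perturbed IMU signal.
    l_ss = info_nce(z_imu, z_imu_aug, tau)
    # MM: IMU-to-video plus IMU-to-text alignment.
    l_mm = info_nce(z_imu, z_video, tau) + info_nce(z_imu, z_text, tau)
    # NN: retrieve one neighbor per sample, then contrast against its
    # embedding in every modality.
    eta = nearest_neighbors(z_imu, bank["imu"])
    l_nn = sum(info_nce(z_imu, bank[mod][eta], tau)
               for mod in ("imu", "video", "text"))
    return alpha * l_ss + beta * l_mm + gamma * l_nn


# Usage with random stand-ins for encoder outputs (n = 8 samples, d = 16).
rng = np.random.default_rng(0)
n, d, bank_size = 8, 16, 32
z_imu = rng.normal(size=(n, d))
z_imu_aug = z_imu + 0.05 * rng.normal(size=(n, d))  # small perturbation
z_video = rng.normal(size=(n, d))
z_text = rng.normal(size=(n, d))
bank = {mod: rng.normal(size=(bank_size, d)) for mod in ("imu", "video", "text")}
total = primus_style_loss(z_imu, z_imu_aug, z_video, z_text, bank)
print(f"combined loss: {total:.3f}")
```

Reusing one `info_nce` helper for all terms reflects that the SS, MM, and NN formulas share the same structure and differ only in which embeddings serve as positives and negatives.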

PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision

IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing Applications

Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception

IMU Preintegrated Features for Efficient Deep Inertial Odometry

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

IMUOptimize: A Data-Driven Approach to Optimal IMU Placement for Human Pose Estimation with Transformer Architecture

Robot Learning with Sensorimotor Pre-training

Multimodal Autoregressive Pre-training of Large Vision Encoders

PTUM: Pre-training User Model from Unlabeled User Behaviors Via Self-supervision

Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Promoting cross-modal representations to improve multimodal foundation models for physiological signals

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

CROMOSim: A Deep Learning-based Cross-modality Inertial Measurement Simulator

Towards All-in-one Pre-training Via Maximizing Multi-modal Mutual Information

An Examination of Wearable Sensors and Video Data Capture for Human Exercise Classification

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations

SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation

Contrastive Left-Right Wearable Sensors (IMUs) Consistency Matching for HAR