Self-supervised Representation Learning Framework for Remote Physiological Measurement Using Spatiotemporal Augmentation Loss

Hao Wang, Euijoon Ahn, Jinman Kim
DOI: https://doi.org/10.48550/arXiv.2107.07695
2021-12-14
Abstract: Recent advances in supervised deep learning methods are enabling remote measurements of photoplethysmography-based physiological signals using facial videos. The performance of these supervised methods, however, is dependent on the availability of large labelled datasets. Contrastive learning as a self-supervised method has recently achieved state-of-the-art performance in learning representative data features by maximising mutual information between different augmented views. However, existing data augmentation techniques for contrastive learning are not designed to learn physiological signals from videos and often fail when there is complicated noise and subtle, periodic colour or shape variation between video frames. To address these problems, we present a novel self-supervised spatiotemporal learning framework for remote physiological signal representation learning, where there is a lack of labelled training data. Firstly, we propose a landmark-based spatial augmentation that splits the face into several informative parts based on the Shafer dichromatic reflection model to characterise subtle skin colour fluctuations. We also formulate a sparsity-based temporal augmentation exploiting the Nyquist-Shannon sampling theorem to effectively capture periodic temporal changes by modelling physiological signal features. Furthermore, we introduce a constrained spatiotemporal loss which generates pseudo-labels for augmented video clips. It is used to regulate the training process and handle complicated noise. We evaluated our framework on 3 public datasets, demonstrated superior performance over other self-supervised methods, and achieved competitive accuracy compared to the state-of-the-art supervised methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to accurately extract physiological signals (such as heart rate, respiratory rate, etc.) from facial videos with a self-supervised learning framework when large-scale labeled data are unavailable. Specifically, the authors propose a new self-supervised spatiotemporal learning framework (SLF-RPM) that aims to remove the reliance of existing methods on large amounts of labeled data and to cope with complex noise and subtle color/shape changes.

### Main problems and background of the paper

1. **Limitations of existing methods**:
   - **Supervised learning methods**: Although supervised deep learning methods have made significant progress in remote photoplethysmography (rPPG) physiological signal measurement, they rely on large-scale labeled data. Obtaining such data is expensive and time-consuming, and it requires medical equipment.
   - **Standard data augmentation techniques**: Existing contrastive learning methods use standard data augmentation techniques (such as frame cropping, resizing, and color jittering), but these techniques are not designed for extracting physiological signals and perform poorly when faced with complex noise and subtle color fluctuations.
2. **Research motivation**:
   - To overcome the above limitations, the authors propose a new self-supervised spatiotemporal learning framework (SLF-RPM), which can learn physiological signal features without large-scale labeled data.
   - The framework captures subtle color changes in facial videos through landmark-based spatial augmentation and sparsity-based temporal augmentation, and it improves the robustness and accuracy of the model by constraining training with pseudo-labels.

### Overview of the solution

- **Landmark-based spatial augmentation**: Following Shafer's dichromatic reflection model, the face is divided into multiple information-rich regions to capture subtle fluctuations in skin color.
- **Sparsity-based temporal augmentation**: Guided by the Nyquist-Shannon sampling theorem, the video is temporally augmented with different step sizes to effectively capture periodic color changes.
- **Pseudo-label-constrained spatiotemporal loss function**: Pseudo-labels are generated for the augmented clips and used in an auxiliary classification task that regulates the contrastive learning process and helps deal with complex noise.

### Experimental results

The authors evaluated the proposed framework on three public datasets:

- **Linear classification evaluation**: Compared with existing self-supervised learning methods, SLF-RPM shows the best performance on all datasets.
- **Transfer learning evaluation**: With pre-training followed by fine-tuning, SLF-RPM significantly improves the accuracy of heart rate estimation and adapts particularly well to limited datasets.
- **Ablation experiment**: The ablation study confirms the effectiveness of the landmark-based spatial augmentation and the sparsity-based temporal augmentation, indicating that these strategies are significantly better than standard data augmentation techniques.

In conclusion, this paper proposes a self-supervised learning framework that addresses the lack of labeled data in remote physiological signal measurement and performs strongly on multiple datasets.
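
The sparsity-based temporal augmentation and its pseudo-labels, as summarized in the overview, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `max_stride` and `strided_clips`, the 30 fps frame rate, the 3 Hz upper bound on the pulse frequency, and the choice of the stride index as the pseudo-label are all assumptions made for illustration. The idea shown is that sampling every `stride`-th frame lowers the effective frame rate to `fps / stride`, and the Nyquist-Shannon theorem requires that rate to stay above twice the highest signal frequency.

```python
import numpy as np


def max_stride(fps: float, max_signal_hz: float) -> int:
    """Largest integer stride with fps / stride strictly above 2 * max_signal_hz.

    Both arguments are illustrative assumptions, not values from the paper.
    """
    limit = fps / (2.0 * max_signal_hz)  # strides must stay strictly below this
    return max(int(np.ceil(limit)) - 1, 1)


def strided_clips(video: np.ndarray, clip_len: int, strides, start: int = 0):
    """Sample one clip per stride and attach a pseudo-label to each clip.

    video has shape (frames, height, width, channels). The pseudo-label is the
    position of the stride in `strides` (a hypothetical labelling choice).
    """
    clips, labels = [], []
    for label, stride in enumerate(strides):
        # Frame indices start, start + stride, start + 2 * stride, ...
        idx = start + stride * np.arange(clip_len)
        if idx[-1] >= video.shape[0]:
            raise ValueError(f"video too short for stride {stride}")
        clips.append(video[idx])
        labels.append(label)
    return np.stack(clips), np.array(labels)
```

Under these assumed values, `max_stride(30.0, 3.0)` allows strides 1 to 4, so a 64-frame video yields four 8-frame clips that share a starting frame but cover different time spans. An auxiliary classifier could then be trained to predict the pseudo-label from each clip.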