STFMRPPGFormer: Vision Transformer Network Based on MLP-3D and STFM for Remote Heart Rate Measurement
Yuheng Zhou, Xinhua Liu, Jiaxuan Zuo, Xiaolin Ma, Hailan Kuang
DOI: https://doi.org/10.1109/itnec60942.2024.10733011
2024-01-01
Abstract: Remote photoplethysmography (rPPG), which analyzes changes in skin color in facial videos, offers a promising approach to non-contact heart rate estimation with broad application prospects. Current methods for this task fall mainly into traditional approaches, convolutional neural network (CNN)-based methods, and transformer-based methods. Present-day transformer frameworks typically employ multi-head self-attention to model global context, which incurs significant computational overhead as image resolution increases. While current convolutional designs offer an alternative, they lack long-range dependency modeling. To address this issue, this paper proposes STFMRPPGFormer, a vision transformer model for rPPG signal extraction based on Long-Short Distance Attention (LSDA), Spatio-Temporal Focal Modulation (STFM), and MLP-3D. We replace traditional multi-head attention mechanisms with STFM, LSDA, and MLP-3D to reduce the computational burden and provide more informative features for the network to learn, enabling better contextual modeling and more accurate rPPG signal acquisition. In experiments on the UBFC-rPPG and PURE datasets, our model achieves excellent mean absolute error (MAE) and root mean square error (RMSE) values and the highest Pearson correlation coefficient (r). Our approach yields an MAE of 0.39 bpm, an RMSE of 0.66 bpm, and r = 1.00 on the PURE dataset, and an MAE of 0.53 bpm, an RMSE of 1.31 bpm, and r = 1.00 on the UBFC-rPPG dataset, demonstrating the effectiveness of our method.
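
For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how MAE, RMSE, and the Pearson correlation coefficient r are conventionally computed from predicted and reference heart rates. The function name `hr_metrics` and the sample values are hypothetical illustrations, not taken from the paper, and the paper's own evaluation protocol (windowing, per-video averaging) may differ.

```python
import numpy as np


def hr_metrics(hr_pred, hr_true):
    """Return (MAE, RMSE, Pearson r) between predicted and reference heart rates in bpm.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    hr_pred = np.asarray(hr_pred, dtype=float)
    hr_true = np.asarray(hr_true, dtype=float)
    err = hr_pred - hr_true
    mae = float(np.mean(np.abs(err)))          # mean absolute error
    rmse = float(np.sqrt(np.mean(err ** 2)))   # root mean square error
    # Off-diagonal entry of the 2x2 correlation matrix is Pearson r.
    r = float(np.corrcoef(hr_pred, hr_true)[0, 1])
    return mae, rmse, r


# Made-up per-video heart rates (bpm), used only to exercise the function.
pred = [72.0, 80.5, 65.0, 90.0]
true = [71.0, 81.0, 66.0, 92.0]
mae, rmse, r = hr_metrics(pred, true)
print(f"MAE={mae:.3f} bpm, RMSE={rmse:.3f} bpm, r={r:.3f}")
```

Lower MAE and RMSE indicate smaller errors in bpm, while r close to 1 indicates that predictions track the reference values linearly across samples.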