Abstract: BACKGROUND: A daily activity routine is vital for overall health and well-being, supporting physical and mental fitness. Consistent physical activity is linked to a multitude of benefits for the body, mind, and emotions, and plays a key role in fostering a healthy lifestyle. Wearable devices have become essential in health and fitness because they make it possible to monitor daily activities. While convolutional neural networks (CNNs) have proven effective, challenges remain in adapting quickly to a wide variety of activities.
OBJECTIVE: This study aimed to develop a model that integrates transformer models with multi-head attention for precise human activity recognition from wearable-device data, with the goal of improving health monitoring.
METHODS: The Human Activity Recognition (HAR) algorithm uses deep learning to classify human activities from spectrogram data. It uses a pretrained convolutional neural network (CNN), MobileNetV2, to extract features, and a dense residual transformer network (DRTN) with a multi-head multi-level attention architecture (MH-MLA) to capture time-related patterns. The model then blends information from both layers through an adaptive attention mechanism and uses a SoftMax function to provide classification probabilities for the various human activities.
RESULTS: The integrated approach, which combines a pretrained CNN with transformer models to recognize human activities from spectrogram data, outperformed the compared methods on three datasets, reaching accuracies of 92.81%, 97.98%, and 95.32% on HARTH, KU-HAR, and HuGaDB, respectively. This suggests that integrating diverse methodologies yields good results in capturing nuanced human activities. The comparison analysis showed that the integrated system consistently performed better on dynamic human activity recognition datasets.
CONCLUSION: Maintaining a routine of daily activities is crucial for overall health and well-being. Regular physical activity contributes substantially to a healthy lifestyle, benefiting both the body and the mind. The integration of wearable devices has simplified the monitoring of daily routines. This research introduces an approach to human activity recognition that combines a pretrained CNN with a dense residual transformer network (DRTN) and multi-head multi-level attention (MH-MLA) within the transformer architecture to enhance recognition capability.
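
The abstract describes the pipeline only at a high level: feature extraction, multi-head attention over time, adaptive blending, and SoftMax classification. A minimal pure-Python sketch of the last three stages may help make the data flow concrete. It is not the authors' implementation. All function names, the toy feature vectors, the fixed gate scores, and the class weights are illustrative assumptions, and the attention heads here use no learned projections, which a trained model would have.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    )
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def multi_head_attention(sequence, num_heads):
    """Self-attention per head on slices of each vector, then concatenation.

    Assumption: heads are plain slices of the input (no learned projections).
    """
    dim = len(sequence[0])
    if dim % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    head_dim = dim // num_heads
    output = [[] for _ in sequence]
    for h in range(num_heads):
        head = [vec[h * head_dim:(h + 1) * head_dim] for vec in sequence]
        for t, query in enumerate(head):
            output[t].extend(attention(query, head, head))
    return output


def mean_pool(sequence):
    """Average a sequence of vectors over the time axis."""
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


def adaptive_fusion(branch_a, branch_b, gate_a, gate_b):
    """Blend two vectors with softmax-normalised gate scores.

    Assumption: gates are given scalars; a real model would learn them.
    """
    w_a, w_b = softmax([gate_a, gate_b])
    return [w_a * a + w_b * b for a, b in zip(branch_a, branch_b)]


def classify(features, class_weights):
    """Linear layer followed by softmax, giving class probabilities."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in class_weights]
    return softmax(logits)


# Toy stand-in for CNN features of three spectrogram frames (hypothetical).
frames = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [0.5, 0.5, 1.0, 0.0]]
attended = multi_head_attention(frames, num_heads=2)
fused = adaptive_fusion(mean_pool(frames), mean_pool(attended), 0.2, 0.8)

# Hypothetical weights for three activity classes.
class_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
probs = classify(fused, class_weights)
print(probs)
```

The printed list holds one probability per activity class and sums to 1. Swapping the slice-based heads for learned query, key, and value projections would bring the sketch closer to a standard transformer layer.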
