Abstract:Human activity recognition is one of the most important tasks in computer vision and has proved useful in different fields such as healthcare, sports training and security. There are a number of approaches that have been explored to solve this task, some of them involving sensor data, and some involving video data. In this paper, we aim to explore two deep learning-based approaches, namely single frame Convolutional Neural Networks (CNNs) and convolutional Long Short-Term Memory to recognise human actions from videos. Using a convolutional neural networks-based method is advantageous as CNNs can extract features automatically and Long Short-Term Memory networks are great when it comes to working on sequence data such as video. The two models were trained and evaluated on a benchmark action recognition dataset, UCF50, and another dataset that was created for the experimentation. Though both models exhibit good accuracies, the single frame CNN model outperforms the Convolutional LSTM model by having an accuracy of 99.8% with the UCF50 dataset.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is **Human Activity Recognition (HAR)**, especially how to accurately recognize human actions from video data. Specifically, the author explored two deep - learning - based methods: Single Frame Convolutional Neural Networks (CNN) and Convolutional Long - Short - Term Memory (ConvLSTM) to achieve the recognition of human actions in videos. ### Problem Background Human Activity Recognition is a very important task in the field of computer vision and is widely used in many fields such as healthcare, sports training, entertainment, robotics, and security management. Traditional HAR methods include sensor - based data and video - based data. Sensor - based methods require the detected object to wear a wearable device, which may be difficult to achieve in some scenarios (such as the home environment). Video - based methods, on the other hand, capture videos through cameras and analyze the actions in them, which are more suitable for scenarios where wearing sensors is not appropriate. ### Research Objectives The main objectives of this paper are: 1. **Explore two deep - learning methods**: Single - frame CNN and ConvLSTM for recognizing human activities from videos. 2. **Evaluate the performance of these two models**: Use two different datasets for training and testing, namely the benchmark dataset UCF50 and the dataset created by the author himself. 3. **Compare the performance of the two models**: Determine which model performs better in the human activity recognition task through the test accuracy. ### Datasets - **UCF50**: This is a standard action recognition dataset that contains 50 different categories of actions, with an average of 133 video clips per category. These videos are from YouTube and have complex backgrounds and diverse shooting conditions. - **Self - built dataset**: The author created a dataset containing three basic activities (walking, sitting, jumping), mainly for videos of Indian faces, with 21 video clips in each category and a duration of 3 to 15 seconds. ### Method Overview - **Single - frame CNN**: This method classifies each frame in the video and then takes the average of the probability results of all frames to obtain the action category of the entire video. CNN can automatically extract the spatial features of the image and classify them through multiple convolutional layers, pooling layers, and fully - connected layers. - **ConvLSTM**: This method combines the spatial feature extraction ability of CNN and the time - series processing ability of LSTM and can capture both the spatial and temporal features in the video. ConvLSTM is particularly suitable for processing data with spatio - temporal structures, such as videos. ### Results The experimental results show that the single - frame CNN model outperforms the ConvLSTM model on the UCF50 dataset, achieving an accuracy of 99.8%. And on the self - built dataset by the author, although the sample size is small, the single - frame CNN still performs better. This indicates that for simple action recognition tasks, the single - frame CNN may be effective enough, while ConvLSTM is more suitable for processing more complex time - series data. ### Summary This paper aims to explore the performance of two deep - learning methods in the human activity recognition task by comparing them and provides a valuable reference for future HAR research.

Human activity recognition using deep learning approaches and single frame cnn and convolutional lstm

Human Action Recognition From Digital Videos Based on Deep Learning.

Human Action Recognition Using Deep Learning Methods.

A human activity recognition framework in videos using segmented human subject focus

Human action recognition using attention based LSTM network with dilated CNN features

LSTM-Based Neural Network to Recognize Human Activities Using Deep Learning Techniques

A recent survey for human activity recoginition based on deep learning approach

A resource conscious human action recognition framework using 26-layered deep convolutional neural network

Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

A Close Look into Human Activity Recognition Models using Deep Learning

Human Activity Recognition using Resnet-34 Model

Human Action Recognition in Videos using Convolution Long Short-Term Memory Network with Spatio-Temporal Networks

Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Human Activity Recognition Based on Deep-Temporal Learning Using Convolution Neural Networks Features and Bidirectional Gated Recurrent Unit With Features Selection

1-DCNN with Stacked LSTM Architecture for Human Activity Recognition Using Wearable Sensing Data

Human Activity Recognition Based On Video Summarization And Deep Convolutional Neural Network

An Efficient and Lightweight Deep Learning Model for Human Activity Recognition Using Smartphones

Human Activity Recognition with Convolutional Neural Netowrks

Multi-view Multi-modal Approach Based on 5S-CNN and BiLSTM Using Skeleton, Depth and RGB Data for Human Activity Recognition

A Novel Multi-Stage Training Approach for Human Activity Recognition From Multimodal Wearable Sensor Data Using Deep Neural Network