Human activity recognition using deep learning approaches and single frame cnn and convolutional lstm

Sheryl Mathew,Annapoorani Subramanian,Pooja,Balamurugan MS,Manoj Kumar Rajagopal
2023-04-18
Abstract:Human activity recognition is one of the most important tasks in computer vision and has proved useful in different fields such as healthcare, sports training and security. There are a number of approaches that have been explored to solve this task, some of them involving sensor data, and some involving video data. In this paper, we aim to explore two deep learning-based approaches, namely single frame Convolutional Neural Networks (CNNs) and convolutional Long Short-Term Memory to recognise human actions from videos. Using a convolutional neural networks-based method is advantageous as CNNs can extract features automatically and Long Short-Term Memory networks are great when it comes to working on sequence data such as video. The two models were trained and evaluated on a benchmark action recognition dataset, UCF50, and another dataset that was created for the experimentation. Though both models exhibit good accuracies, the single frame CNN model outperforms the Convolutional LSTM model by having an accuracy of 99.8% with the UCF50 dataset.
Computer Vision and Pattern Recognition,Image and Video Processing
What problem does this paper attempt to address?
The problem that this paper attempts to solve is **Human Activity Recognition (HAR)**, especially how to accurately recognize human actions from video data. Specifically, the author explored two deep - learning - based methods: Single Frame Convolutional Neural Networks (CNN) and Convolutional Long - Short - Term Memory (ConvLSTM) to achieve the recognition of human actions in videos. ### Problem Background Human Activity Recognition is a very important task in the field of computer vision and is widely used in many fields such as healthcare, sports training, entertainment, robotics, and security management. Traditional HAR methods include sensor - based data and video - based data. Sensor - based methods require the detected object to wear a wearable device, which may be difficult to achieve in some scenarios (such as the home environment). Video - based methods, on the other hand, capture videos through cameras and analyze the actions in them, which are more suitable for scenarios where wearing sensors is not appropriate. ### Research Objectives The main objectives of this paper are: 1. **Explore two deep - learning methods**: Single - frame CNN and ConvLSTM for recognizing human activities from videos. 2. **Evaluate the performance of these two models**: Use two different datasets for training and testing, namely the benchmark dataset UCF50 and the dataset created by the author himself. 3. **Compare the performance of the two models**: Determine which model performs better in the human activity recognition task through the test accuracy. ### Datasets - **UCF50**: This is a standard action recognition dataset that contains 50 different categories of actions, with an average of 133 video clips per category. These videos are from YouTube and have complex backgrounds and diverse shooting conditions. - **Self - built dataset**: The author created a dataset containing three basic activities (walking, sitting, jumping), mainly for videos of Indian faces, with 21 video clips in each category and a duration of 3 to 15 seconds. ### Method Overview - **Single - frame CNN**: This method classifies each frame in the video and then takes the average of the probability results of all frames to obtain the action category of the entire video. CNN can automatically extract the spatial features of the image and classify them through multiple convolutional layers, pooling layers, and fully - connected layers. - **ConvLSTM**: This method combines the spatial feature extraction ability of CNN and the time - series processing ability of LSTM and can capture both the spatial and temporal features in the video. ConvLSTM is particularly suitable for processing data with spatio - temporal structures, such as videos. ### Results The experimental results show that the single - frame CNN model outperforms the ConvLSTM model on the UCF50 dataset, achieving an accuracy of 99.8%. And on the self - built dataset by the author, although the sample size is small, the single - frame CNN still performs better. This indicates that for simple action recognition tasks, the single - frame CNN may be effective enough, while ConvLSTM is more suitable for processing more complex time - series data. ### Summary This paper aims to explore the performance of two deep - learning methods in the human activity recognition task by comparing them and provides a valuable reference for future HAR research.