Deep Learning for Activity Recognition Using Audio and Video

Francisco Reinolds, Cristiana Neto, José Machado
DOI: https://doi.org/10.3390/electronics11050782
IF: 2.9
2022-03-03
Electronics
Abstract: Neural networks have established themselves as powerful tools for many detection tasks, ranging from human activities to emotions. Of the available modalities, video analysis is the most popular and successful. However, other modalities, although used less often, remain promising. This article compares audio and video analysis for detecting violence in real-time streams. The study followed the CRISP-DM methodology and evaluated a diverse set of models available through PyTorch in order to achieve robust results. The results show why video analysis is so prevalent: video classification handily outperformed audio classification. The audio models attained 76% accuracy on average, whereas the video models averaged 89%, a significant difference in performance. The study concludes that the applied methods are promising for detecting violence using both audio and video.
engineering, electrical & electronic; computer science, information systems; physics, applied