End-to-End Bloody Video Recognition by Audio-Visual Feature Fusion.

Congcong Hou, Xiaoyu Wu, Ge Wang
DOI: https://doi.org/10.1007/978-3-030-03398-9_43
2018-01-01
Abstract: With the rapid development of Internet technology, the spread of bloody videos has become increasingly serious, causing great harm to society. This paper proposes a bloody video recognition method based on audio-visual feature fusion to compensate for the limitations of single-modality, vision-only methods. Because no open bloody video dataset is available, the paper first constructs a database of bloody videos using web crawlers and data augmentation. It then uses a CNN and an LSTM to extract spatiotemporal features from the visual channel, while audio channel features are extracted directly from the raw waveform by a 1D convolutional network. Finally, a neural network built around an audio-visual feature fusion layer performs early fusion of the multimodal cues. The proposed method achieves 95% accuracy on the bloody video test data. Experimental results on the self-built bloody video database demonstrate that the extracted audio-visual feature representations are effective and that the proposed multimodal fusion model achieves better, more discriminative recognition performance than the single-channel models.
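To make the pipeline in the abstract concrete, here is a minimal sketch of early (feature-level) fusion in plain Python. It is not the authors' network. Every size, name, and weight below is a hypothetical placeholder: the visual CNN+LSTM branch is replaced by a random stand-in vector, the audio branch is a single 1D convolution layer with global max pooling, and the classifier is one untrained dense layer.

```python
import math
import random


def conv1d(signal, kernel, stride=1):
    """Valid 1D cross-correlation of a waveform with a single kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(xs):
    return [max(0.0, x) for x in xs]


def audio_features(waveform, kernels, stride=4):
    """One feature per kernel: conv -> ReLU -> global max pooling."""
    return [max(relu(conv1d(waveform, k, stride))) for k in kernels]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(visual_feat, audio_feat, weights, bias):
    """Early fusion: concatenate both feature vectors, then dense + softmax."""
    fused = visual_feat + audio_feat  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return fused, softmax(logits)


# Hypothetical inputs; all dimensions are assumptions for illustration only.
rng = random.Random(0)
waveform = [rng.uniform(-1, 1) for _ in range(1600)]  # stand-in for raw audio
visual_feat = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for CNN+LSTM output
kernels = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(4)]

audio_feat = audio_features(waveform, kernels)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(12)] for _ in range(2)]
bias = [0.0, 0.0]

# probs holds two class probabilities (e.g. bloody vs. non-bloody), untrained.
fused, probs = fuse_and_classify(visual_feat, audio_feat, weights, bias)
```

The point of the sketch is the order of operations: each modality is reduced to a fixed-length feature vector first, and the classifier sees only their concatenation, which is what distinguishes early fusion from combining per-modality predictions afterwards.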