What Did I Just Hear? Detecting Pornographic Sounds in Adult Videos Using Neural Networks

Holy Lovenia, Dessi Puji Lestari, Rita Frieske
DOI: https://doi.org/10.1145/3561212.3561244
2022-09-08
Abstract: Audio-based pornographic detection enables efficient adult content filtering without sacrificing performance by exploiting distinct spectral characteristics. To improve it, we explore pornographic sound modeling based on different neural architectures and acoustic features. We find that CNN trained on log mel spectrogram achieves the best performance on Pornography-800 dataset. Our experiment results also show that log mel spectrogram allows better representations for the models to recognize pornographic sounds. Finally, to classify whole audio waveforms rather than segments, we employ voting segment-to-audio technique that yields the best audio-level detection results.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the detection of pornographic sounds in adult videos with neural networks, with the goal of efficient and accurate adult content filtering. Specifically, the authors explore pornographic sound modeling based on different neural network architectures and acoustic features, aiming to improve detection performance at both the segment level and the audio level.
### Main problems:
1. **Limitations of visual methods**: Most existing automatic adult video detection relies on visual classification. These methods are sensitive to image quality (lighting, blurring, and so on) and require large amounts of computing resources and storage space.
2. **Advantages and disadvantages of audio methods**: Audio-based methods have lower computing and storage requirements and can use distinct spectral features to separate pornographic from non-pornographic audio. However, there is relatively little research in this area, especially on applying deep learning.
### Research objectives:
- Explore and compare the performance of different neural network architectures, such as a fully connected feed-forward neural network (FFNN) and a convolutional neural network (CNN), in pornographic sound detection.
- Evaluate the impact of different acoustic features, such as MFCCs and log-mel spectrograms, on model performance.
- Propose and evaluate methods for converting segment-level predictions into audio-level predictions, so that a whole audio file can be classified rather than only its segments.
### Core contributions:
- The CNN trained on log-mel spectrograms achieved the best segment-level and audio-level detection performance on the Pornography-800 dataset.
- Voting was found to be the most effective segment-to-audio conversion method, yielding the best audio-level detection results.
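The segment-to-audio voting step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `vote_segments`, the 0.5 decision threshold, and the tie-breaking rule are all assumptions, since the summary does not specify them.

```python
from typing import Sequence


def vote_segments(segment_probs: Sequence[float], threshold: float = 0.5) -> int:
    """Majority vote over per-segment predictions.

    segment_probs: predicted probability of the positive (pornographic) class
                   for each segment of one audio file.
    threshold:     hypothetical cut-off used to turn a probability into a vote.
    Returns 1 if more than half of the segments vote positive, otherwise 0.
    """
    if not segment_probs:
        raise ValueError("need at least one segment prediction")

    # Each segment casts one vote: positive if its probability reaches the threshold.
    positive_votes = sum(1 for p in segment_probs if p >= threshold)

    # Strict majority; ties fall to the negative class (an arbitrary choice here).
    return 1 if positive_votes * 2 > len(segment_probs) else 0


if __name__ == "__main__":
    # Two of three segments vote positive, so the whole audio is labeled positive.
    print(vote_segments([0.9, 0.8, 0.2]))
```

One property of voting, compared with averaging probabilities, is that a single very confident segment cannot outweigh several moderately confident segments of the opposite class.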
### Formula representation:
The formulas involved in the paper mainly cover audio feature extraction and model training. For example:
- The calculation formula for the log-mel spectrogram: \[ S_{\text{log-mel}} = \log\left(1 + 1000 \cdot |\mathrm{STFT}(x)|^{2}\right) \] where \(x\) is the audio signal and \(\mathrm{STFT}(x)\) is the result of the short-time Fourier transform.
- The binary cross-entropy loss function used in the model training process: \[ L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right] \] where \(y_{i}\) is the true label of sample \(i\), \(\hat{y}_{i}\) is its predicted probability, and \(N\) is the number of samples.

Through these methods and analyses, the authors have successfully improved the performance of pornographic audio detection and provided valuable references for future research.
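The two formulas above can be checked numerically with a short sketch. It uses only the Python standard library; the function names `log_compress` and `binary_cross_entropy` and the clipping constant `eps` are assumptions made for illustration, and the STFT and any mel filterbank step are left out, so `log_compress` acts on a power value that has already been computed.

```python
import math
from typing import Sequence


def log_compress(power: float, gain: float = 1000.0) -> float:
    """Log compression log(1 + gain * power), matching the formula as stated above.

    `power` stands in for one |STFT(x)|^2 value; computing it is out of scope here.
    """
    return math.log(1.0 + gain * power)


def binary_cross_entropy(
    y_true: Sequence[int], y_pred: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy over N samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the prediction to avoid log(0); eps is an assumed safeguard.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


if __name__ == "__main__":
    # A maximally uncertain classifier (p = 0.5 everywhere) has loss log(2).
    print(binary_cross_entropy([1, 0], [0.5, 0.5]))
    print(log_compress(0.0))
```

The loss decreases as the predicted probability of the true class approaches 1, which is why a prediction of 0.9 for a positive sample scores lower than a prediction of 0.6.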