Mobile Recording Device Recognition Based Cross-Scale and Multi-Level Representation Learning

Chunyan Zeng, Yuhao Zhao, Zhifeng Wang
2024-11-06
Abstract: This paper introduces a modeling approach that employs multi-level global processing, encompassing both short-term frame-level and long-term sample-level feature scales. In the initial stage of shallow feature extraction, various scales are employed to extract multi-level features, including Mel-Frequency Cepstral Coefficients (MFCC) and pre-Fbank log energy spectrum. The construction of the identification network model involves considering the input two-dimensional temporal features from both frame and sample levels. Specifically, the model initially employs one-dimensional convolution-based Convolutional Long Short-Term Memory (ConvLSTM) to fuse spatiotemporal information and extract short-term frame-level features. Subsequently, Bidirectional Long Short-Term Memory (BiLSTM) is utilized to learn long-term sample-level sequential representations. The Transformer encoder then performs cross-scale, multi-level processing on global frame-level and sample-level features, facilitating deep feature representation and fusion at both levels. Finally, recognition results are obtained through Softmax. Our method achieves 99.6% recognition accuracy on the CCNU_Mobile dataset, an improvement of 2% to 12% over the baseline system. Additionally, we investigate the transferability of our model, achieving 87.9% accuracy in a classification task on a new dataset.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is **mobile recording device identification**: determining which mobile recording device an audio file came from. The problem matters for judicial evidence collection and for protecting personal intellectual property.

### Problem Background

Digital media such as audio and video are now widely used in social interaction, communication, and media dissemination, which makes sharing information more convenient. Because digital media files are easy to edit and modify, however, audio files are also easy to tamper with and forge maliciously, which creates information security problems and raises the associated risks. Source identification for mobile recording devices has therefore become crucial.

### Limitations of Existing Research

Existing research mainly analyzes the input features in depth with a single network module and usually attends only to frame-level features, ignoring sample-level features and global temporal information. The resulting feature representation is incomplete, which hurts recognition performance.

### Solutions in the Paper

To address these limitations, the paper proposes a method based on cross-scale and multi-level representation learning, with the following components:

1. **Front-End Feature Extraction**:
   - Several shallow feature extraction methods are applied to the original audio to obtain multi-scale features, including Mel-Frequency Cepstral Coefficients (MFCC) and the pre-Fbank log-energy spectrum.
   - These features integrate temporal and spatial representations and provide richer information.
2. **Back-End Recognition Network Model**:
   - A Convolutional Long Short-Term Memory network (ConvLSTM) based on one-dimensional convolution fuses spatiotemporal information and extracts short-term frame-level features.
   - A Bidirectional Long Short-Term Memory network (BiLSTM) learns long-term sample-level sequence representations.
   - A Transformer encoder performs cross-scale, multi-level processing on the global frame-level and sample-level features to achieve deep feature representation and fusion.
3. **Global Feature Fusion**:
   - The Transformer encoder lets information at different scales (frame-level, sample-level, and cross-scale) interact, which yields better embedding representations and improves the final recognition performance.

### Experimental Results

The method achieves a recognition accuracy of 99.6% on the CCNU_Mobile dataset, which is 2% to 12% higher than the baseline system. In addition, by fine-tuning the pre-trained model with small-batch sample transfer training on a new dataset, the method reaches an accuracy of 87.9% on the new classification task.

### Summary

The main contribution of the paper is that multi-scale, multi-level feature extraction and fusion significantly improves the accuracy of mobile recording device source identification. The method performs well on the existing dataset and also shows good transferability.
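
To make the back-end pipeline described above more concrete (ConvLSTM, then BiLSTM, then Transformer encoder, then Softmax), here is a minimal PyTorch sketch. It is an illustration under assumptions, not the authors' implementation: the class names `ConvLSTM1dCell` and `CrossScaleNet`, all layer sizes, the number of classes, the mean pooling steps, and the choice to fuse the two branches by concatenating their token sequences are hypothetical. Positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class ConvLSTM1dCell(nn.Module):
    """One ConvLSTM step whose gates come from a 1-D convolution (hypothetical helper)."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single convolution produces the input, forget, candidate, and output gates.
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class CrossScaleNet(nn.Module):
    """Sketch: ConvLSTM branch + BiLSTM branch, fused by a Transformer encoder."""

    def __init__(self, sample_dim=40, d_model=64, n_classes=10, hid_ch=8):
        super().__init__()
        self.cell = ConvLSTM1dCell(1, hid_ch)
        self.frame_proj = nn.Linear(hid_ch, d_model)
        # Bidirectional LSTM: two directions of d_model // 2 give d_model outputs.
        self.bilstm = nn.LSTM(sample_dim, d_model // 2,
                              batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, frame_feats, sample_feats):
        # frame_feats: (batch, frames, bins); sample_feats: (batch, steps, sample_dim)
        b, t, f = frame_feats.shape
        h = frame_feats.new_zeros(b, self.cell.hid_ch, f)
        c = torch.zeros_like(h)
        steps = []
        for k in range(t):
            # Each frame is treated as a 1-channel signal along the feature axis.
            h, c = self.cell(frame_feats[:, k].unsqueeze(1), (h, c))
            steps.append(h.mean(dim=2))  # pool over the feature axis
        frame_tokens = self.frame_proj(torch.stack(steps, dim=1))
        sample_tokens, _ = self.bilstm(sample_feats)
        # Assumption: cross-scale fusion via self-attention over both token sets.
        tokens = torch.cat([frame_tokens, sample_tokens], dim=1)
        fused = self.encoder(tokens).mean(dim=1)
        return self.head(fused)  # class logits


model = CrossScaleNet().eval()
frame_feats = torch.randn(2, 20, 40)   # random stand-in for frame-level features
sample_feats = torch.randn(2, 20, 40)  # random stand-in for sample-level features
with torch.no_grad():
    probs = torch.softmax(model(frame_feats, sample_feats), dim=-1)
print(probs.shape)  # torch.Size([2, 10])
```

Concatenating the two token sequences before the encoder lets every frame-level token attend to every sample-level token and vice versa, which is one simple way to realize the cross-scale interaction the summary describes. The paper may fuse the two levels differently.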