Abstract: This paper introduces a modeling approach that employs multi-level global processing, encompassing both short-term frame-level and long-term sample-level feature scales. In the initial stage of shallow feature extraction, various scales are employed to extract multi-level features, including Mel-Frequency Cepstral Coefficients (MFCC) and the pre-Fbank log energy spectrum. The identification network model treats the two-dimensional temporal input features at both the frame level and the sample level. Specifically, the model first employs a Convolutional Long Short-Term Memory (ConvLSTM) network based on one-dimensional convolution to fuse spatiotemporal information and extract short-term frame-level features. Subsequently, a Bidirectional Long Short-Term Memory (BiLSTM) network is used to learn long-term sample-level sequential representations. The Transformer encoder then performs cross-scale, multi-level processing on global frame-level and sample-level features, facilitating deep feature representation and fusion at both levels. Finally, recognition results are obtained through Softmax. Our method achieves 99.6% recognition accuracy on the CCNU_Mobile dataset, an improvement of 2% to 12% over the baseline system. Additionally, we investigate the transferability of our model, achieving 87.9% accuracy in a classification task on a new dataset.
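As a rough illustration of the shallow feature extraction stage mentioned in the abstract, the following sketch computes log mel filterbank energies and MFCCs with NumPy. It is not the authors' code: the sampling rate, frame length, hop size, number of mel bands, and number of cepstral coefficients are all assumed values, and because the abstract does not say what the "pre-Fbank" preprocessing involves, a plain log mel filterbank is used in its place.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):          # rising slope
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping Hamming-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def log_fbank_and_mfcc(x, sr=16000, frame_len=400, hop=160,
                       n_fft=512, n_mels=40, n_mfcc=13):
    """Return (log_fbank, mfcc), each with one row per frame.

    All default parameter values are assumptions, not taken from the paper.
    """
    frames = frame_signal(x, frame_len, hop)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    log_fbank = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Unnormalised DCT-II over the mel axis gives the cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    mfcc = log_fbank @ dct.T
    return log_fbank, mfcc

# Usage on one second of synthetic noise standing in for a recording.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
log_fbank, mfcc = log_fbank_and_mfcc(signal)
print(log_fbank.shape, mfcc.shape)  # (98, 40) (98, 13)
```

Both outputs are two-dimensional (time by feature) arrays, which matches the "two-dimensional temporal features" the abstract says are fed to the network.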

What problem does this paper attempt to address?

The problem that this paper attempts to solve is **Mobile Recording Device Identification**. Specifically, the paper aims to develop a method that can effectively identify the mobile recording device from which an audio file originates. This problem matters for judicial evidence collection and for the protection of personal intellectual property.

### Problem Background

With the wide use of digital media (such as audio and video) in social networking, communication, and media dissemination, sharing information has become more convenient. However, because digital media files are easy to edit and modify, audio files are also easy to tamper with and forge maliciously, which creates information security problems and raises the associated risks. Source identification of mobile recording devices has therefore become crucial.

### Limitations of Existing Research

Existing research mainly analyzes input features in depth with a single network module and usually considers only frame-level features, ignoring the integration of sample-level features and global temporal information. This leaves the feature representation incomplete and hurts recognition performance.

### Solutions in the Paper

To address these limitations, the paper proposes a method based on cross-scale and multi-level representation learning, which includes the following aspects:

1. **Front-End Feature Extraction**:
   - Multi-scale features are extracted from the original audio using several shallow feature extraction methods, including Mel-Frequency Cepstral Coefficients (MFCC) and pre-Fbank log-energy spectra.
   - These features integrate temporal and spatial representations and provide richer information.
2. **Back-End Recognition Network Model**:
   - A Convolutional Long Short-Term Memory network (ConvLSTM) based on one-dimensional convolution fuses spatio-temporal information and extracts short-term frame-level features.
   - A Bidirectional Long Short-Term Memory network (BiLSTM) learns long-term sample-level sequence representations.
   - A Transformer encoder performs cross-scale, multi-level processing on the global frame-level and sample-level features to achieve deep feature representation and fusion.
3. **Global Feature Fusion**:
   - Through the Transformer encoder, information at different scales (frame-level, sample-level, and cross-scale) is processed interactively to provide better embedding representations, thereby improving the final recognition performance.

### Experimental Results

The method achieves a recognition accuracy of 99.6% on the CCNU_Mobile dataset, which is 2% to 12% higher than the baseline system. In addition, by fine-tuning the pre-trained model with small-batch transfer training on a new dataset, the method achieves an accuracy of 87.9% on a new classification task.

### Summary

The main contribution of this paper is that multi-scale, multi-level feature extraction and fusion significantly improve the accuracy of mobile recording device source identification. The method performs well on the existing dataset and also shows good transferability.
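The back-end described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `ConvLSTM1dCell` and `DeviceIdNet`, all layer sizes, the number of classes, and the exact wiring (the BiLSTM consuming the ConvLSTM outputs, and both token sequences being concatenated before the Transformer encoder) are hypothetical choices made only to show how the three components could fit together.

```python
import torch
import torch.nn as nn

class ConvLSTM1dCell(nn.Module):
    """LSTM cell whose gate transforms are 1-D convolutions over the feature axis."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class DeviceIdNet(nn.Module):
    """Hypothetical ConvLSTM -> BiLSTM -> Transformer encoder classifier."""
    def __init__(self, n_feats=40, n_classes=10, hid_ch=8, d_model=64):
        super().__init__()
        self.cell = ConvLSTM1dCell(1, hid_ch)
        self.frame_proj = nn.Linear(hid_ch * n_feats, d_model)
        self.bilstm = nn.LSTM(d_model, d_model // 2,
                              batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, feats):
        # feats: (batch, time, n_feats), e.g. log-Fbank or MFCC frames.
        b, t, f = feats.shape
        h = feats.new_zeros(b, self.cell.hid_ch, f)
        c = torch.zeros_like(h)
        frame_tokens = []
        for step in range(t):
            # Recur over time while convolving along the feature axis.
            h, c = self.cell(feats[:, step].unsqueeze(1), (h, c))
            frame_tokens.append(self.frame_proj(h.flatten(1)))
        frame_tokens = torch.stack(frame_tokens, dim=1)   # (b, t, d_model)
        sample_tokens, _ = self.bilstm(frame_tokens)      # (b, t, d_model)
        # Assumed fusion: let self-attention mix both token sequences.
        fused = self.encoder(torch.cat([frame_tokens, sample_tokens], dim=1))
        return self.classifier(fused.mean(dim=1))         # class logits

# Usage: two random feature sequences of 20 frames each.
torch.manual_seed(0)
model = DeviceIdNet().eval()
with torch.no_grad():
    probs = torch.softmax(model(torch.randn(2, 20, 40)), dim=-1)
print(probs.shape)  # torch.Size([2, 10])
```

The forward pass returns logits so that a cross-entropy loss can be applied during training, with Softmax applied only at inference as in the usage lines. For transfer to a new set of devices, one plausible approach (again an assumption) would be to replace `classifier` with a freshly initialized layer and fine-tune on the new data.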

Mobile Recording Device Recognition Based Cross-Scale and Multi-Level Representation Learning

Vehicle Behavior Recognition using Multi-Stream 3D Convolutional Neural Network

Spatio-temporal representation learning enhanced source cell-phone recognition from speech recordings

End-to-end Recording Device Identification Based on Deep Representation Learning

Pedestrian Recognition in Multi-Camera Networks Using Multilevel Important Salient Feature and Multicategory Incremental Learning.

Cross-Scene Building Identification Based on Dual-Stream Neural Network and Efficient Channel Attention Mechanism

Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms

Multi-layer feature fusion and attention enhancement for fine-grained vehicle recognition research

Micro-expression recognition based on multi-scale 3D residual convolutional neural network

Two-Level Spatio-Temporal Feature Fused Two-Stream Network for Micro-Expression Recognition

A Framework of Combining Short-Term Spatial/Frequency Feature Extraction and Long-Term IndRNN for Activity Recognition

Multi-channel Capsule Network for Micro-expression Recognition with Multiscale Fusion

A multi-scale feature extraction fusion model for human activity recognition

Audio source recording device recognition based on representation learning of sequential Gaussian mean matrix

Multi-Scale Deep Feature Fusion for Vehicle Re-Identification

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Facial Micro-Expression Recognition Based on Multi-Scale Temporal and Spatial Features

Enhancing Automatic Modulation Recognition through Robust Global Feature Extraction

CD-CNN: A Partially Supervised Cross-Domain Deep Learning Model for Urban Resident Recognition

CSCNN: Lightweight Modulation Recognition Model for Mobile Multimedia Intelligent Information Processing