STFF-SM: Steganalysis Model Based on Spatial and Temporal Feature Fusion for Speech Streams

Hui Tian,Yiqin Qiu,Wojciech Mazurczyk,Haizhou Li,Zhenxing Qian
DOI: https://doi.org/10.1109/taslp.2022.3224295
2023-01-01
Abstract:The real-time detection of speech steganography in Voice-over-Internet-Protocol (VoIP) scenarios remains an open problem, as it requires steganalysis methods to perform for low-intensity embeddings and short-sample inputs, as well as provide rapid detection results. To address these challenges, this paper presents a novel steganalysis model based on spatial and temporal feature fusion (STFF-SM). Differing from the existing methods, we take both the integer and fractional pitch delays as input, and design subframe-stitch module to organically integrate subframe-wise integer delays and frame-wise fractional pitch delays. Further, we design a spatial fusion module based on pre-activation residual convolution to extract the pitch spatial features and gradually increase their dimensions to discover finer steganographic distortions to enhance the detection effect, where a Group-Squeeze-Weighting block is introduced to alleviate the information loss in the process of increasing the feature dimension. In addition, we design a temporal fusion module to extract pitch temporal features using the stacked LSTM, where a Gated Feed-Forward Network is introduced to learn the interaction between different feature maps while suppressing the features that are not useful for detection. We evaluated the performance of STFF-SM through comprehensive experiments and comparisons with the state-of-the-art solutions. The experimental results demonstrate that STFF-SM can well meet the needs of real-time detection of speech steganography in VoIP streams, and outperforms the existing methods in detection performance, especially with low embedding strengths and short window sizes.
What problem does this paper attempt to address?