A Video Visual Security Metric Based on Spatiotemporal Self-Attention
Bo Tang, Fengdong Li, Jianbo Liu, Cheng Yang
DOI: https://doi.org/10.1109/tifs.2024.3459731
IF: 7.231
2024-10-09
IEEE Transactions on Information Forensics and Security
Abstract: The Visual Security Index (VSI) of encrypted videos measures the security of encryption algorithms by evaluating the visual information content that remains visible, which provides a critical evaluation criterion for selective encryption. The VSI for encrypted videos needs to assess security in both the spatial and temporal domains. Existing visual security metrics, which rely on averaging, optical flow, and convolutions, fail to capture information leakage in the temporal domain effectively. This paper proposes a spatiotemporal self-attention-based video security assessment model called Spatiotemporal Self-Attention (StSA). In the spatial domain, windowed self-attention is used to calculate regional correlations within video frames. By introducing multi-layer outputs, a multi-depth self-attention network named Multi-Depth Swin-Transformer (MDST) is constructed to compute these regional correlations at multiple depths. A weak-label calculation method based on edge similarity is proposed to derive frame-level and block-level scores from the video Mean Opinion Score (MOS), thereby supporting the pre-training of the spatial model. In the temporal domain, considering the persistence of human vision and the one-way relationship between video frames, temporal unidirectional window self-attention is proposed to calculate frame correlations along the temporal sequence. Finally, the visual security index score for an encrypted video is obtained by combining the changes in spatiotemporal correlation between the encrypted and plaintext videos. Experimental results show that StSA achieves a Pearson Linear Correlation Coefficient (PLCC) of 0.955 and a Root Mean Squared Error (RMSE) of 0.458 on the encryption datasets. Compared with other visual security metrics, StSA demonstrates higher accuracy and correlation, effectively capturing spatiotemporal information leakage in encrypted videos and reflecting human perception of security.
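The temporal unidirectional window self-attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `unidirectional_window_attention`, the `window` parameter, and the use of raw frame features as queries, keys, and values (with no learned projections) are all assumptions made for illustration. The sketch only shows the masking idea, in which each frame attends to itself and a fixed number of preceding frames and never to later ones.

```python
import math


def unidirectional_window_attention(frames, window):
    """Illustrative one-way windowed attention over per-frame feature vectors.

    frames: list of equal-length lists of floats, one vector per frame.
    window: number of frames (current plus preceding) each frame may attend to.
    Assumption: queries, keys, and values are the raw features themselves.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for t, query in enumerate(frames):
        # Only the current frame and up to (window - 1) earlier frames are visible.
        start = max(0, t - window + 1)
        keys = frames[start:t + 1]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Full-length attention row: zero weight outside the one-way window.
        row = [0.0] * len(frames)
        for offset, p in enumerate(probs):
            row[start + offset] = p
        weights.append(row)
        # Attention output is the weighted sum of the visible frame vectors.
        outputs.append(
            [sum(p * key[d] for p, key in zip(probs, keys)) for d in range(dim)]
        )
    return outputs, weights
```

Calling `unidirectional_window_attention(frames, 2)` on a short list of frame vectors returns one output vector per frame together with an attention matrix whose entries are zero for every future frame and for every frame further back than the window.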
computer science, theory & methods; engineering, electrical & electronic