Abstract: In recent years, numerous “face-swapping” videos have emerged on social networks, and lip forgery of speakers is one representative type. While such videos make life more entertaining for the public, they pose a significant threat to personal privacy and property security in cyberspace. Under lossless conditions, most existing lip forgery detection methods achieve good performance. In practice, however, compression is widely used, especially on social media platforms, in face recognition, and in other scenarios. While reducing spatial and temporal redundancy, compression degrades video quality and destroys the pixel-to-pixel and frame-to-frame coherence in the spatial domain, which degrades detection performance and can even cause real videos to be misjudged. When the spatial domain cannot provide sufficiently effective features, frequency-domain information becomes a natural research priority because it is more robust to compression interference. To address this problem, the advantages of frequency information in image structure and gradient feedback were analyzed, and a lip forgery detection method based on spatial-frequency domain combination was proposed, which exploits the respective characteristics of spatial-domain and frequency-domain information. For lip features in the spatial domain, an adaptive extraction network and a lightweight attention module were designed. For features in the frequency domain, separate extraction and fusion modules for different frequency components were designed. Subsequently, a weighted fusion of the spatial-domain lip features and the frequency-domain features preserved more texture information. In addition, fine-grained constraints were designed for training to enlarge the inter-class distance between real and fake lip features while reducing the intra-class distance. Experimental results show that, benefiting from the frequency information, the proposed method improves detection accuracy under compression and exhibits a degree of transferability. Furthermore, the ablation study on the core modules verifies the effectiveness of the frequency components for compression resistance and of the dual loss function constraint in training.
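
The abstract does not specify the frequency transform, the fusion weights, or the exact form of the fine-grained constraints, so the following is only a minimal NumPy sketch under stated assumptions: an FFT mask stands in for the separation of frequency components, a single scalar weight `alpha` stands in for the weighted fusion, and a center-style loss with a margin stands in for the dual constraint. All function names and parameters (`split_frequency_components`, `weighted_fusion`, `dual_constraint_loss`, `radius`, `alpha`, `margin`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def split_frequency_components(patch, radius):
    """Split a 2-D grayscale patch into low- and high-frequency images.

    Assumption: a circular FFT mask is used here purely for illustration;
    the paper may use a different transform or decomposition.
    """
    h, w = patch.shape
    spectrum = np.fft.fftshift(np.fft.fft2(patch))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


def weighted_fusion(spatial_feat, freq_feat, alpha):
    """Convex combination of spatial and frequency feature vectors.

    Assumption: one scalar weight in [0, 1]; a learned weight is equally plausible.
    """
    return alpha * spatial_feat + (1.0 - alpha) * freq_feat


def dual_constraint_loss(features, labels, margin=1.0):
    """Intra-class compactness plus an inter-class margin term.

    features: (N, D) array; labels: (N,) array with 0 = real, 1 = fake.
    Assumption: both classes are present in the batch.
    """
    centers = {c: features[labels == c].mean(axis=0) for c in (0, 1)}
    intra = 0.0
    for c in (0, 1):
        diffs = features[labels == c] - centers[c]
        intra += np.mean(np.sum(diffs ** 2, axis=1))  # pull samples to their center
    center_gap = np.linalg.norm(centers[0] - centers[1])
    inter = max(0.0, margin - center_gap)  # push class centers at least `margin` apart
    return intra + inter
```

In this sketch the two frequency components sum back to the original patch, so no information is discarded by the split. The loss is zero only when each class collapses onto its center and the two centers are at least `margin` apart.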
