Abstract:Face forgery videos have caused severe public concerns, and many detectors have been proposed. However, most of these detectors suffer from limited generalization when detecting videos from unknown distributions, such as from unseen forgery methods. In this paper, we find that different forgery videos have distinct spatiotemporal patterns, which may be the key to generalization. To leverage this finding, we propose a Latent Spatiotemporal Adaptation~(LAST) approach to facilitate generalized face forgery video detection. The key idea is to optimize the detector adaptive to the spatiotemporal patterns of unknown videos in latent space to improve the generalization. Specifically, we first model the spatiotemporal patterns of face videos by incorporating a lightweight CNN to extract local spatial features of each frame and then cascading a vision transformer to learn the long-term spatiotemporal representations in latent space, which should contain more clues than in pixel space. Then by optimizing a transferable linear head to perform the usual forgery detection task on known videos and recover the spatiotemporal clues of unknown target videos in a semi-supervised manner, our detector could flexibly adapt to unknown videos' spatiotemporal patterns, leading to improved generalization. Additionally, to eliminate the influence of specific forgery videos, we pre-train our CNN and transformer only on real videos with two simple yet effective self-supervised tasks: reconstruction and contrastive learning in latent space and keep them frozen during fine-tuning. Extensive experiments on public datasets demonstrate that our approach achieves state-of-the-art performance against other competitors with impressive generalization and robustness.
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is the limited generalization ability of existing fake video detectors when facing unknown data distributions (for example, videos from unseen forgery methods). Specifically, existing detectors perform well in detecting specific fake videos, but when encountering videos generated by new or different forgery methods, their performance will decline significantly. This is because videos generated by different forgery methods have different spatio - temporal patterns, and these differences lead to distribution gaps in the latent space, thus affecting the generalization ability of the detectors.
To solve this problem, the author proposes a method named **Latent Spatiotemporal Adaptation (LAST)**, aiming to improve the generalization ability by optimizing the detector to adapt to the spatio - temporal patterns of unknown videos. The specific technical details are as follows:
1. **Spatio - temporal Representation Learning**:
- Use a lightweight CNN to extract local spatial features of each frame.
- Then use a Vision Transformer to learn long - term spatio - temporal representations, which contain more useful cues than the pixel space.
2. **Latent Spatiotemporal Adaptation**:
- By optimizing a transferable linear head, perform the regular forgery detection task on known videos and recover the spatio - temporal cues of unknown target videos in a semi - supervised manner, enabling the detector to flexibly adapt to the spatio - temporal patterns of unknown videos.
- Introduce a CNN reconstructor to recover local spatial features from the learned spatio - temporal representations, which can force the model to learn more about the spatio - temporal cues of the target unknown videos, thus alleviating the distribution gap between known and unknown videos in the latent space.
3. **Common Spatio - temporal Initialization**:
- To eliminate the influence of specific fake videos, the author proposes to pre - train the CNN and the transformer only on real videos, using two simple self - supervised tasks: reconstruction learning and contrast learning.
- The reconstruction learning task aims to recover local spatial features from the learned spatio - temporal representations, ensuring that the latent space contains more spatio - temporal cues about the input real videos.
- The contrast learning task learns the common spatio - temporal representations of real videos by exploring the spatio - temporal relationships within the same video and between different videos.
Through the above methods, LAST can flexibly adapt to the spatio - temporal patterns of unknown videos without relying on the true labels of unknown videos, thereby improving the generalization ability and robustness of the detector. Experimental results show that LAST has achieved better performance than other competing methods on multiple public datasets.