Spatio-temporal Features for Generalized Detection of Deepfake Videos

Ipek Ganiyusufoglu, L. Minh Ngô, Nedko Savov, Sezer Karaoglu, Theo Gevers
DOI: https://doi.org/10.48550/arXiv.2010.11844
2020-10-23
Abstract: For deepfake detection, video-level detectors have not been explored as extensively as image-level detectors, which do not exploit temporal data. In this paper, we empirically show that existing image and sequence classifiers generalize poorly to new manipulation techniques. To address this, we propose spatio-temporal features, modeled by 3D CNNs, to extend generalization to new kinds of deepfake videos. We show that spatial features learn distinct deepfake-method-specific attributes, while spatio-temporal features capture attributes shared between deepfake methods. We provide an in-depth analysis of how sequential and spatio-temporal video encoders utilize temporal information, using the DFDC dataset (arXiv:2006.07397). This analysis reveals that our approach captures local spatio-temporal relations and inconsistencies in deepfake videos, while existing sequence encoders are indifferent to them. Through large-scale experiments conducted on the FaceForensics++ (arXiv:1901.08971) and Deeper Forensics (arXiv:2001.03024) datasets, we show that our approach outperforms existing methods in terms of generalization capabilities.
Computer Vision and Pattern Recognition