Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes

Pavel Korshunov, Haolin Chen, Philip N. Garner, Sebastien Marcel
2023-11-29
Abstract: The task of deepfake detection is far from being solved by speech or vision researchers. Several publicly available databases of fake synthetic video and speech were built to aid the development of detection methods. However, existing databases typically focus on visual or voice modalities and provide no proof that their deepfakes can in fact impersonate any real person. In this paper, we present SWAN-DF, the first realistic audio-visual database of deepfakes, where lips and speech are well synchronized and the videos have high visual and audio quality. We took the publicly available SWAN dataset of real videos with different identities to create audio-visual deepfakes, using several models from DeepFaceLab and blending techniques for face swapping, and the HiFiVC, DiffVC, YourTTS, and FreeVC models for voice conversion. From the publicly available speech dataset LibriTTS, we also created LibriTTS-DF, a separate database of audio-only deepfakes, using several recent text-to-speech methods: YourTTS, Adaspeech, and TorToiSe. We demonstrate the vulnerability of a state-of-the-art speaker recognition system, such as the ECAPA-TDNN-based model from SpeechBrain, to the synthetic voices. Similarly, we tested a face recognition system based on the MobileFaceNet architecture against several variants of our visual deepfakes. The vulnerability assessment shows that by tuning existing pretrained deepfake models to specific identities, one can successfully spoof the face and speaker recognition systems more than 90% of the time and achieve a very realistic looking and sounding fake video of a given person.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the vulnerability of automatic identity recognition systems (including speaker recognition and face recognition) to audio-visual deepfakes. Specifically, the paper focuses on the following aspects:

1. **Limitations of Existing Databases**:
   - Existing deepfake databases usually focus only on visual or audio modalities and do not verify whether their deepfakes can actually impersonate specific real individuals.
   - The quality of deepfake videos in these databases is inconsistent, with some being merely slight distortions of the original videos rather than genuine deepfakes.
2. **Creating a High-Quality Audio-Visual Deepfake Database**:
   - The paper introduces SWAN-DF, the first high-fidelity, publicly available audio-visual deepfake database, in which lips and speech are well synchronized and the videos have high visual and audio quality.
   - The database is based on the publicly available SWAN dataset, with audio-visual deepfake samples generated using various deepfake models and techniques.
3. **Evaluating the Threat of Deepfakes to Identity Recognition Systems**:
   - The paper evaluates the vulnerability of state-of-the-art speaker recognition systems (such as models based on ECAPA-TDNN) and face recognition systems (such as models based on MobileFaceNet) to the generated deepfake samples.
   - The results show that by tuning existing pre-trained deepfake models to specific identities, it is possible to deceive these identity recognition systems more than 90% of the time.

### Main Contributions

1. **Creation of the SWAN-DF Database**:
   - Provides a high-quality audio-visual deepfake database containing different versions of deepfake videos generated by multiple models and blending techniques.
   - The deepfake samples in the database can realistically mimic the facial and vocal features of target individuals.
2. **Evaluation of Identity Recognition Systems' Vulnerability**:
   - Experiments verify how well different deepfake generation methods retain identity information.
   - Demonstrates the significant threat of deepfakes to existing identity recognition systems, especially after tuning for specific identities.
3. **Public Resources**:
   - Provides generated audio and video samples, file lists, subset divisions, vulnerability analysis source code, and Jupyter notebooks containing complete results and charts, so that researchers can use and verify the database transparently.

### Summary

By creating the high-quality audio-visual deepfake database SWAN-DF, this paper evaluates the vulnerability of existing identity recognition systems to deepfakes and demonstrates the significant threat deepfakes pose to these systems. The research results highlight the urgency of developing more effective deepfake detection methods.
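### Illustrative Sketch: Measuring a Spoofing Success Rate

The "more than 90%" figure above is a rate of deepfake attempts accepted by a recognition system. The sketch below shows one common way such a rate can be computed from similarity scores; it is not the paper's evaluation code. The function names, the strict `score > threshold` decision rule, the choice of operating point, and all score values are assumptions made for illustration. A rate of this kind is often reported as the impostor attack presentation match rate (IAPMR).

```python
from typing import Sequence


def threshold_at_fmr(impostor_scores: Sequence[float], target_fmr: float) -> float:
    """Return a threshold whose false match rate on zero-effort impostor
    scores does not exceed target_fmr (decision rule: accept if score > threshold)."""
    if not impostor_scores:
        raise ValueError("need at least one impostor score")
    ranked = sorted(impostor_scores, reverse=True)
    # Maximum number of impostor scores that may be accepted.
    allowed = int(target_fmr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every impostor may be accepted
    # Only scores strictly above ranked[allowed] pass, i.e. at most `allowed` of them.
    return ranked[allowed]


def acceptance_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of scores accepted by the system (score > threshold)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s > threshold) / len(scores)


# Hypothetical similarity scores (e.g. cosine similarity between embeddings).
zero_effort_impostors = [0.10, 0.20, 0.30, 0.60]  # different person, no attack
genuine = [0.70, 0.80, 0.90, 0.25]                # same person, real recording
deepfakes = [0.65, 0.75, 0.20, 0.50, 0.90]        # fake compared to its target

# Fix the operating point on real data only, then apply it to the deepfakes.
threshold = threshold_at_fmr(zero_effort_impostors, target_fmr=0.25)
fmr = acceptance_rate(zero_effort_impostors, threshold)
fnmr = 1.0 - acceptance_rate(genuine, threshold)
spoof_rate = acceptance_rate(deepfakes, threshold)

print(f"threshold={threshold:.2f} FMR={fmr:.2f} FNMR={fnmr:.2f} spoof rate={spoof_rate:.2f}")
```

The threshold is chosen from genuine and zero-effort impostor scores alone, so the deepfake acceptance rate reflects how the system would behave at an operating point set without knowledge of the attack.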