Abstract: The task of deepfake detection is far from being solved by speech or vision researchers. Several publicly available databases of fake synthetic video and speech were built to aid the development of detection methods. However, existing databases typically focus on visual or voice modalities and provide no proof that their deepfakes can in fact impersonate any real person. In this paper, we present SWAN-DF, the first realistic audio-visual database of deepfakes, where lips and speech are well synchronized and the videos have high visual and audio quality. We took the publicly available SWAN dataset of real videos with different identities to create audio-visual deepfakes, using several models from DeepFaceLab and blending techniques for face swapping, and the HiFiVC, DiffVC, YourTTS, and FreeVC models for voice conversion. From the publicly available speech dataset LibriTTS, we also created a separate database of audio-only deepfakes, LibriTTS-DF, using several recent text-to-speech methods: YourTTS, Adaspeech, and TorToiSe. We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain, to the synthetic voices. Similarly, we tested a face recognition system based on the MobileFaceNet architecture against several variants of our visual deepfakes. The vulnerability assessment shows that by tuning the existing pretrained deepfake models to specific identities, one can successfully spoof the face and speaker recognition systems more than 90% of the time and achieve a very realistic looking and sounding fake video of a given person.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the vulnerability of automatic identity recognition systems (including speaker recognition and face recognition) to audio-visual deepfakes. Specifically, the paper focuses on the following aspects:

1. **Limitations of Existing Databases**:
   - Existing deepfake databases usually focus on only the visual or the audio modality and do not verify whether their deepfakes can actually impersonate specific real individuals.
   - The quality of deepfake videos in these databases is inconsistent, with some being merely slight distortions of the original videos rather than genuine deepfakes.
2. **Creating a High-Quality Audio-Visual Deepfake Database**:
   - The paper introduces SWAN-DF, the first high-fidelity, publicly available audio-visual deepfake database, in which lips and speech are well synchronized and the videos have high visual and audio quality.
   - The database is based on the publicly available SWAN dataset, with audio-visual deepfake samples generated using various deepfake models and techniques.
3. **Evaluating the Threat of Deepfakes to Identity Recognition Systems**:
   - The paper evaluates the vulnerability of state-of-the-art speaker recognition systems (such as models based on ECAPA-TDNN) and face recognition systems (such as models based on MobileFaceNet) to the generated deepfake samples.
   - The results show that by tuning existing pre-trained deepfake models to specific identities, it is possible to deceive these identity recognition systems with a success rate of over 90%.

### Main Contributions

1. **Creation of the SWAN-DF Database**:
   - Provides a high-quality audio-visual deepfake database containing different versions of deepfake videos generated by multiple models and blending techniques.
   - The deepfake samples in the database can realistically mimic the facial and vocal features of target individuals.
2. **Evaluation of Identity Recognition Systems' Vulnerability**:
   - Experiments verify how well different deepfake generation methods retain identity information.
   - Demonstrates the significant threat of deepfakes to existing identity recognition systems, especially after tuning for specific identities.
3. **Public Resources**:
   - Provides generated audio and video samples, file lists, subset divisions, vulnerability analysis source code, and Jupyter notebooks containing complete results and charts, so that researchers can use and verify the database transparently.

### Summary

By creating the high-quality audio-visual deepfake database SWAN-DF, this paper evaluates the vulnerability of existing identity recognition systems to deepfakes and demonstrates the significant threat deepfakes pose to these systems. The research results highlight the urgency of developing more effective deepfake detection methods.
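To make the "success rate of over 90%" figure concrete, the following is a minimal sketch of how a spoof acceptance rate could be computed from verification scores. It assumes cosine-similarity scoring between embeddings and a decision threshold fixed at a chosen false match rate on zero-effort impostor scores; the function names, the 0.1% operating point, and the synthetic scores are illustrative assumptions, not details taken from the paper or its released code.

```python
import numpy as np


def cosine_scores(probes, reference):
    """Cosine similarity of each probe embedding (one per row) to a reference embedding."""
    probes = np.asarray(probes, dtype=float)
    reference = np.asarray(reference, dtype=float)
    probes = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    reference = reference / np.linalg.norm(reference)
    return probes @ reference


def threshold_at_fmr(impostor_scores, target_fmr):
    """Pick a threshold so that at most target_fmr of impostor scores exceed it."""
    ranked = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]  # descending
    allowed = int(np.floor(target_fmr * len(ranked)))  # false matches tolerated
    allowed = min(allowed, len(ranked) - 1)
    return float(ranked[allowed])


def acceptance_rate(scores, threshold):
    """Fraction of scores strictly above the threshold, i.e. accepted by the verifier."""
    return float(np.mean(np.asarray(scores, dtype=float) > threshold))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for verification scores; NOT data from the paper.
    impostor = rng.normal(0.1, 0.1, 10_000)  # different-person comparisons
    deepfake = rng.normal(0.6, 0.1, 1_000)   # deepfake vs. impersonated target
    thr = threshold_at_fmr(impostor, target_fmr=0.001)  # hypothetical operating point
    print(f"threshold: {thr:.3f}")
    print(f"impostors accepted: {acceptance_rate(impostor, thr):.4f}")
    print(f"deepfakes accepted: {acceptance_rate(deepfake, thr):.4f}")
```

In a real evaluation the scores would come from comparing embeddings (for example via `cosine_scores`) of enrolled genuine samples against impostor and deepfake probes. The threshold is fixed without looking at the deepfake scores, so the deepfake acceptance rate reflects how the system would behave at its normal operating point.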

Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes

Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors

DeepFakes: a New Threat to Face Recognition? Assessment and Detection

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

Deepfake audio detection by speaker verification

Does Audio Deepfake Detection Generalize?

Warning: Humans Cannot Reliably Detect Speech Deepfakes

Audio-Video Analysis Method of Public Speaking Videos to Detect Deepfake Threat

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

A blended framework for audio spoof detection with sequential models and bags of auditory bites

Comprehensive multiparametric analysis of human deepfake speech recognition

Can DeepFake Speech be Reliably Detected?

Voice-Face Homogeneity Tells Deepfake

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

Why Do Facial Deepfake Detectors Fail?

FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Audio-deepfake detection: Adversarial attacks and countermeasures

SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes

Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes

Human Perception of Audio Deepfakes