Abstract: While significant advancements have been made in the generation of deepfakes using deep learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe security and privacy problems, as they can be used to impersonate a person's identity in a video by replacing his or her face with another person's face. Recently, a new problem has emerged: AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audio and video, a new generation of deepfake detectors is needed that considers video and audio jointly. Developing a competent deepfake detector typically requires a large amount of high-quality data that captures real-world (or practical) scenarios. Existing deepfake datasets contain either deepfake videos or deepfake audio, but not both, and they are racially biased as well. As a result, it is critical to develop a high-quality video and audio deepfake dataset that can be used to detect audio and video deepfakes simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset, FakeAVCeleb, which contains not only deepfake videos but also the corresponding synthesized, lip-synced fake audio. We generate this dataset using the most popular deepfake generation methods. We select real YouTube videos of celebrities from four ethnic backgrounds to develop a more realistic multimodal dataset that addresses racial bias and further helps in developing multimodal deepfake detectors. We perform several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and to demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.

FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset