Abstract: While significant advancements have been made in the generation of deepfakes using deep learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe security and privacy issues, as they can be used to impersonate a person's identity in a video by replacing his/her face with another person's face. Recently, a new problem has been emerging: generating a synthesized human voice, where AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audio and video, a new generation of deepfake detectors is needed that focuses on video and audio collectively. To develop a competent deepfake detector, a large amount of high-quality data is typically required to capture real-world (or practical) scenarios. Existing deepfake datasets contain either deepfake videos or deepfake audio, and they are racially biased as well. As a result, it is critical to develop a high-quality video and audio deepfake dataset that can be used to detect both audio and video deepfakes simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset, FakeAVCeleb, which contains not only deepfake videos but also the respective synthesized lip-synced fake audio. We generate this dataset using the most popular deepfake generation methods. We selected real YouTube videos of celebrities with four ethnic backgrounds to develop a more realistic multimodal dataset that addresses racial bias and further helps develop multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the insufficiency of current deepfake detection datasets, especially the lack of high-quality deepfake datasets that contain both audio and video. Existing deepfake datasets contain either only video or only audio, and they also suffer from ethnic bias. These problems limit the development of multimodal deepfake detection methods. The paper therefore proposes a new audio-video multimodal deepfake dataset, FakeAVCeleb, aiming to address these problems and promote the development of more efficient and comprehensive deepfake detection technologies. Specifically, the paper points out:

1. **Absence of multimodal datasets**: At present, most deepfake datasets focus only on generating realistic deepfake videos while ignoring the generation of corresponding fake audio. This limitation hinders the development of multimodal detection methods that can detect both audio and video deepfakes simultaneously.
2. **Ethnic bias**: Some existing deepfake datasets are biased in ethnic representation, which may affect the generalization ability of detection models.
3. **Need for high-quality data**: Developing efficient deepfake detection methods requires a large amount of high-quality data that captures real-world scenarios.

For this reason, the paper proposes the FakeAVCeleb dataset. It contains not only deepfake videos but also the corresponding synthesized, lip-synced fake audio. It draws on real YouTube videos of celebrities from four ethnic backgrounds to reduce ethnic bias and to further help develop multimodal deepfake detectors. Using the most popular deepfake generation methods, the authors generated a dataset containing 20,000 samples and carried out a variety of experiments to evaluate its usefulness and the challenges it poses.

FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors

Hindi audio-video-Deepfake (HAV-DF): A Hindi language-based Audio-video Deepfake Dataset

AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

Video and Audio Deepfake Datasets and Open Issues in Deepfake Technology: Being Ahead of the Curve

A Multimodal Framework for Deepfake Detection

Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics

PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset

AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Multimodaltrace: Deepfake Detection using Audiovisual Representation Learning

Audio Deepfake Attribution: An Initial Dataset and Investigation

Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights

AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection

Multimodal Deepfake Detection for Short Videos

MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes

I Can Hear You: Selective Robust Training for Deepfake Audio Detection

AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection