FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Hasam Khalid, Shahroz Tariq, Minha Kim, Simon S. Woo
DOI: https://doi.org/10.48550/arXiv.2108.05080
2022-03-01
Abstract: While significant advancements have been made in the generation of deepfakes using deep learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe security and privacy issues, as they can be used to impersonate a person's identity in a video by replacing his or her face with another person's face. Recently, a new problem has been emerging: AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audio and video, a new generation of deepfake detectors is needed that focuses on video and audio collectively. Developing a competent deepfake detector typically requires a large amount of high-quality data that captures real-world (or practical) scenarios. Existing deepfake datasets contain either deepfake videos or deepfake audio, and they are racially biased as well. As a result, it is critical to develop a high-quality video and audio deepfake dataset that can be used to detect audio and video deepfakes simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset, FakeAVCeleb, which contains not only deepfake videos but also the respective synthesized lip-synced fake audio. We generate this dataset using the most popular deepfake generation methods. We selected real YouTube videos of celebrities with four ethnic backgrounds to develop a more realistic multimodal dataset that addresses racial bias and further helps develop multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and to demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.
Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the shortcomings of current deepfake detection datasets, in particular the lack of high-quality deepfake datasets that contain both audio and video. Existing deepfake datasets contain either only video or only audio, and they suffer from ethnic bias. These problems limit the development of multimodal deepfake detection methods. The paper therefore proposes a new audio-video multimodal deepfake dataset, FakeAVCeleb, which aims to solve these problems and promote the development of more efficient and comprehensive deepfake detection technologies. Specifically, the paper points out:

1. **Absence of multimodal datasets**: At present, most deepfake datasets focus only on generating realistic deepfake videos and ignore the generation of corresponding fake audio. This limitation hinders the development of multimodal detection methods that can detect audio and video deepfakes simultaneously.
2. **Ethnic bias**: Some existing deepfake datasets are biased in ethnic representation, which may affect the generalization ability of detection models.
3. **Need for high-quality data**: Developing efficient deepfake detection methods requires a large amount of high-quality data that captures real-world scenarios.

For these reasons, the paper proposes the FakeAVCeleb dataset. The dataset contains not only deepfake videos but also the corresponding lip-synced synthesized fake audio. It draws on real YouTube videos of celebrities from four ethnic backgrounds to reduce ethnic bias and to help develop multimodal deepfake detectors. Using the most popular deepfake generation methods, the paper generates a dataset containing 20,000 samples and carries out a variety of experiments with state-of-the-art detection methods to evaluate the dataset's usefulness and the challenges it poses.