Exploring the Role of Audio in Multimodal Misinformation Detection

Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Guanjun Li
2024-08-23
Abstract: With the rapid development of deepfake technology, especially audio deepfake technology, misinformation detection on social media faces a great challenge. Social media data often contains multimodal information, including audio, video, text, and images. However, existing multimodal misinformation detection methods tend to focus on only some of these modalities and fail to address information from all of them. To handle the various modalities that may appear on social media, this paper constructs a comprehensive multimodal misinformation detection framework. By employing a corresponding neural network encoder for each modality, the framework can fuse information from different modalities and support the multimodal misinformation detection task. Based on this framework, the paper explores the importance of the audio modality in multimodal misinformation detection on social media. By adjusting the architecture of the acoustic encoder, it investigates the effectiveness of different acoustic feature encoders for this task. Furthermore, the paper finds that audio and video information must be carefully aligned; otherwise, misalignment between the audio and video modalities can severely impair model performance.
Multimedia
What problem does this paper attempt to address?
This paper addresses the challenge that the rapid development of deepfake technology, especially audio deepfake technology, poses for multimodal misinformation detection on social media. Existing multimodal misinformation detection methods usually focus on only some modalities and do not process information from all of them, which leads to unsatisfactory detection performance. The paper therefore constructs a comprehensive multimodal misinformation detection framework that uses a neural network encoder for each modality, fuses the information from the different modalities, and supports the multimodal misinformation detection task. In particular, the paper explores the importance of the audio modality in multimodal misinformation detection and studies the effectiveness of different acoustic feature encoders. It also finds that audio and video information must be carefully aligned; otherwise, misalignment between the two modalities can seriously degrade model performance.
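
The per-modality encoding and fusion described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary does not state the encoder architectures, the embedding sizes, or the fusion method, so the single-layer stand-in encoders, the dimensions in `EMBED_DIMS` and `input_dims`, the concatenation-based `fuse`, and the logistic `classify` head are all assumptions chosen for brevity.

```python
import math
import random

# Hypothetical embedding size per modality; the paper does not specify these.
EMBED_DIMS = {"audio": 4, "video": 4, "text": 4, "image": 4}


def make_encoder(in_dim, out_dim, rng):
    """Return a stand-in encoder: one random linear layer followed by tanh.

    A real system would use a trained neural network per modality.
    """
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(features):
        return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]

    return encode


def fuse(embeddings):
    """Fuse by concatenation in a fixed (sorted) modality order.

    Concatenation is an assumed fusion strategy, used here only for illustration.
    """
    fused = []
    for name in sorted(embeddings):
        fused.extend(embeddings[name])
    return fused


def classify(fused, weights, bias=0.0):
    """Logistic head: a score in (0, 1) read as the probability of misinformation."""
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))


rng = random.Random(0)

# Hypothetical raw feature sizes for one social media post.
input_dims = {"audio": 6, "video": 8, "text": 5, "image": 7}
encoders = {m: make_encoder(input_dims[m], EMBED_DIMS[m], rng) for m in input_dims}

# Random features stand in for real audio, video, text, and image inputs.
sample = {m: [rng.uniform(-1, 1) for _ in range(d)] for m, d in input_dims.items()}

embeddings = {m: encoders[m](sample[m]) for m in sample}
fused = fuse(embeddings)
head = [rng.uniform(-1, 1) for _ in range(len(fused))]
prob = classify(fused, head)
```

In this sketch, swapping the audio entry in `encoders` for a different function is the analogue of changing the acoustic encoder while leaving the other modalities untouched.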