Abstract: Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies, namely domain-simultaneous, domain-alternating, and domain-by-domain, for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", to enhance the generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
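
The abstract names the three domain sampling strategies without defining them. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes "domain-simultaneous" draws batches from the pooled source domains, "domain-alternating" rotates single-domain batches, and "domain-by-domain" exhausts one domain before starting the next. The function names and the toy data are hypothetical.

```python
import random


def batches(samples, batch_size):
    """Split a list into consecutive batches (the last one may be shorter)."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def domain_simultaneous(domains, batch_size, seed=0):
    """Assumed reading: pool all source domains and shuffle, so a batch may mix domains."""
    rng = random.Random(seed)
    pooled = [s for samples in domains.values() for s in samples]
    rng.shuffle(pooled)
    return batches(pooled, batch_size)


def domain_alternating(domains, batch_size):
    """Assumed reading: each batch is single-domain; consecutive batches rotate domains."""
    per_domain = [batches(samples, batch_size) for samples in domains.values()]
    ordered = []
    for i in range(max(len(b) for b in per_domain)):
        for domain_batches in per_domain:
            if i < len(domain_batches):
                ordered.append(domain_batches[i])
    return ordered


def domain_by_domain(domains, batch_size):
    """Assumed reading: consume every batch of one domain before the next domain."""
    return [b for samples in domains.values() for b in batches(samples, batch_size)]


# Toy data: each sample is a (domain_name, index) pair standing in for a clip.
toy_domains = {
    "A": [("A", i) for i in range(4)],
    "B": [("B", i) for i in range(4)],
}

for strategy in (domain_simultaneous, domain_alternating, domain_by_domain):
    order = [[name for name, _ in batch] for batch in strategy(toy_domains, 2)]
    print(strategy.__name__, order)
```

All three functions visit the same samples; only the order in which domains reach the model differs, which is the variable the multi-to-single evaluation isolates.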

What problem does this paper attempt to address?

The paper addresses the cross-domain generalization of audio-visual deception detection. The generalization performance of existing audio-visual deception detection methods across scenarios has not been fully explored, especially when the datasets involved differ significantly in domain. To close this gap, the authors propose a new cross-domain audio-visual deception detection benchmark that evaluates how well these methods generalize to different real-world scenarios.

### Main Research Questions

1. **Cross-Domain Generalization Ability**: Datasets differ significantly in visual properties (such as resolution, illumination, and pose) and audio properties (such as pitch, loudness, and noise), which limits how well existing audio-visual deception detection models generalize across scenarios. The paper aims to evaluate and improve this generalization ability by introducing a cross-domain benchmark.
2. **Multimodal Fusion**: To deal with domain differences, audio and visual information must be fused effectively. The paper proposes several multimodal fusion methods, including the Attention-Mixer fusion method based on MLP-Mixer and the self-attention mechanism, to improve detection performance.
3. **Domain Sampling Strategy**: To make better use of data from multiple source domains, the paper investigates three domain sampling strategies: domain-simultaneous, domain-alternating, and domain-by-domain. The goal is to evaluate how the sampling method affects the generalization performance of the model.
4. **Gradient Matching Algorithm**: To further improve generalization from multiple source domains to a single target domain, the paper proposes a new algorithm, Multimodal Inter-Domain Gradient Matching (MM-IDGM). It aligns the gradient directions of different domains by maximizing the inner product of gradients between modality encoders, thereby enhancing the generalization ability of the model.

### Solutions

- **Cross-Domain Benchmark**: A cross-domain audio-visual deception detection benchmark is constructed and evaluated with widely used audio and visual features and different architectures.
- **Domain Sampling Strategy**: Three domain sampling strategies are compared to measure their impact on generalization from multiple source domains to a single target domain.
- **MM-IDGM Algorithm**: A new gradient matching algorithm improves generalization performance by maximizing the inner product of gradients between modality encoders.
- **Attention-Mixer Fusion Method**: A new fusion method combines MLP-Mixer and the self-attention mechanism to capture the interaction information in multimodal data more effectively.

With these methods, the paper aims to improve the generalization ability and detection performance of audio-visual deception detection models across scenarios, and to provide tools and support for future cross-domain deception detection research.
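
The gradient inner product mentioned above can be made concrete with a toy calculation. This is a hedged sketch of the quantity that gradient-matching objectives try to increase, not the MM-IDGM algorithm itself: the linear model, the squared-error loss, and the helper names `mse_gradient` and `gradient_agreement` are assumptions for illustration, and how the paper combines gradients across modality encoders is not specified here.

```python
from itertools import combinations


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mse_gradient(weights, features, targets):
    """Gradient of the mean squared error of a linear model y = w . x."""
    n = len(features)
    grad = [0.0] * len(weights)
    for x, y in zip(features, targets):
        err = dot(weights, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return grad


def gradient_agreement(grads):
    """Sum of pairwise inner products between per-domain gradient vectors."""
    return sum(dot(g1, g2) for g1, g2 in combinations(grads, 2))


# Hypothetical per-domain data for a single encoder with two weights.
weights = [0.0, 0.0]
grad_a = mse_gradient(weights, [[1.0, 0.0]], [1.0])   # domain A
grad_b = mse_gradient(weights, [[1.0, 1.0]], [1.0])   # domain B, similar labels
grad_c = mse_gradient(weights, [[1.0, 0.0]], [-1.0])  # domain C, opposite label

print(gradient_agreement([grad_a, grad_b]))  # positive: the updates agree
print(gradient_agreement([grad_a, grad_c]))  # negative: the updates conflict
```

A positive value means a step that reduces the loss on one domain also tends to reduce it on the other; a negative value means the domains pull the weights in opposing directions. Maximizing this quantity favors updates that help all source domains at once.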

Benchmarking Cross-Domain Audio-Visual Deception Detection

Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning

Transferring Audio Deepfake Detection Capability Across Languages

Deception detection using multimodal fusion approaches

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Advancing Automated Deception Detection: A Multimodal Approach to Feature Extraction and Analysis

Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction

Constructing Robust Emotional State-based Feature with a Novel Voting Scheme for Multi-modal Deception Detection in Videos

Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer

Deception Detection from Linguistic and Physiological Data Streams Using Bimodal Convolutional Neural Networks

Video-Audio Domain Generalization Via Confounder Disentanglement

Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Benchmarking Joint Face Spoofing and Forgery Detection with Visual and Physiological Cues

Introducing Representations of Facial Affect in Automated Multimodal Deception Detection

Affect-Aware Deep Belief Network Representations for Multimodal Unsupervised Deception Detection

Fine-Grained Question-Level Deception Detection Via Graph-Based Learning and Cross-Modal Fusion

Cross-domain deception detection using support vector networks

Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features

Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol