Benchmarking Cross-Domain Audio-Visual Deception Detection
Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot
2024-10-05
Abstract: Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies, namely domain-simultaneous, domain-alternating, and domain-by-domain, for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance. We believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
Sound, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the poor understanding of how well audio-visual deception detection generalizes across domains. The generalization performance of existing audio-visual deception detection methods across different scenarios has not been fully explored, especially when there are significant domain differences between datasets. To address this, the authors propose a new cross-domain audio-visual deception detection benchmark for evaluating how well these methods generalize to different real-world scenarios.
### Main Research Questions
1. **Cross-Domain Generalization Ability**: The datasets used for audio-visual deception detection differ significantly by domain, both visually (resolution, illumination, pose, etc.) and acoustically (pitch, loudness, noise, etc.). As a result, models trained on one dataset generalize poorly to other scenarios. The paper aims to evaluate and improve model generalization by introducing a cross-domain benchmark.
2. **Multimodal Fusion**: Handling domain differences requires fusing audio and visual information effectively. The paper examines multimodal fusion methods, including the Attention-Mixer fusion method, which is based on MLP-Mixer and a self-attention mechanism, to improve detection performance.
3. **Domain Sampling Strategy**: To make better use of data from multiple source domains, the paper investigates three domain sampling strategies: domain-simultaneous, domain-alternating, and domain-by-domain. The goal is to evaluate how the choice of sampling strategy affects generalization performance.
4. **Gradient Matching Algorithm**: To further improve generalization from multiple source domains to a single target domain, the paper proposes a new algorithm, Multimodal Inter-Domain Gradient Matching (MM-IDGM). It aligns the gradient directions of different domains by maximizing the inner product of gradients between modality encoders, thereby enhancing generalization.
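The three sampling strategies in item 3 can be pictured with a small sketch. The summary gives only the strategy names, so the batch-level behaviour below is an assumed reading of those names rather than the paper's definition, and every function and variable name is hypothetical.

```python
from itertools import zip_longest


def batches(samples, batch_size):
    """Split one domain's samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def domain_simultaneous(domains, batch_size):
    """Assumed reading: every step mixes one batch from each source domain."""
    per_domain = [batches(s, batch_size) for s in domains.values()]
    for group in zip(*per_domain):  # stops at the shortest domain
        yield [x for b in group for x in b]


def domain_alternating(domains, batch_size):
    """Assumed reading: steps cycle through the domains, one domain per step."""
    per_domain = [batches(s, batch_size) for s in domains.values()]
    for group in zip_longest(*per_domain):
        for b in group:
            if b is not None:  # a shorter domain has run out of batches
                yield b


def domain_by_domain(domains, batch_size):
    """Assumed reading: one domain is used up before the next one starts."""
    for s in domains.values():
        yield from batches(s, batch_size)


# Toy source domains; real samples would be audio-visual clips.
domains = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2", "b3", "b4"]}

for name, strategy in [("simultaneous", domain_simultaneous),
                       ("alternating", domain_alternating),
                       ("by-domain", domain_by_domain)]:
    print(name, list(strategy(domains, 2)))
```

Under this reading, the strategies differ only in the order and mixing of source-domain batches that the model sees, which is the factor the benchmark varies in its multi-to-single evaluation.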
### Solutions
- **Cross-Domain Benchmark**: A cross-domain audio-visual deception detection benchmark is constructed and evaluated using widely adopted audio and visual features and different architectures.
- **Domain Sampling Strategy**: Three domain sampling strategies are compared to evaluate their impact on generalization from multiple source domains to a single target domain.
- **MM-IDGM Algorithm**: A new gradient matching algorithm is proposed to improve generalization performance by maximizing the inner product of gradients between modality encoders.
- **Attention-Mixer Fusion Method**: A new fusion method is introduced that combines MLP-Mixer and a self-attention mechanism to capture interactions in multimodal data more effectively.
With these methods, the paper aims to improve the generalization and detection performance of audio-visual deception detection models across scenarios, and to provide a benchmark that supports future cross-domain deception detection research.
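The quantity at the centre of MM-IDGM, a gradient inner product, can be made concrete with a toy calculation. This is not the MM-IDGM algorithm, whose update rule the summary does not give. It only shows, for a hypothetical linear model with a squared-error loss, that the inner product of two domains' gradients is positive when their updates agree and negative when they conflict.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for a linear model y_hat = w . x."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / len(xs)
    return g


def inner(u, v):
    """Inner product of two gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical shared parameters and three toy "domains" (inputs, targets).
w = [0.0, 0.0]
domain_a = ([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
domain_b = ([[1.0, 1.0]], [2.0])    # pulls w the same way as domain_a
domain_c = ([[1.0, 1.0]], [-2.0])   # pulls w the opposite way

g_a = grad_mse(w, *domain_a)
g_b = grad_mse(w, *domain_b)
g_c = grad_mse(w, *domain_c)

print(inner(g_a, g_b))  # positive: the two domains agree on the update
print(inner(g_a, g_c))  # negative: the two domains conflict
```

A training objective that rewards a larger inner product favours parameter updates that help all source domains at once, which is the intuition behind gradient matching for domain generalization.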