Abstract: In this paper, we propose an enhanced audio-visual deepfake detection method. Recent methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features. Although they have shown promising results, they are based on the maximization/minimization of isolated feature distances without considering feature statistics. Moreover, they rely on cumbersome deep learning architectures and are heavily dependent on empirically fixed hyperparameters. Herein, to overcome these limitations, we propose: (1) a statistical feature loss to enhance the discrimination capability of the model, instead of relying solely on feature distances; (2) using the waveform for describing the audio as a replacement for frequency-based representations; (3) a post-processing normalization of the fakeness score; (4) the use of a shallower network for reducing the computational complexity. Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing
### What problems does this paper attempt to solve?
This paper addresses several key problems in current audio-visual deepfake detection methods. Specifically, the authors point out the following deficiencies of existing methods:
1. **Ignoring feature statistics**: Existing audio-visual deepfake detection methods mainly rely on isolated distance metrics between audio and visual features and ignore the overall statistics of those features. This may limit how well they distinguish real from fake data.
2. **Dependence on empirical hyperparameters**: Existing methods usually rely on a large number of empirically fixed hyperparameters, which makes tuning more complex and less predictable.
3. **High computational cost**: Many existing audio-visual models use very deep network architectures, which leads to high computational costs and makes them unsuitable for real-time detection in practical applications.
4. **Limitations of audio representations**: Most methods adopt frequency-based audio representations (such as mel-spectrograms), which may discard certain discriminative cues and require an additional hyperparameter tuning step.
To solve these problems, the authors propose a new method named **Statistics-aware Audio-visual Deepfake Detector (SADD)**. The main improvements of this method include:
- **Introducing a statistics-aware loss**: In addition to the traditional feature-distance loss, a new loss function evaluates the distance between the first-order statistics (means) of the audio and visual feature distributions. For fake data, this distance is maximized; for real data, it is minimized.
- **Using waveforms instead of frequency-domain representations**: The audio input is changed from a frequency-domain representation (such as a mel-spectrogram) to the raw waveform, which avoids potential limitations introduced by the conversion and simplifies the model structure.
- **Shallow network architecture**: A shallower network architecture is adopted to reduce computational complexity and to model deepfake artifacts using low-level features.
- **Post-processing normalization**: A post-processing strategy normalizes the fakeness scores, eliminating the need to set classification thresholds manually.
Through these improvements, the authors aim to significantly reduce computational costs and improve the robustness and generalization ability of the model while maintaining high detection performance. Experimental results show that SADD outperforms existing methods on the DFDC and FakeAVCeleb datasets.
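The summary does not say which normalization SADD applies to the fakeness score. As a rough illustration only, the sketch below assumes min-max scaling over a set of scores, so that a fixed cut-off such as 0.5 could be used in place of a hand-tuned threshold. The function name `normalize_scores` and the choice of min-max scaling are assumptions, not the authors' procedure.

```python
import numpy as np

def normalize_scores(scores):
    """Min-max scale raw fakeness scores to [0, 1].

    NOTE: min-max scaling is an illustrative assumption; the source does
    not specify the normalization actually used.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All scores identical: no spread to normalize, return zeros.
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

# Hypothetical raw scores for three videos.
normalized = normalize_scores([2.0, 4.0, 6.0])
```

With scores mapped to a common range, the decision rule no longer depends on the raw scale of the audio-visual distance, which is one plausible reading of "eliminating the need to set classification thresholds manually".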
### Formula summary
- **Feature extraction**:
\[
f_a = A(I_a), \quad f_v = V(I_v)
\]
where \( f_a \) and \( f_v \) are the feature vectors extracted from the audio waveform \( I_a \) and the image sequence \( I_v \), respectively.
- **Classification output**:
\[
y_a = C_a(f_a), \quad y_v = C_v(f_v)
\]
where \( y_a \) and \( y_v \) are the classification predictions of the audio and visual branches, respectively.
- **Total loss function**:
\[
L = L_v + L_a + L_c + \alpha L_s
\]
where \( L_v \) and \( L_a \) are cross-entropy losses, \( L_c \) is a contrastive loss, \( L_s \) is a statistics-aware loss, and \(\alpha\) is a weight hyperparameter.
- **Contrastive loss**:
\[
L_c =
\begin{cases}
(d_f)^2 & \text{if } y \text{ is real}, \\
(\max(m - d_f, 0))^2 & \text{if } y \text{ is fake},
\end{cases}
\]
where \( m = 0.99 \) is the margin and \( d_f \) is the L2 distance between \( f_v \) and \( f_a \), which the loss then squares.
- **Statistics-aware loss**:
\[
L_s = \frac{1}{n}\sum_{i = 1}^{n}\vert\mu_{a, i}-\mu_{v, i}\vert
\]
where \( n \) is the number of features, \(\mu_{a, i}\) is the \( i\)-th mean of the audio feature distribution, and \(\mu_{v, i}\) is the \( i\)-th mean of the visual feature distribution.
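The formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: \( d_f \) is taken as the plain Euclidean distance (squared only inside the loss), losses are averaged over the batch, the means \(\mu_{a,i}\) and \(\mu_{v,i}\) are computed over the sample axis, and all function names are hypothetical. The summary gives no value for \(\alpha\), so it is a required argument. The code implements only the distance in \( L_s \) as written; how it is maximized for fake data and minimized for real data is not specified here.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (batch, classes); labels: (batch,) ints."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row maximum for numerical stability before the log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def contrastive_loss(f_v, f_a, is_real, margin=0.99):
    """L_c: pull real pairs together, push fake pairs at least `margin` apart."""
    f_v = np.asarray(f_v, dtype=float)
    f_a = np.asarray(f_a, dtype=float)
    is_real = np.asarray(is_real, dtype=bool)
    # Assumption: d_f is the (unsquared) L2 distance per sample.
    d_f = np.linalg.norm(f_v - f_a, axis=1)
    real_term = d_f ** 2
    fake_term = np.maximum(margin - d_f, 0.0) ** 2
    # Assumption: average the per-sample losses over the batch.
    return float(np.mean(np.where(is_real, real_term, fake_term)))

def statistics_aware_loss(f_a, f_v):
    """L_s: mean absolute difference between per-feature means."""
    # Assumption: means are taken over axis 0 (the samples).
    mu_a = np.asarray(f_a, dtype=float).mean(axis=0)
    mu_v = np.asarray(f_v, dtype=float).mean(axis=0)
    return float(np.mean(np.abs(mu_a - mu_v)))

def total_loss(l_v, l_a, l_c, l_s, alpha):
    """L = L_v + L_a + L_c + alpha * L_s."""
    return l_v + l_a + l_c + alpha * l_s
```

In a training loop, `cross_entropy` would be evaluated once on the visual predictions \( y_v \) and once on the audio predictions \( y_a \), and the four terms would then be combined with `total_loss`.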