A Method of Audio-Visual Person Verification by Mining Connections Between Time Series

Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu
DOI: https://doi.org/10.21437/interspeech.2023-394
2023-01-01
Abstract: Audio-visual embeddings have been observed to be more robust than unimodal embeddings for person verification. However, the relationship between keyframes in the time series of the two modalities appears to be unexplored. We therefore propose a novel audio-visual strategy that considers connections between time series from a generative perspective. First, we introduce weight-enhanced attentive statistics pooling to increase the salience of the keyframe weights. Then, we propose joint attentive pooling that incorporates three popular generative supervision models. Finally, the modalities are fused with a gated attention mechanism to obtain a robust embedding. All proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.14%, 0.21%, and 0.37% EER on the three official trial lists of VoxCeleb1, respectively, which to our knowledge are the best published results for person verification.
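For orientation, the two generic building blocks named in the abstract, attentive statistics pooling and gated fusion, can be sketched as below. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it shows plain attentive statistics pooling rather than the weight-enhanced variant (whose details the abstract does not give), it omits joint attentive pooling and the generative supervision entirely, and all function names, parameter names, and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_stats_pooling(frames, W, b, v):
    """Plain attentive statistics pooling (assumed baseline form).

    frames: (T, D) frame-level features; W: (D, H); b: (H,); v: (H,).
    Returns a (2*D,) utterance vector and the (T,) frame weights.
    """
    scores = np.tanh(frames @ W + b) @ v       # one scalar score per frame
    alpha = softmax(scores)                    # frame weights, sum to 1
    mean = alpha @ frames                      # weighted mean, (D,)
    var = alpha @ (frames ** 2) - mean ** 2    # weighted variance, (D,)
    std = np.sqrt(np.clip(var, 1e-9, None))    # clip guards against tiny negatives
    return np.concatenate([mean, std]), alpha

def gated_fusion(audio_emb, visual_emb, Wg, bg):
    """Illustrative gated fusion of two (E,) embeddings; Wg: (2*E, E), bg: (E,)."""
    joint = np.concatenate([audio_emb, visual_emb])
    gate = 1.0 / (1.0 + np.exp(-(joint @ Wg + bg)))   # per-dimension gate in (0, 1)
    return gate * audio_emb + (1.0 - gate) * visual_emb
```

In this sketch, frames with larger attention scores contribute more to the pooled mean and standard deviation, and the gate decides per dimension how much of each modality enters the fused embedding. With all-zero attention and gate parameters, the pooling reduces to the ordinary mean and standard deviation, and the fusion reduces to a plain average of the two embeddings.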