Abstract: Learning an effective speaker representation is crucial for reliable performance in speaker verification. Speech signals are long, high-dimensional, variable-length sequences that carry diverse information at each time-frequency (TF) location. A standard convolutional layer operates only on neighboring local regions and therefore often fails to capture complex global TF information. We address this limitation by increasing modeling capacity, emphasizing significant information, and suppressing possible redundancies. Our aim is a more robust and efficient speaker recognition system that combines attention mechanisms with Discrete Cosine Transform (DCT) based signal processing to represent the global information in speech signals effectively. To this end, we propose a general global time-frequency context modeling block for speaker modeling. First, an attention-based context model is introduced to capture long-range, non-local relationships across different time-frequency locations. Second, a 2D-DCT based context model is proposed to improve model efficiency and to examine the benefits of signal modeling. A multi-DCT attention mechanism is then presented to improve modeling power by using alternative DCT basis forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and the local features. The proposed block improves speaker verification performance by a large margin over both the standard ResNet model and the Squeeze & Excitation block. Our experimental results show that the proposed global context modeling method improves the learned speaker representations by achieving both channel-wise and time-frequency feature recalibration.
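The two ideas named in the abstract, a 2D-DCT based global context and similarity-based recalibration of time-frequency locations, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`dct_basis`, `dct_global_context`, `recalibrate`), the single-basis pooling, the scaled dot-product similarity, and the sigmoid gate are all assumptions chosen for illustration, and the learned projections that a real block would contain are omitted.

```python
import numpy as np

def dct_basis(n_freq, n_time, u, v):
    """2D DCT-II basis function of order (u, v) on an n_freq x n_time grid."""
    f = np.arange(n_freq)[:, None]
    t = np.arange(n_time)[None, :]
    return (np.cos(np.pi * (f + 0.5) * u / n_freq)
            * np.cos(np.pi * (t + 0.5) * v / n_time))

def dct_global_context(x, u=0, v=0):
    """Pool a (C, F, T) feature map into a (C,) context vector.

    Each channel is projected onto one DCT basis function. With
    u = v = 0 the basis is constant, so this reduces to global
    average pooling (the squeeze step of Squeeze & Excitation).
    """
    _, n_freq, n_time = x.shape
    basis = dct_basis(n_freq, n_time, u, v)
    return np.tensordot(x, basis, axes=([1, 2], [0, 1])) / (n_freq * n_time)

def recalibrate(x, context):
    """Gate each TF location by its similarity to the global context.

    Assumption: similarity is a scaled dot product over channels,
    squashed to (0, 1) with a sigmoid.
    """
    sim = np.einsum("c,cft->ft", context, x) / np.sqrt(x.shape[0])
    mask = 1.0 / (1.0 + np.exp(-sim))
    return x * mask[None, :, :], mask

# Hypothetical feature map: 4 channels, 8 frequency bins, 10 frames.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 10))
context = dct_global_context(x, u=0, v=0)
y, mask = recalibrate(x, context)
```

Using several `(u, v)` pairs and combining the resulting context vectors would be one way to approximate the multi-DCT idea, although the abstract does not specify how the different basis forms are combined.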

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Multi-Level Speaker Representation for Target Speaker Extraction

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

DENSE: Dynamic Embedding Causal Target Speech Extraction

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

Attention and DCT based Global Context Modeling for Text-independent Speaker Recognition

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Target conversation extraction: Source separation using turn-taking dynamics

Selective Listening by Synchronizing Speech with Lips

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Boosting the Performance of SpEx plus by Attention and Contextual Mechanism

Binaural Selective Attention Model for Target Speaker Extraction

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation