Abstract: We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.

What problem does this paper attempt to address?

This paper addresses challenges in **Target Speaker Extraction (TSE)**. TSE aims to extract only the target speaker's voice from a mixture of several speakers, rather than separating the voices of everyone in the mixture. Unlike blind speech separation, TSE uses auxiliary information to focus on a specific target speaker.

### Main problems and challenges

1. **Limitations of existing models**:
   - **Discriminative models**: These models usually use a masking strategy to directly minimize the distance between the estimated signal and the clean speech signal. However, they generalize poorly to unseen data and may introduce unwanted distortion.
   - **Generative models**: Generative models can learn the distribution of the target speaker's voice and use it to generate clean speech, but they have received relatively little study so far, especially with discretized audio representations.
2. **Insufficient application of discretized audio representations**:
   - Discretized audio representations convert audio into discrete tokens, which turns audio generation from a complex regression problem into a classification task. Although this approach performs well in other tasks, it has seen little use in target speaker extraction.
3. **Room for improvement in existing methods**:
   - For example, SkiM-UniCATS is one of the earliest methods to use discrete tokens in TSE, but it does not exploit the advantages of the WavLM model and mainly relies on the output of a single layer. In addition, its evaluation is limited to speech quality and omits important metrics such as intelligibility and speaker similarity.
### Solutions

The paper proposes a new framework named **TSELM (Target Speaker Extraction using Discrete Tokens and Language Models)** to address the problems above:

- **Encoding stage**: Use the pre-trained self-supervised learning (SSL) model WavLM to encode the reference speech and the mixed speech, and discretize the continuous representations into tokens with the k-means algorithm.
- **Modeling stage**: Adopt a cross-attention mechanism to inject the target speaker's information, and use a language model to capture sequence dependencies.
- **Decoding stage**: Use the scalable HiFi-GAN to reconstruct audio from the discrete tokens.

According to the reported experiments, TSELM achieves excellent speech quality and comparable speech intelligibility.

### Formula summary

- **Discretization process**:
  \[ d = \text{Kmeans}(r) \]
  where \( r \) is the continuous representation output by WavLM, and \( d \) is the discretized token.
- **Cross-attention mechanism** (the fusion step is written as a FiLM operation):
  \[ E_f = \text{FiLM}(E_m, E_{spk}) = \gamma E_{spk} \cdot E_m + \beta E_{spk} \]
  where \( E_m \) is the embedding of the mixed speech, \( E_{spk} \) is the embedding of the reference speech, and \( \gamma \) and \( \beta \) are learnable parameters.
- **Loss function**:
  \[ \mathcal{L} = \text{CrossEntropy}(\hat{d}, d_{\text{clean}}) \]
  where \( \hat{d} \) is the predicted token and \( d_{\text{clean}} \) is the ground-truth token obtained by discretizing the clean audio.

With these components, TSELM offers a new approach to the target speaker extraction problem.
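The three formulas in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: random arrays stand in for WavLM features, `gamma` and `beta` are assumed to be learnable matrices applied to the speaker embedding, the way the cross-attention output feeds the FiLM step is an assumption, and a hypothetical linear head `w_out` replaces the language model. All function names are made up for this example.

```python
import numpy as np

def quantize(features, centroids):
    """d = Kmeans(r): map each frame to the index of its nearest centroid."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans_fit(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns an array of k centroids."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        tokens = quantize(features, centroids)
        for j in range(k):
            members = features[tokens == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(e_m, e_ref):
    """Mixture frames (queries) attend over reference frames (keys and values)."""
    scores = e_m @ e_ref.T / np.sqrt(e_m.shape[1])
    return softmax(scores, axis=1) @ e_ref

def film(e_m, e_spk, gamma, beta):
    """E_f = (gamma E_spk) * E_m + (beta E_spk), applied frame by frame."""
    scale = e_spk @ gamma.T
    shift = e_spk @ beta.T
    return scale * e_m + shift

def cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits and target token ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy sizes: T mixture frames, T_REF reference frames, D feature dims, K tokens.
T, T_REF, D, K = 50, 30, 8, 4
rng = np.random.default_rng(0)

# Stand-in for WavLM features of the clean target speech (assumption).
r_clean = rng.normal(size=(T, D))
centroids = kmeans_fit(r_clean, K)
d_clean = quantize(r_clean, centroids)       # target tokens d_clean

e_m = rng.normal(size=(T, D))                # mixture embedding E_m
e_ref = rng.normal(size=(T_REF, D))          # reference speech embedding
e_spk = cross_attention(e_m, e_ref)          # speaker info aligned to the mixture

gamma = rng.normal(size=(D, D)) * 0.1        # learnable in practice, random here
beta = rng.normal(size=(D, D)) * 0.1
e_f = film(e_m, e_spk, gamma, beta)          # fused embedding E_f

w_out = rng.normal(size=(D, K)) * 0.1        # hypothetical head, not a language model
logits = e_f @ w_out                         # per-frame scores over the K tokens
loss = cross_entropy(logits, d_clean)
print("tokens:", d_clean[:10], "loss:", float(loss))
```

Because the targets are token indices rather than waveform samples, the loss is an ordinary classification loss over `K` classes, which is the regression-to-classification conversion the summary describes.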

TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

SELM: Speech Enhancement Using Discrete Tokens and Language Models

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Probing Self-supervised Learning Models with Target Speech Extraction

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention

Target Speech Extraction with Pre-trained Self-supervised Learning Models

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

Multi-Level Speaker Representation for Target Speaker Extraction

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

Language-Queried Target Sound Extraction Without Parallel Training Data

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues