TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

Beilong Tang, Bang Zeng, Ming Li
2024-09-17
Abstract: We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses **Target Speaker Extraction (TSE)**: extracting only the target speaker's voice from a mixture of several speakers, rather than separating the voices of everyone in the mixture. Unlike blind speech separation, TSE uses auxiliary information to focus on a specific target speaker.

### Main problems and challenges

1. **Limitations of existing models**:
   - **Discriminative models**: These models usually use a masking strategy to directly minimize the distance between the estimated signal and the clean speech signal. However, they generalize poorly to unseen data and may introduce unwanted distortion.
   - **Generative models**: Generative models can learn the distribution of the target speaker's voice and use it to generate clean speech, but they have received relatively little study so far, especially with discretized audio representations.
2. **Insufficient application of discretized audio representations**:
   - Discretized audio representations simplify audio generation by converting audio into discrete tokens, which turns a complex regression problem into a classification task. Although this approach performs well in other tasks, it has seen little use in target speaker extraction.
3. **Room for improvement in existing methods**:
   - For example, SkiM-UniCATS is one of the earliest methods to use discrete tokens in TSE, but it does not exploit the advantages of the WavLM model and mainly relies on a single-layer output. In addition, its evaluation is limited to speech quality and ignores important metrics such as intelligibility and speaker similarity.
### Solutions

The paper proposes a new framework named **TSELM (Target Speaker Extraction using Discrete Tokens and Language Models)** to address the above problems:

- **Encoding stage**: Use the pre-trained self-supervised learning (SSL) model WavLM to encode the reference speech and the mixed speech, and discretize the continuous representations into tokens with the K-means algorithm.
- **Modeling stage**: Use a cross-attention mechanism to integrate the target speaker's information, and a language model to capture sequence dependencies.
- **Decoding stage**: Use the scalable HiFi-GAN to reconstruct audio from the discrete tokens.

Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.

### Formula summary

- **Discretization process**:
  \[ d = \text{Kmeans}(r) \]
  where \( r \) is the continuous representation output by WavLM, and \( d \) is the discretized token.
- **Target speaker fusion (FiLM)**:
  \[ E_f = \text{FiLM}(E_m, E_{spk}) = \gamma(E_{spk}) \cdot E_m + \beta(E_{spk}) \]
  where \( E_m \) is the embedding of the mixed speech, \( E_{spk} \) is the embedding of the reference speech, and \( \gamma \) and \( \beta \) are learnable functions of \( E_{spk} \) that scale and shift \( E_m \).
- **Loss function**:
  \[ \mathcal{L} = \text{CrossEntropy}(\hat{d}, d_{\text{clean}}) \]
  where \( \hat{d} \) is the predicted token and \( d_{\text{clean}} \) is the ground-truth token obtained by discretizing the clean audio.

With these components, TSELM offers a novel and effective approach to target speaker extraction.
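
The three formulas in the summary above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes the K-means centroids are already trained, so discretization reduces to a nearest-centroid lookup, and it takes the FiLM scale and shift vectors as given, whereas a real model would compute them from \( E_{spk} \) with learned layers. The function names (`quantize`, `film`, `cross_entropy`) and all numeric values are hypothetical toy choices.

```python
import math


def quantize(frames, centroids):
    """d = Kmeans(r): map each continuous frame to the index of its nearest centroid."""
    tokens = []
    for frame in frames:
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def film(e_m, gamma, beta):
    """FiLM for one frame: element-wise scale and shift of the mixture embedding.

    Assumption: gamma and beta are passed in directly; in a real model they
    would be produced from the speaker embedding E_spk by learned layers.
    """
    return [g * x + b for g, x, b in zip(gamma, e_m, beta)]


def cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits and target token ids."""
    total = 0.0
    for frame_logits, target in zip(logits, targets):
        peak = max(frame_logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in frame_logits))
        total += log_z - frame_logits[target]
    return total / len(targets)


# Toy codebook with three 2-dimensional centroids (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Clean-speech frames are discretized to obtain the training targets d_clean.
clean_frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8]]
d_clean = quantize(clean_frames, centroids)

# One mixture frame modulated by speaker-dependent scale and shift.
fused = film([0.5, -1.0], gamma=[2.0, 0.5], beta=[0.1, 0.0])

# A model that predicts all three tokens as equally likely for every frame.
uniform_logits = [[0.0, 0.0, 0.0] for _ in d_clean]
loss = cross_entropy(uniform_logits, d_clean)
```

Because the targets are token indices rather than waveform samples, the training objective is an ordinary classification loss. The uniform prediction in the example gives a loss of \( \ln 3 \), the value for guessing among three tokens, which a trained model should beat.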