Abstract: Target Speaker Extraction (TSE) aims to extract the clean speech of a target speaker from an audio mixture, eliminating irrelevant background noise and interfering speech. Prior work has explored various auxiliary cues, including pre-recorded speech, visual information (e.g., lip motions and gestures), and spatial information, but acquiring and selecting such strong cues is infeasible in many practical scenarios. Unlike existing work, this paper conditions the TSE algorithm on semantic cues extracted from limited and unaligned text content, such as condensed points from a presentation slide. This method is particularly useful in scenarios like meetings, poster sessions, or lecture presentations, where acquiring other cues in real time is challenging. To this end, we design two different networks. The proposed TPE fuses audio features with content-based semantic cues to facilitate time-frequency mask generation that filters out extraneous noise, while the second proposal, TSR, employs contrastive learning to associate blindly separated speech signals with semantic cues. Experimental results show that semantic cues derived from limited and unaligned text are effective for accurately identifying the target speaker, yielding an SI-SDRi of 12.16 dB, SDRi of 12.66 dB, PESQi of 0.830, and STOIi of 0.150. The dataset and source code will be made publicly available. Project demo page: https://slideTSE.github.io/
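The abstract reports SI-SDRi without defining it. As a minimal sketch, assuming the commonly used definition of scale-invariant signal-to-distortion ratio (SI-SDR) and taking the improvement as the SI-SDR of the extracted signal minus that of the unprocessed mixture, the metric can be computed as below. The function names and the synthetic signals are illustrative assumptions and are not taken from the paper's code.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything in the estimate that is not the scaled target counts as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Synthetic stand-ins for real audio (assumption: 1 s of noise-like signals).
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)       # hypothetical clean target speech
interferer = rng.standard_normal(16000)      # hypothetical interfering source
mixture = reference + interferer             # input to the extraction system
estimate = reference + 0.1 * interferer      # hypothetical system output

improvement = si_sdr_improvement(estimate, mixture, reference)
print(f"SI-SDRi: {improvement:.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the system output does not change the score, which is why this metric is often preferred over plain SDR for separation and extraction systems.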

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Target Speech Extraction with Pre-trained Self-supervised Learning Models

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

Multi-Level Speaker Representation for Target Speaker Extraction

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Probing Self-supervised Learning Models with Target Speech Extraction

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

Robust Speaker Extraction Network Based on Iterative Refined Adaptation

A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction

Improving curriculum learning for target speaker extraction with synthetic speakers

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss