Abstract: The goal of speech separation is to extract multiple speech sources from a single microphone recording. Recently, with the advancement of deep learning and the availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background noise using a supervised learning algorithm, typically a deep neural network. A long-standing problem in supervised speech separation is label permutation ambiguity: determining the output-label assignment between the separated sources and the available single-speaker speech labels. Finding the best output-label assignment is required to calculate the separation error, which is then used to update the parameters of the model. Recently, Permutation Invariant Training (PIT) has been shown to be a promising solution to the label ambiguity problem. However, the overconfident choice of the output-label assignment by PIT results in a sub-optimally trained model. In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment. Our proposed method, called trainable Soft-minimum PIT, is then employed on the same Long Short-Term Memory (LSTM) architecture used in the PIT speech separation method. The results of our experiments show that the proposed method significantly outperforms conventional PIT speech separation (p-value < 0.01), by +1 dB in Signal-to-Distortion Ratio (SDR) and +1.5 dB in Signal-to-Interference Ratio (SIR).
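The difference between a hard and a soft choice of output-label assignment can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a mean squared error as the separation error, a generic log-sum-exp soft minimum over all permutations, and a hypothetical sharpness parameter `gamma`; the function names are illustrative only.

```python
import itertools
import math


def permutation_costs(outputs, labels):
    """Return {permutation: mean MSE} over every output-label assignment.

    outputs and labels are lists of equal-length sequences of floats;
    perm[i] is the index of the label assigned to output i.
    """
    n = len(outputs)
    # cost[i][j]: MSE between separated output i and reference label j.
    cost = [
        [sum((o - l) ** 2 for o, l in zip(out, lab)) / len(out) for lab in labels]
        for out in outputs
    ]
    return {
        perm: sum(cost[i][perm[i]] for i in range(n)) / n
        for perm in itertools.permutations(range(n))
    }


def pit_loss(outputs, labels):
    """Conventional PIT: commit to the single lowest-cost permutation."""
    costs = permutation_costs(outputs, labels)
    best = min(costs, key=costs.get)
    return costs[best], best


def soft_min_pit_loss(outputs, labels, gamma=1.0):
    """Soft minimum over permutations (assumed form, not the paper's exact one):
    -1/gamma * log(sum_p exp(-gamma * cost_p)).

    Every permutation contributes; a large gamma approaches the hard minimum.
    """
    costs = list(permutation_costs(outputs, labels).values())
    m = min(costs)  # subtract the minimum for numerical stability
    return m - math.log(sum(math.exp(-gamma * (c - m)) for c in costs)) / gamma


if __name__ == "__main__":
    # Two toy "separated" outputs whose order is swapped relative to the labels.
    outputs = [[1.0, 0.0], [0.0, 1.0]]
    labels = [[0.0, 1.0], [1.0, 0.0]]
    print(pit_loss(outputs, labels))
    print(soft_min_pit_loss(outputs, labels, gamma=1.0))
```

In this sketch the hard minimum passes the error of only one assignment to the model update, whereas the soft minimum weights all assignments by their cost, which is one way to avoid an overconfident early choice. The number of permutations grows factorially with the number of speakers, so enumerating them is only practical for small speaker counts.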
