Abstract: The goal of speech separation is to extract multiple speech sources from a single microphone recording. Recently, with the advancement of deep learning and the availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background noise using a supervised learning algorithm, typically a deep neural network. A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal, referred to as label permutation ambiguity: the problem of determining the output-label assignment between the separated sources and the available single-speaker speech labels. Finding the best output-label assignment is required to calculate the separation error, which is then used to update the model parameters. Recently, Permutation Invariant Training (PIT) has been shown to be a promising solution to the label ambiguity problem. However, PIT's overconfident choice of the output-label assignment results in a sub-optimally trained model. In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment. Our proposed method, trainable Soft-minimum PIT, is then applied to the same Long Short-Term Memory (LSTM) architecture used in the PIT speech separation method. The results of our experiments show that the proposed method significantly outperforms conventional PIT speech separation (p-value < 0.01), by +1 dB in Signal to Distortion Ratio (SDR) and +1.5 dB in Signal to Interference Ratio (SIR).
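To make the output-label assignment problem concrete, the following is a minimal sketch, not the authors' implementation. It assumes a mean-squared-error separation loss and contrasts the hard minimum over permutations used in conventional PIT with one possible soft minimum (a negative log-sum-exp with a fixed temperature `gamma`). The function names, the MSE criterion, the log-sum-exp form, and `gamma` are all illustrative assumptions; the abstract does not specify the exact formulation, and in the proposed method the soft-minimum is described as trainable rather than fixed.

```python
import itertools

import numpy as np


def permutation_losses(estimates, targets):
    """Return every output-label assignment and its MSE.

    estimates, targets: arrays of shape (num_sources, num_samples).
    MSE is an assumed stand-in for the separation error.
    """
    num_sources = estimates.shape[0]
    perms = list(itertools.permutations(range(num_sources)))
    losses = np.array(
        [np.mean((estimates - targets[list(perm)]) ** 2) for perm in perms]
    )
    return perms, losses


def pit_loss(estimates, targets):
    """Conventional PIT: commit to the single lowest-error assignment."""
    perms, losses = permutation_losses(estimates, targets)
    best = int(np.argmin(losses))
    return losses[best], perms[best]


def soft_min_pit_loss(estimates, targets, gamma=1.0):
    """Hypothetical soft minimum: -(1/gamma) * log(sum_p exp(-gamma * L_p)).

    Every assignment contributes, weighted by how low its error is.
    As gamma grows, the value approaches the hard PIT minimum.
    """
    _, losses = permutation_losses(estimates, targets)
    z = -gamma * losses
    z_max = z.max()  # subtract the max for a numerically stable log-sum-exp
    return -(z_max + np.log(np.sum(np.exp(z - z_max)))) / gamma


# Two toy "sources"; the estimates come out in swapped order.
targets = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
estimates = targets[::-1].copy()

hard_loss, best_perm = pit_loss(estimates, targets)
soft_loss = soft_min_pit_loss(estimates, targets, gamma=1.0)
print(best_perm, hard_loss, soft_loss)
```

In this toy case the hard minimum selects the swapped assignment outright, whereas the soft minimum keeps a contribution from both assignments. This is one way to avoid an overconfident early choice, under the assumptions stated above.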
