Abstract: This paper proposes a new source model and training scheme to improve the accuracy and speed of the multichannel variational autoencoder (MVAE) method, a recently proposed and powerful multichannel source separation method. The MVAE method first pretrains a source model represented by a conditional VAE (CVAE) and then, given an observed mixture signal, estimates the separation matrices along with the other unknown parameters so that the log-likelihood is non-decreasing. Although the MVAE method has been shown to provide high source separation performance, one drawback is the computational cost of the backpropagation steps in the separation-matrix estimation algorithm. To overcome this drawback, a method called "FastMVAE" was subsequently proposed, which uses an auxiliary classifier VAE (ACVAE) to train the source model. With the classifier and encoder trained in this way, the optimal parameters of the source model can be inferred efficiently, albeit approximately, in each step of the algorithm. However, the generalization capability of the trained ACVAE source model was not satisfactory, which led to poor performance on unseen data. To improve the generalization capability, this paper proposes a new model architecture (called the "ChimeraACVAE" model) and a training scheme based on knowledge distillation. Experimental results show that the proposed source model trained with the proposed loss function achieves better source separation performance with less computation time than FastMVAE. We also confirmed that our methods can separate 18 sources with reasonably good accuracy.
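For readers unfamiliar with knowledge distillation, the general idea can be sketched as follows: a student model is trained so that its output distribution matches the softened output distribution of a pretrained teacher model. This is a minimal, generic sketch and not the loss function of this paper, whose exact form the abstract does not give. The function names, the `temperature` parameter, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (illustrative helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence KL(teacher || student) between softened distributions.

    Assumption: a generic distillation objective, used here only to
    illustrate the concept; it is not the loss proposed in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )


# Hypothetical logits over three classes (for example, three speaker identities).
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, so minimizing it pulls the student toward the teacher. A higher `temperature` flattens both distributions, which exposes more of the teacher's relative preferences among the less likely classes.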
