Abstract: A speech signal is composed of various informative factors, such as linguistic content and speaker characteristics. Notable recent studies have attempted to factorize the speech signal into these individual factors without requiring any annotation. These studies typically assume a continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models the linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder serves to reconstruct speech features from the encoder outputs. The mFAE is evaluated on a speaker verification (SV) task and an unsupervised subword modeling (USM) task. The SV experiments on VoxCeleb 1 show that the utterance embedder is capable of extracting speaker-discriminative embeddings with performance comparable to an x-vector baseline. The USM experiments on the ZeroSpeech 2017 dataset verify that the frame tokenizer is able to capture linguistic content and the utterance embedder can acquire speaker-related information.
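
The data flow described in the abstract can be illustrated with a minimal sketch. Everything beyond the three named components is an assumption made for illustration: the abstract does not specify layer types, pooling, dimensions, or the training loss, so the single linear layers, mean pooling, tanh, mean squared error, and all names below (`mfae_forward`, `init_layer`, `tok`, `emb`, `dec`) are hypothetical rather than the authors' implementation. Pure Python is used only to keep the example dependency-free.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def linear(vec, weights, bias):
    # weights has one row per output unit; each row has len(vec) entries.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    # Hypothetical single linear layer standing in for each sub-network.
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mfae_forward(frames, tok, emb, dec):
    n_frames, dim = len(frames), len(frames[0])
    # Frame tokenizer: a categorical posterior (soft mixture label) per frame.
    posteriors = [softmax(linear(f, *tok)) for f in frames]
    # Utterance embedder (assumed): mean-pool the frames, then project once.
    mean_frame = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    embedding = [math.tanh(v) for v in linear(mean_frame, *emb)]
    # Frame decoder (assumed): reconstruct each frame from its posterior
    # concatenated with the shared utterance embedding.
    recon = [linear(q + embedding, *dec) for q in posteriors]
    # Mean squared reconstruction error over all frames and dimensions.
    loss = sum((a - b) ** 2
               for f, r in zip(frames, recon)
               for a, b in zip(f, r)) / (n_frames * dim)
    return posteriors, embedding, recon, loss

# Toy utterance: T frames of D-dimensional features, K mixtures, E-dim embedding.
rng = random.Random(0)
D, K, E, T = 5, 4, 3, 6
tok = init_layer(K, D, rng)
emb = init_layer(E, D, rng)
dec = init_layer(D, K + E, rng)
frames = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
posteriors, embedding, recon, loss = mfae_forward(frames, tok, emb, dec)
```

In this sketch the decoder sees only a per-frame distribution over `K` mixture components plus one vector shared by the whole utterance, which is the bottleneck that would push frame-varying information into the tokenizer and utterance-constant information into the embedder.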

Competing Speaker Count Estimation on the Fusion of the Spectral and Spatial Embedding Space

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Learning-based Robust Speaker Counting and Separation with the Aid of Spatial Coherence

Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition

Fusion of deep shallow features and models for speaker recognition

Deep Speaker: an End-to-End Neural Speaker Embedding System

Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations

Simultaneous Denoising and Dereverberation Using Deep Embedding Features

A novel speech feature fusion algorithm for text-independent speaker recognition

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Supervised Speaker Embedding De-Mixing in Two-Speaker Environment

Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering

Beamforming and Deep Models Integrated Multi-talker Speech Separation

Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal

NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge

Exploiting Speaker Embeddings for Improved Microphone Clustering and Speech Separation in ad-hoc Microphone Arrays