Universal Background Sparse Coding and Multilayer Bootstrap Network for Speaker Clustering

Xiao-Lei Zhang
DOI: https://doi.org/10.21437/interspeech.2016-65
2016-01-01
Abstract: In speaker recognition, the Gaussian-mixture-model-based universal background model is the standard method for extracting high-dimensional supervectors, and the factor-analysis-based i-vector is a recent state-of-the-art method for reducing these high-dimensional supervectors to low-dimensional representations. In this abstract paper, we propose an alternative to these techniques based on multilayer bootstrap networks (MBN). We first learn a high-dimensional sparse code for each frame with a universal background MBN, and then accumulate the sparse codes of the frames in a session (a.k.a. utterance) into a single high-dimensional sparse supervector. Finally, we reduce the session-level sparse supervectors to a low-dimensional subspace, using MBN for unsupervised speaker clustering or principal component analysis for supervised speaker classification. Our initial result on a small-scale problem demonstrates the effectiveness of the proposed method. Note that this abstract paper is used to protect the idea. A full version with large-scale experiments will be announced later.
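The three steps in the abstract (frame-level sparse coding, session-level accumulation, dimensionality reduction) can be illustrated with a rough sketch. This is not the author's implementation: the abstract gives no algorithmic details, so the code below assumes a simplified single-layer stand-in for the universal background MBN, in which each of several codebooks is a random sample of background frames and each frame is coded as a one-hot nearest-centroid indicator per codebook. All function names and parameters are hypothetical, and only the PCA branch of the final step is shown.

```python
import numpy as np

def fit_random_codebooks(background, n_codebooks, k, rng):
    # Assumption: each codebook is k frames drawn at random (without
    # replacement) from the background data, standing in for one MBN layer.
    return [background[rng.choice(len(background), size=k, replace=False)]
            for _ in range(n_codebooks)]

def sparse_code(frames, codebooks):
    # For every codebook, mark the nearest centroid of each frame with a 1,
    # then concatenate the one-hot blocks into one sparse code per frame.
    blocks = []
    for centroids in codebooks:
        dist = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        onehot = np.zeros_like(dist)
        onehot[np.arange(len(frames)), dist.argmin(axis=1)] = 1.0
        blocks.append(onehot)
    return np.hstack(blocks)

def session_supervector(frames, codebooks):
    # Accumulate (sum) the frame-level codes of one session into a single
    # high-dimensional supervector.
    return sparse_code(frames, codebooks).sum(axis=0)

def pca_reduce(supervectors, n_components):
    # Plain PCA via SVD of the mean-centered supervector matrix.
    centered = supervectors - supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Usage on synthetic data (random "frames" of dimension 5, for shape only).
rng = np.random.default_rng(0)
background = rng.normal(size=(200, 5))
codebooks = fit_random_codebooks(background, n_codebooks=4, k=10, rng=rng)
sessions = [rng.normal(size=(30, 5)) for _ in range(6)]
supervectors = np.stack([session_supervector(s, codebooks) for s in sessions])
embedding = pca_reduce(supervectors, n_components=2)
```

Summing the codes keeps the sketch close to the word "accumulate" in the abstract; in practice one might normalize by the number of frames so that sessions of different lengths are comparable, but the abstract does not say whether the method does this.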