Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

Simon Malan, Benjamin van Niekerk, Herman Kamper
2024-09-22
Abstract: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses the problem of segmenting unlabeled speech into word-like segments and clustering those segments into a vocabulary (lexicon). Specifically, the authors focus on unsupervised word segmentation and vocabulary learning: how to identify word boundaries in continuous speech and build a vocabulary without any labels.

### Main research problems:

1. **Unsupervised word segmentation**: How to segment continuous speech audio into word-like segments without labels.
2. **Vocabulary building**: How to cluster the resulting segments into an effective vocabulary.

### Research background:

- Speech is a continuous stream, and there are usually no obvious pauses separating words.
- Building a vocabulary is a further challenge because pronunciation differs between speakers, and even a single speaker can show great variation.
- Human infants demonstrate word discrimination and recognition abilities within their first year, which provides biological inspiration for this research.

### Solutions:

The authors propose a simple two-step method:

1. **Predicting word boundaries**: Use the dissimilarity between adjacent self-supervised features to predict word boundaries. Concretely, the cosine distance between adjacent frames is computed, and boundaries are determined from this distance after smoothing.
2. **Clustering to build a vocabulary**: Apply K-means clustering to the predicted word segments to build a vocabulary.

### Comparison and improvement:

- The authors compare this method with dynamic-programming-based segmentation. For a fair comparison, they update the older ES-KMeans method with better features and boundary constraints, and call the result ES-KMeans+.
- On the five-language ZeroSpeech benchmarks, the proposed method achieves results comparable to ES-KMeans+ while being almost five times faster.
### Formula representation:

- The cosine distance between adjacent frames is computed as:
  \[ f_t = d(y_{t + 1}, y_t) \]
  where \( y_t \) is the feature vector of the \( t \)-th frame and \( d \) is the cosine distance function.
- The embedding of a predicted word segment is the average of its frames:
  \[ z_i = g(x_{t_1:t_2}) = \frac{1}{t_2 - t_1 + 1}\sum_{t = t_1}^{t_2} x_t \]
  where \( x_{t_1:t_2} \) is the predicted word segment spanning frames \( t_1 \) to \( t_2 \), and \( g \) denotes averaging followed by normalization to the unit sphere (the formula shows only the averaging step).

With these methods, the authors show that efficiency can be improved substantially while maintaining performance.
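The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Hann-shaped smoothing kernel, the peak-picking rule, and the `window` and `threshold` parameters are all assumptions made for illustration, since the summary only states that boundaries are determined "through smoothing".

```python
import numpy as np


def adjacent_cosine_distance(y):
    """f_t = d(y_{t+1}, y_t): cosine distance between neighbouring frames.

    y has shape (T, D); the result has shape (T - 1,).
    """
    y = np.asarray(y, dtype=float)
    norms = np.maximum(np.linalg.norm(y, axis=1, keepdims=True), 1e-12)
    unit = y / norms
    return 1.0 - np.sum(unit[1:] * unit[:-1], axis=1)


def predict_boundaries(y, window=3, threshold=0.3):
    """Return frame indices at which a new segment is assumed to start.

    Assumption: the distance curve is smoothed with a normalized Hann
    kernel and boundaries are local maxima above `threshold`.
    """
    f = adjacent_cosine_distance(y)
    kernel = np.hanning(window + 2)[1:-1]  # drop the zero end points
    kernel = kernel / kernel.sum()
    s = np.convolve(f, kernel, mode="same")

    boundaries = []
    for t in range(len(s)):
        left = s[t - 1] if t > 0 else -np.inf
        right = s[t + 1] if t < len(s) - 1 else -np.inf
        if s[t] > threshold and s[t] >= left and s[t] > right:
            # f[t] compares frames t and t + 1, so the segment starts at t + 1.
            boundaries.append(t + 1)
    return boundaries


def segment_embedding(x, t1, t2):
    """z_i = g(x_{t1:t2}): mean over frames t1..t2 (inclusive), then unit norm."""
    mean = np.asarray(x, dtype=float)[t1 : t2 + 1].mean(axis=0)
    return mean / np.maximum(np.linalg.norm(mean), 1e-12)


if __name__ == "__main__":
    # Toy features: four identical frames followed by four orthogonal ones.
    frames = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
    print(predict_boundaries(frames))
    print(segment_embedding(frames, 0, 3))
```

In the pipeline described above, the unit-norm segment embeddings would then be clustered with K-means to form the vocabulary; that step is omitted here to keep the sketch short.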