Abstract: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.

What problem does this paper attempt to address?

This paper addresses the problem of segmenting unlabeled speech into word-like segments and clustering these segments to build a vocabulary (lexicon). Specifically, the authors focus on unsupervised word segmentation and vocabulary learning: how to automatically identify word boundaries in continuous speech and build a vocabulary without any labels.

### Main research problems:

1. **Unsupervised word segmentation**: How to segment continuous speech audio into word-like segments without labels.
2. **Vocabulary building**: How to cluster the resulting segments to form an effective vocabulary.

### Research background:

- Speech is a continuous stream, and there are usually no obvious pauses that separate words.
- Building a vocabulary poses a further challenge because speech differs across speakers, and even the same speaker can show great variation.
- Human infants show word discrimination and recognition abilities within their first year, which provides biological inspiration for the research.

### Solutions:

The authors propose a simple method consisting of two steps:

1. **Predicting word boundaries**: Use the dissimilarity between adjacent self-supervised features to predict word boundaries. Specifically, compute the cosine distance between adjacent frames, smooth the resulting dissimilarity curve, and determine the boundaries from it.
2. **Clustering to build a vocabulary**: Perform K-means clustering on the predicted word segments to build a vocabulary.

### Comparison and improvement:

- The authors compare this method with an earlier dynamic-programming-based method, ES-KMeans, which they update to use better features and boundary constraints (the updated version is called ES-KMeans+).
- On the five-language ZeroSpeech benchmarks, the authors' method achieves results comparable to ES-KMeans+ while being almost five times faster.
### Formula representation:

- The cosine distance between adjacent frames is computed as:

  \[ f_t = d(y_{t + 1}, y_t) \]

  where \( y_t \) is the feature vector of the \( t \)-th frame and \( d \) is the cosine distance function.
- The average embedding vector of a segment is computed as:

  \[ z_i = g(x_{t_1:t_2}) = \frac{1}{t_2 - t_1 + 1}\sum_{t = t_1}^{t_2} x_t \]

  where \( x_{t_1:t_2} \) is the predicted word segment spanning frames \( t_1 \) to \( t_2 \), and \( g \) denotes averaging the frames of the segment, followed by normalization to the unit sphere.

With these methods, the authors show that efficiency can be improved significantly while maintaining performance.
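To make the two formulas concrete, here is a minimal sketch of the pipeline in plain Python. It is not the authors' implementation. Several details are assumptions made only for illustration: the `[1, 2, 1]` smoothing kernel, the thresholded local-maximum rule in `pick_boundaries`, the `threshold` value, the farthest-point initialization in `kmeans`, and the use of the same frame features for both \( y_t \) and \( x_t \). All function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """d(a, b) = 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dissimilarity_curve(frames):
    """f_t = d(y_{t+1}, y_t) for every pair of adjacent frames."""
    return [cosine_distance(frames[t + 1], frames[t])
            for t in range(len(frames) - 1)]


def smooth(curve):
    """ASSUMPTION: smooth with a [1, 2, 1] kernel, renormalized at the edges."""
    out = []
    for i in range(len(curve)):
        total, weight_sum = 0.0, 0.0
        for offset, w in zip((-1, 0, 1), (1.0, 2.0, 1.0)):
            j = i + offset
            if 0 <= j < len(curve):
                total += w * curve[j]
                weight_sum += w
        out.append(total / weight_sum)
    return out


def pick_boundaries(curve, threshold):
    """ASSUMPTION: a boundary is a local maximum above `threshold`.

    curve[t] compares frames t and t+1, so a peak at t means a new
    segment starts at frame t + 1.
    """
    return [t + 1 for t in range(1, len(curve) - 1)
            if curve[t] > threshold
            and curve[t] > curve[t - 1] and curve[t] >= curve[t + 1]]


def segment_embedding(frames, t1, t2):
    """z_i: mean of frames t1..t2 (inclusive), scaled to unit length."""
    n = t2 - t1 + 1
    mean = [sum(frames[t][k] for t in range(t1, t2 + 1)) / n
            for k in range(len(frames[0]))]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain K-means; farthest-point initialization is an assumption."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def discover_words(frames, k, threshold=0.3):
    """Boundary detection followed by clustering of segment embeddings."""
    curve = smooth(dissimilarity_curve(frames))
    starts = [0] + pick_boundaries(curve, threshold)
    ends = [s - 1 for s in starts[1:]] + [len(frames) - 1]
    segments = list(zip(starts, ends))
    embeddings = [segment_embedding(frames, s, e) for s, e in segments]
    return segments, kmeans(embeddings, k)


# Toy input: three frames of "word A", three of "word B", then "word A" again.
toy_frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
              [0.0, 1.0], [0.1, 0.9], [0.1, 1.0],
              [1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
toy_segments, toy_labels = discover_words(toy_frames, k=2)
print(toy_segments, toy_labels)
```

On this toy input the dissimilarity curve spikes where the frames switch direction, so the sketch recovers three segments, and the first and last segments fall into the same cluster. Real self-supervised features are high-dimensional and noisy, so the smoothing width and the boundary rule would need tuning on actual data.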

Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming
