Unsupervised Dialogue Topic Segmentation in Hyperdimensional Space

Seongmin Park, Jinkyu Seo, Jihwa Lee
DOI: https://doi.org/10.21437/Interspeech.2023-1859
2023-08-21
Abstract: We present HyperSeg, a hyperdimensional computing (HDC) approach to unsupervised dialogue topic segmentation. HDC is a class of vector symbolic architectures that leverages the probabilistic orthogonality of randomly drawn vectors at extremely high dimensions (typically over 10,000). HDC generates rich token representations through its low-cost initialization of many unrelated vectors. This is especially beneficial in topic segmentation, which often operates as a resource-constrained pre-processing step for downstream transcript understanding tasks. HyperSeg outperforms the current state-of-the-art in 4 out of 5 segmentation benchmarks -- even when baselines are given partial access to the ground truth -- and is 10 times faster on average. We show that HyperSeg also improves downstream summarization accuracy. With HyperSeg, we demonstrate the viability of HDC in a major language task. We open-source HyperSeg to provide a strong baseline for unsupervised topic segmentation.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses topic segmentation of dialogue transcripts, such as those produced by automatic speech recognition (ASR). It proposes HyperSeg, a framework that uses hyperdimensional computing (HDC) for unsupervised dialogue topic segmentation. The goal is to overcome two weaknesses of existing methods, namely their fragility across datasets from different domains and their heavy dependence on hyperparameters, while also improving segmentation speed and the accuracy of downstream tasks such as summarization. The main contributions of the paper are:

1. **Proposing the HyperSeg framework**: HDC is applied to the topic segmentation task for the first time, generating more robust and semantically coherent sentence embeddings.
2. **Outperforming existing methods**: On 4 of 5 segmentation benchmarks, HyperSeg outperforms the current best unsupervised segmentation algorithms, even when those baselines are given optimal hyperparameters or partial access to the ground-truth labels.
3. **Significantly improving processing speed**: Compared to neural network-based methods, HyperSeg is approximately 10 times faster on average and runs entirely on the CPU.
4. **Improving downstream task performance**: Segmenting transcripts with HyperSeg before summarization improves downstream summarization accuracy.

Together, these results show that HyperSeg improves segmentation quality while remaining efficient enough for practical use as a pre-processing step.
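
To make the HDC idea concrete, the following is a minimal sketch, not the HyperSeg algorithm itself. It assumes bipolar (+1/-1) hypervectors, encodes each utterance by summing its token hypervectors, and places a boundary wherever the cosine similarity between adjacent utterances falls below a fixed threshold. The function names, the threshold value, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
import math
import random

DIM = 10_000  # hypervector dimensionality; the abstract cites "typically over 10,000"


def random_hypervector(rng, dim=DIM):
    """Draw a random bipolar (+1/-1) hypervector."""
    return [1 if rng.random() < 0.5 else -1 for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode_utterance(tokens, item_memory, rng):
    """Bundle (element-wise sum) the hypervectors of an utterance's tokens.

    Unseen tokens get a fresh random hypervector, which is nearly
    orthogonal to every other one at this dimensionality.
    """
    total = [0] * DIM
    for tok in tokens:
        if tok not in item_memory:
            item_memory[tok] = random_hypervector(rng)
        total = [t + h for t, h in zip(total, item_memory[tok])]
    return total


def segment(utterances, threshold=0.1, seed=0):
    """Return indices i where a new topic is assumed to start at utterance i.

    Assumption for illustration: a boundary is placed wherever adjacent
    utterance vectors have cosine similarity below `threshold`.
    """
    rng = random.Random(seed)
    item_memory = {}
    encoded = [
        encode_utterance(u.lower().split(), item_memory, rng) for u in utterances
    ]
    return [
        i
        for i in range(1, len(encoded))
        if cosine(encoded[i - 1], encoded[i]) < threshold
    ]


if __name__ == "__main__":
    dialogue = [
        "the flight to paris leaves at noon",
        "is the flight to paris on time",
        "my laptop battery drains quickly",
        "the laptop battery needs replacing",
    ]
    print(segment(dialogue))
```

In this toy dialogue, the first two utterances share several tokens and so do the last two, while the second and third share none, so the sketch should place its only boundary before the third utterance. Because token vectors are drawn at random rather than learned, initialization is cheap and runs on the CPU, which is the property the abstract highlights.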