Exploring the Benefits of Tokenization of Discrete Acoustic Units

Avihu Dekel, Raul Fernandez
2024-06-09
Abstract: Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper explores the benefits of applying tokenization algorithms to sequences of phonemes and Discrete Acoustic Units (DAUs). Specifically, the authors demonstrate the advantages of tokenization on three tasks:

1. **Grapheme-to-Phoneme (G2P)**: Predict the phoneme sequence corresponding to a character sequence in text.
2. **Grapheme-to-DAUs (G2DAU)**: Predict the discrete acoustic unit sequence corresponding to a character sequence in text.
3. **SpeechLM (unsupervised speech generation using DAU language modeling)**: Generate speech with a language model trained on DAUs.

By applying a tokenization algorithm such as Byte Pair Encoding (BPE), the paper reports significant improvements in task performance, training speed, and inference speed on all three tasks. The authors also offer theoretical insights into why BPE yields better performance.

### Specific Problem Description

1. **Redundancy and Predictability**: Phoneme and DAU sequences are redundant and predictable, so tokenization algorithms can compress them.
2. **Computational Complexity**: Standard Transformer models are expensive on long sequences, because self-attention cost grows quadratically with sequence length. Audio data produces especially long sequences, so shortening them through tokenization significantly reduces the computational cost.
3. **Data Imbalance**: Units in the original vocabulary may be unevenly distributed, which hampers model training. BPE balances the distribution by merging frequently occurring unit pairs.

### Main Contributions

1. **Quantifying Compression Effects**: Evaluate how well BPE compresses discrete audio and phoneme units.
2. **Performance Improvement**: Show that BPE significantly improves performance metrics on the G2P, G2DAU, and SpeechLM tasks, and speeds up training and inference.
3. **Alleviating Data Imbalance**: Analyze how BPE alleviates the data imbalance problem and reduces the sequence length in autoregressive models.

### Method Overview

- **Base Unit Construction**: Extract DAU and phoneme sequences.
- **Byte Pair Encoding (BPE)**: Apply BPE to the base-unit sequences, merging frequent pairs to produce a new, larger vocabulary.
- **Experimental Setup**: Train Transformer models on the three tasks above and compare performance with and without BPE.

### Experimental Results

According to the experimental results, applying BPE in all three tasks has the following effects:

- The sequence length is significantly reduced.
- Training and inference are faster.
- Performance metrics (such as WER, CER, BLEU, and ROUGE) are significantly improved.

In conclusion, the experiments demonstrate that tokenization algorithms are effective for speech-related tasks, providing a valuable reference for future research and applications.
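
To make the BPE step and its three claimed effects (compression, lower attention cost, a more balanced unit distribution) concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy unit IDs, the function names (`learn_bpe`, `merge_pair`, `normalized_entropy`, `attention_cost_ratio`), the number of merges, and the use of normalized entropy as a balance measure are all assumptions made for illustration. The cost estimate relies only on the standard fact that self-attention scales quadratically with sequence length.

```python
from collections import Counter
import math


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) pair in seq with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)  # tuples concatenate into a longer token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Greedy BPE over sequences of integer unit IDs (hypothetical helper).

    Each token is a tuple of base units, so a merged token still records
    which base units it covers.
    """
    seqs = [[(u,) for u in seq] for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so merging cannot compress
            break
        merges.append((left, right))
        seqs = [merge_pair(seq, left, right) for seq in seqs]
    return merges, seqs


def normalized_entropy(seqs):
    """Entropy of the token distribution divided by its maximum (1.0 = uniform)."""
    counts = Counter(tok for seq in seqs for tok in seq)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum(c / total * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))


def attention_cost_ratio(len_before, len_after):
    """Relative self-attention cost, assuming cost is proportional to length squared."""
    return (len_after / len_before) ** 2


# Toy DAU-like sequences (made-up IDs); real DAU streams often repeat units.
toy_units = [
    [5, 5, 5, 12, 12, 7, 7, 7, 7, 3],
    [5, 5, 12, 12, 12, 7, 7, 3, 3],
]

merges, encoded = learn_bpe(toy_units, num_merges=4)
len_before = sum(len(s) for s in toy_units)
len_after = sum(len(s) for s in encoded)
base_tokens = [[(u,) for u in s] for s in toy_units]

print("merges learned:", merges)
print("compression ratio:", len_before / len_after)
print("relative attention cost:", attention_cost_ratio(len_before, len_after))
print("normalized entropy before:", normalized_entropy(base_tokens))
print("normalized entropy after:", normalized_entropy(encoded))
```

Representing each token as a tuple of base units keeps decoding trivial: flattening the tuples recovers the original sequence. On real data, the number of merges (and therefore the vocabulary size) is a hyperparameter that trades sequence length against vocabulary sparsity.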