Abstract: Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.

What problem does this paper attempt to address?

This paper explores the benefits that tokenization algorithms bring when applied to Discrete Acoustic Units (DAUs) and phonemes. Specifically, the authors aim to show the advantages of tokenization in the following three tasks:

1. **Grapheme-to-Phoneme (G2P)**: Predict the phoneme sequence corresponding to the character sequence in the text.
2. **Grapheme-to-DAUs (G2DAU)**: Predict the discrete acoustic unit sequence corresponding to the character sequence in the text.
3. **SpeechLM (Unsupervised Speech Generation using DAU Language Model)**: Generate speech with a DAU-based language model.

By introducing a tokenization algorithm such as Byte Pair Encoding (BPE), the paper shows significant improvements on these tasks in terms of task performance, training speed, and inference speed. The authors also provide theoretical insights to explain why BPE can bring better performance.

### Specific Problem Description

1. **Redundancy and Predictability**: Phoneme and DAU sequences are redundant and predictable, so they can be compressed by tokenization algorithms.
2. **Computational Complexity**: Traditional Transformer models have high computational complexity on long sequences, which is especially costly for audio data. Reducing the sequence length through tokenization can significantly reduce that complexity.
3. **Data Imbalance**: The distribution of units in the original vocabulary may be unbalanced, which hurts model training. BPE balances the distribution by merging frequently occurring unit pairs.

### Main Contributions

1. **Quantifying Compression Effects**: Evaluate the compression effects of BPE on discrete audio and phoneme units.
2. **Performance Improvement**: Show that BPE significantly improves performance metrics in the G2P, G2DAU, and SpeechLM tasks, and accelerates training and inference.
3. **Alleviating Data Imbalance**: Analyze how BPE alleviates the data imbalance problem and reduces the sequence length in autoregressive models.

### Method Overview

- **Base Unit Construction**: Extract DAU and phoneme sequences.
- **Byte Pair Encoding (BPE)**: Apply the BPE algorithm to the base-unit sequences and generate a new vocabulary.
- **Experimental Setup**: Use the Transformer model to conduct experiments on the above three tasks and compare performance before and after BPE.

### Experimental Results

The experimental results show that after applying BPE in all three tasks:

- The sequence length is significantly reduced.
- Training and inference are accelerated.
- Performance metrics (such as WER, CER, BLEU, ROUGE, etc.) are significantly improved.

In conclusion, this paper demonstrates experimentally that tokenization algorithms are effective for speech-related tasks, providing a valuable reference for future research and applications.
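To make the merging step concrete, here is a minimal sketch of BPE applied to sequences of integer unit IDs. It is an illustration under stated assumptions, not the authors' implementation: the function names `learn_bpe` and `merge_pair`, the toy corpus, and the choice of three merges are all hypothetical, and a real setup would more likely rely on an existing tokenizer library.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) occurrence in seq with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)  # tuple concatenation keeps the base units
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to num_merges merges over sequences of base-unit IDs.

    Each token is stored as a tuple of base units, so an encoded sequence
    can be decoded by simply flattening its tokens.
    """
    seqs = [[(unit,) for unit in seq] for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress anything
        merges.append((left, right))
        seqs = [merge_pair(seq, left, right) for seq in seqs]
    return merges, seqs


# Toy DAU-like corpus (made-up IDs, assumed for illustration only).
corpus = [[5, 5, 12, 12, 7, 5, 5, 12, 12], [5, 5, 12, 12, 3]]
merges, encoded = learn_bpe(corpus, num_merges=3)

base_len = sum(len(seq) for seq in corpus)
bpe_len = sum(len(seq) for seq in encoded)
print(f"tokens before: {base_len}, after: {bpe_len}, ratio: {base_len / bpe_len:.2f}")
```

The ratio printed at the end is the kind of compression figure the summary refers to. Because self-attention in a standard Transformer compares every position with every other, its cost grows roughly with the square of the sequence length, so halving the length cuts that part of the computation by about a factor of four.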

Exploring the Benefits of Tokenization of Discrete Acoustic Units

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

DASB -- Discrete Audio and Speech Benchmark

Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

dMel: Speech Tokenization made Simple

Children's Speech Recognition through Discrete Token Enhancement

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Acoustic BPE for Speech Generation with Discrete Tokens

Accelerating Transducers through Adjacent Token Merging

Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer

Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

LAST: Language Model Aware Speech Tokenization

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

Getting the most out of your tokenizer for pre-training and domain adaptation

TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR

Tokenization Is More Than Compression

Discrete Unit based Masking for Improving Disentanglement in Voice Conversion

DM-Codec: Distilling Multimodal Representations for Speech Tokenization