Abstract: This paper presents a decoder that statically optimizes some of the knowledge sources and handles the others dynamically. The lexicon, phonetic contexts and acoustic model are statically integrated into a memory-efficient state network, while the language model (LM) is incorporated on the fly by means of extended tokens. The novelties of our approach to constructing the state network are (1) introducing two layers of dummy nodes to cluster the cross-word (CW) context-dependent fan-in and fan-out triphones, (2) introducing a so-called “WI layer” that stores the word identities and placing the nodes of this layer in the non-shared mid-part of the network, and (3) optimizing the network at the state level by a sufficient forward and backward node-merge process. The state network is organized as a multi-layer structure, with distinct token propagation at each layer. Several techniques, including LM look-ahead, LM caching and beam pruning, are designed to exploit the characteristics of the state network for search efficiency. For beam pruning in particular, a layer-dependent pruning method is proposed to further reduce the search space. It takes into account the neck-like characteristics of the WI layer and the reduced variety of word endings, which allows a tighter beam without introducing many search errors. In addition, LM compression, lattice-based bookkeeping and lattice garbage collection are employed to reduce memory requirements. Experiments are carried out on a Mandarin spontaneous speech recognition task in which the decoder uses a trigram LM and CW triphone models. A comparison with HDecode from the HTK toolkit shows that, within a 1% performance deviation, our decoder runs 5 times faster with half the memory footprint.
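
The layer-dependent pruning mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Token` class, the layer labels, the beam widths and the choice to measure every beam against the best score in the frame are all assumptions made for illustration. The only idea taken from the abstract is that tokens in the WI layer and at word endings can be pruned with a tighter beam than tokens elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    node_id: int
    layer: str    # hypothetical layer label, e.g. "mid", "wi", "word_end"
    score: float  # log-domain score; higher is better


# Hypothetical beam widths in the log domain (not values from the paper).
DEFAULT_BEAM = 200.0
LAYER_BEAMS = {"wi": 120.0, "word_end": 100.0}  # tighter beams for these layers


def prune(tokens, layer_beams=LAYER_BEAMS, default_beam=DEFAULT_BEAM):
    """Keep each token whose score lies within its layer's beam of the best score.

    Assumption: all beams are measured against the best score among the
    tokens of the current frame, regardless of layer.
    """
    if not tokens:
        return []
    best = max(t.score for t in tokens)
    return [
        t for t in tokens
        if t.score >= best - layer_beams.get(t.layer, default_beam)
    ]


if __name__ == "__main__":
    frame = [
        Token(0, "mid", -1000.0),
        Token(1, "wi", -1150.0),        # outside the tighter WI beam
        Token(2, "mid", -1150.0),       # same score, inside the default beam
        Token(3, "word_end", -1090.0),  # inside the word-end beam
        Token(4, "mid", -1250.0),       # outside the default beam
    ]
    print([t.node_id for t in prune(frame)])
```

In this sketch, two tokens with the same score can meet different fates depending on their layer, which is how a tighter beam on selected layers shrinks the active search space while leaving the rest of the network under the ordinary beam.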

Fast Language Model Look-ahead Algorithm Using Extended N-Gram Model

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

Reducing time-synchronous beam search effort using stage based look-ahead and language model rank based pruning

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition

Speech Recognition Lattice-Generating Algorithm with Forward-Backward Language Model

Empirically Combining Unnormalized NNLM and Back-off N-Gram for Fast N-Best Rescoring in Speech Recognition

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

A One-Pass Real-Time Decoder Using Memory-Efficient State Network

Faster Speech-LLaMA Inference with Multi-token Prediction

Efficient One-Pass Decoding with NNLM for Speech Recognition

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Construction of a compact dynamic decoder network for large vocabulary continuous speech recognition

Efficient representation and fast look-up of Maximum Entropy language models

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Interpretable Language Modeling via Induction-head Ngram Models

Adaptive Skeleton Graph Decoding

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Inference with Reference: Lossless Acceleration of Large Language Models