NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts

Yen-Ting Lin, Chao-Han Huck Yang, Zhehuai Chen, Piotr Zelasko, Xuesong Yang, Zih-Ching Chen, Krishna C Puvvada, Szu-Wei Fu, Ke Hu, Jun Wei Chiu, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang
2024-11-09
Abstract: Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we explore a new state-of-the-art performance by achieving an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
Computation and Language, Artificial Intelligence, Machine Learning, Multiagent Systems, Audio and Speech Processing
What problem does this paper attempt to address?
### The problem the paper attempts to solve

The paper asks how to effectively train a general-purpose post-recognition error-correction model that can handle datasets from multiple domains. It proposes a multi-task correction model, **NeKo**, which uses a Mixture-of-Experts (MoE) approach to improve the post-recognition outputs of speech, text, and visual inputs.

### Main problems and background

1. **Multi-modal post-recognition correction**:
   - Humans have strong capabilities across modalities such as speech recognition, visual pattern recognition, and semantic and text interpretation, but these capabilities are imperfect and misrecognition errors are common.
   - Despite these misrecognitions, humans still communicate efficiently through speech, language, or facial expressions, even when a conversation contains inaccurate vocabulary or ambiguous accents.
2. **Limitations of existing methods**:
   - Traditional post-recognition correction methods usually rely on separate correction language models, one per domain, which significantly increases the total number of parameters.
   - Fine-tuning a large language model (LLM) directly on a mixture of error-correction datasets leads to sub-optimal performance, because the datasets differ in input modality, output format, error types, and domain characteristics.
3. **Advantages of Mixture-of-Experts (MoE)**:
   - An MoE layer combines multiple expert networks with a gating (routing) network that learns to send each input to the most appropriate experts, which yields more specialized and fine-grained representations.
   - This lets the model share knowledge across tasks while still capturing the specific characteristics of each task.

### Main contributions of the paper

1. **Introduction of NeKo**:
   - The paper proposes a large language model (LLM) for multi-task error correction that uses a task-oriented MoE approach to handle multiple post-recognition correction tasks.
   - To the best of the authors' knowledge, this is the first work to explore MoE for multi-task error correction.
2. **Cross-modal post-recognition correction evaluation**:
   - In the new cross-modal post-recognition correction evaluation, NeKo serves as a strong open-source baseline for ASR, ST, OCR, and TEC.
   - Experimental results show that NeKo reaches a new state of the art on the ASR task as a multi-task error-correction model.
3. **Emergent capabilities in cross-task correction**:
   - The authors report emergent cross-task correction capabilities in NeKo. They describe it as the first multi-task correction method of this kind, which offers a new direction for the design of general post-recognition language models.
4. **Open-source plan**:
   - The authors plan to release the NeKo model, the newly created dataset, and the training process under the CC BY-SA 4.0 license to support reproducibility and encourage future research.

### Experimental results

1. **ASR task**:
   - Experiments on the Open ASR Leaderboard show that NeKo achieves an average relative WER reduction of 5.0% across multiple datasets.
   - The improvement is especially pronounced on more challenging datasets such as AMI (conversational speech) and VoxPopuli (accented speech).
   - In zero-shot evaluation on the Hyporadise benchmark, NeKo outperforms GPT-3.5 and Claude-Opus, with relative WER reductions ranging from 15.5% to 27.6%.
2. **ST and MT tasks**:
   - Experiments on the HypoTranslate dataset show that NeKo performs strongly in both zero-shot and few-shot settings, with substantial improvements in BLEU scores.
   - NeKo is also competitive on the Japanese and Chinese machine translation tasks of WMT'20.
3. **OCR task**:
   - Experiments on the Post-OCR Correction dataset show that NeKo also delivers significant improvements in OCR error correction.
4. **TEC task**:
   - Experiments on the CoEdIT dataset show that NeKo performs well on grammar-correction and coherence-improvement tasks, which supports its effectiveness in following text-editing instructions.

### Conclusion

Through its task-oriented Mixture-of-Experts approach, NeKo achieves significant performance improvements on multi-modal post-recognition correction tasks, demonstrating a strong ability to handle multiple tasks and domain datasets within a single model.
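
To make the routing idea described above more concrete, here is a minimal pure-Python sketch of a top-k MoE layer with an optional task-oriented override. It is an illustration under stated assumptions, not the authors' implementation: the `TASK_TO_EXPERT` mapping, the function names, the choice of `top_k=2`, and the rule "always include the task's mapped expert during training, fall back to plain top-k gating when no task label is given" are all hypothetical choices made for this example.

```python
import math

# Hypothetical task-to-expert assignment; the summary does not give the real mapping.
TASK_TO_EXPERT = {"asr": 0, "st_mt": 1, "ocr": 2, "tec": 3}


def softmax(scores):
    """Numerically stable softmax over a list of gate logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, top_k=2, task=None):
    """Pick experts for one token and return [(expert_index, weight), ...].

    With a task label (assumed training-time behaviour), the task's mapped
    expert is always selected first; remaining slots go to the highest gate
    probabilities. Without a task label, this is ordinary top-k gating.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = []
    if task is not None:
        chosen.append(TASK_TO_EXPERT[task])
    for i in ranked:
        if len(chosen) >= top_k:
            break
        if i not in chosen:
            chosen.append(i)
    # Renormalize so the weights of the selected experts sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, gate_logits, experts, top_k=2, task=None):
    """Weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(token)
    for idx, weight in route(gate_logits, top_k, task):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts: each one simply scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 3.0, 0.2, 2.0]
print(route(logits))              # plain top-2 gating
print(route(logits, task="ocr"))  # mapped OCR expert is forced in
```

In this sketch, the experts still share the rest of the network, so knowledge can transfer across tasks, while the forced assignment nudges each expert toward one dataset family. A real model would apply this per token inside each transformer block and learn the gate logits jointly with the experts.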
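
The results above are stated as relative WER reductions, which are easy to confuse with absolute ones. The following sketch shows how word error rate and its relative reduction are commonly computed. The sentences are toy data invented for this example, not taken from the paper, and the function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return errors / n_ref


def relative_wer_reduction(baseline_wer, corrected_wer):
    """Fraction of the baseline WER removed by the corrector."""
    return (baseline_wer - corrected_wer) / baseline_wer


# Toy example (invented data): an ASR hypothesis and a corrected version.
reference = "the cat sat on the mat"
asr_output = "the cat sad on mat"  # one substitution, one deletion
corrected = "the cat sat on mat"   # one deletion remains

baseline = wer(reference, asr_output)
after = wer(reference, corrected)
print(f"baseline WER {baseline:.3f}, corrected WER {after:.3f}")
print(f"relative reduction {relative_wer_reduction(baseline, after):.1%}")
```

Under this definition, a relative reduction of 5.0% means that a hypothetical baseline WER of 10.0% would drop to 9.5%, not to 5.0%.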