Abstract: In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels and mentions for NER, significantly increasing the sequence length. To this end, we introduce Parallel Decoding in LLM for NER (PaDeLLM-NER), an approach that integrates seamlessly into existing generative model frameworks without necessitating additional modules or architectural modifications. PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency. Experiments reveal that PaDeLLM-NER significantly increases inference speed, running 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese. At the same time, it maintains prediction quality, with performance on par with the state of the art across various datasets.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to reduce the generation latency of large language models (LLMs) in the named entity recognition (NER) task. Existing LLM-based approaches decode sequentially: they generate all labels and entity mentions autoregressively, which significantly increases the sequence length and therefore the latency. To address this, the authors propose **Parallel Decoding in LLM for NER (PaDeLLM-NER)**, a method that can be integrated into existing generative model frameworks without additional modules or architectural modifications. PaDeLLM-NER allows all entity mentions to be decoded simultaneously, thereby reducing generation latency.

### Main contributions

1. **Propose PaDeLLM-NER**: A new NER method that predicts all label-entity pairs in parallel, reducing inference latency.
2. **Experimental verification**: Extensive experiments show that PaDeLLM-NER significantly improves inference efficiency. The average sequence length is reduced by approximately 87%, and inference is 1.76 to 10.22 times faster than the traditional autoregressive method.
3. **Maintain or improve prediction quality**: PaDeLLM-NER matches or exceeds the prediction quality of the traditional autoregressive method on multiple datasets, with performance comparable to existing state-of-the-art methods.

### Method overview

#### 3.1 Reconstruction of instruction fine-tuning

- **Input text segmentation**: A single unstructured text containing all label-entity pairs is split into multiple sequences. The output of each new sequence includes the number of mentions of the specified label (“entity type”) and the n-th mention of this label (“<mention n>”).
- **Training objective**: Optimization uses the cross-entropy loss function. The loss is computed starting from the number of mentions and skips the “<mention n>” tokens, because these are added during inference and do not need to be generated.

#### 3.2 Inference of label-entity pairs

- **Two-step inference**:
  1. **Predict the number of mentions**: Based on the label prompt, predict the number of mentions of each label in the input.
  2. **Predict entity mentions**: According to the predicted number of mentions, generate the entity mentions of each label.

All label-entity pairs are decoded in parallel, either on multiple GPUs or through batch inference on a single GPU.

#### 3.3 Deduplication of duplicate entities

- **Deduplication strategy**: Because parallel decoding may cause the same entity to appear under different labels, the prediction probability is used to remove duplicates. The prediction probability of each entity instance is calculated, and the instance with the highest probability is retained.

### Experimental results

#### 4.1 Experimental setup

- **Datasets**: English and Chinese NER datasets, such as CoNLL2003, ACE2005, GENIA, Weibo, MSRA, etc.
- **Model setup**: Pre-trained Llama2-7b and Baichuan2-7b are used as base models.
- **Baseline methods**: Traditional autoregressive methods (such as AutoReg Aug and AutoReg Struct) and other recent state-of-the-art methods (such as BINDER, GoLLIE, DeepStruct, etc.).

#### 4.2 Main results

- **Latency evaluation**: PaDeLLM-NER significantly reduces inference latency, especially on the Weibo dataset, where it is 10.22 times faster than AutoReg Struct.
- **Prediction quality evaluation**: PaDeLLM-NER achieves the highest average F1 score (84.79) on multiple datasets and performs particularly well on the Weibo, Youku, and ACE2005 datasets.

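To see why parallel decoding shortens latency, it can help to count sequential decoding steps. The sketch below is a toy illustration, not a reproduction of the paper's measurements. It assumes latency is proportional to the number of tokens generated one after another, uses whitespace-separated words as a stand-in for tokens, assumes each mention count fits in one token, and ignores batching overhead. The function names and the example entities are hypothetical.

```python
from typing import Dict, List


def n_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: count whitespace-separated words.
    return len(text.split())


def sequential_steps(pairs: Dict[str, List[str]]) -> int:
    # Autoregressive baseline (assumed output format): every label and
    # mention is emitted in one long sequence, token by token.
    return sum(
        n_tokens(label) + n_tokens(mention)
        for label, mentions in pairs.items()
        for mention in mentions
    )


def parallel_steps(pairs: Dict[str, List[str]]) -> int:
    # Step 1: all per-label counts are decoded at once (assumed 1 token each).
    # Step 2: all mentions are decoded at once, so the longest mention
    # determines how many sequential steps are needed.
    longest = max(
        (n_tokens(m) for mentions in pairs.values() for m in mentions),
        default=0,
    )
    return 1 + longest


# Hypothetical gold annotations for one input sentence.
pairs = {
    "PER": ["Alice Smith", "Bob"],
    "LOC": ["New York City"],
    "ORG": ["Acme Corp"],
}

print(sequential_steps(pairs), parallel_steps(pairs))
```

In this toy case the longest single sequence, rather than the total output, bounds the number of sequential steps. Real speedups depend on the tokenizer, the hardware, and the batch size.
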
### Conclusion

PaDeLLM-NER significantly reduces the inference latency in the NER task through parallel decoding while maintaining or improving the prediction quality. This method performs well on multiple datasets and provides a new solution for efficient inference in the NER task.
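
The two-step inference and deduplication described in the method overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_batch` is a hypothetical interface standing in for batched LLM decoding, the prompt layout in `count_prompt` and `mention_prompt` is invented for the example, and the canned responses replace a real model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical batch interface: one call decodes every prompt in parallel and
# returns (generated_text, sequence_probability) for each prompt.
GenerateBatch = Callable[[List[str]], List[Tuple[str, float]]]


def count_prompt(text: str, label: str) -> str:
    # Assumed prompt layout; the paper's exact template may differ.
    return f"text: {text}\nentity type: {label}\n"


def mention_prompt(text: str, label: str, count: int, n: int) -> str:
    # The predicted count and the "<mention n>" marker are supplied in the
    # prompt, so the model only has to generate the mention itself.
    return f"{count_prompt(text, label)}{count}\n<mention {n}>"


def parallel_ner(
    text: str, labels: List[str], generate_batch: GenerateBatch
) -> Dict[str, str]:
    # Step 1: predict the number of mentions for every label in one batch.
    count_outputs = generate_batch([count_prompt(text, lb) for lb in labels])
    counts = [int(out.strip()) for out, _ in count_outputs]

    # Step 2: decode every (label, n-th mention) pair in one batch.
    jobs = [(lb, c, n) for lb, c in zip(labels, counts) for n in range(1, c + 1)]
    mention_outputs = generate_batch(
        [mention_prompt(text, lb, c, n) for lb, c, n in jobs]
    )

    # Deduplication: if the same mention appears under several labels,
    # keep the label whose prediction has the highest probability.
    best: Dict[str, Tuple[str, float]] = {}
    for (lb, _, _), (mention, prob) in zip(jobs, mention_outputs):
        mention = mention.strip()
        if mention not in best or prob > best[mention][1]:
            best[mention] = (lb, prob)
    return {mention: lb for mention, (lb, _) in best.items()}


# Canned responses standing in for a real model (all values are made up).
TEXT = "Alice flew to Washington"
CANNED = {
    count_prompt(TEXT, "PER"): ("2", 0.90),
    count_prompt(TEXT, "LOC"): ("1", 0.90),
    count_prompt(TEXT, "ORG"): ("0", 0.90),
    mention_prompt(TEXT, "PER", 2, 1): ("Alice", 0.95),
    mention_prompt(TEXT, "PER", 2, 2): ("Washington", 0.40),
    mention_prompt(TEXT, "LOC", 1, 1): ("Washington", 0.80),
}


def toy_generate_batch(prompts: List[str]) -> List[Tuple[str, float]]:
    return [CANNED[p] for p in prompts]


result = parallel_ner(TEXT, ["PER", "LOC", "ORG"], toy_generate_batch)
print(result)
```

Here "Washington" is predicted under both PER and LOC, and the higher-probability LOC prediction is kept. Keying deduplication on the mention string is a simplification made for this sketch.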

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
