Abstract: In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels and mentions for NER, significantly increasing the sequence length. To this end, we introduce Parallel Decoding in LLM for NER (PaDeLLM-NER), an approach that integrates seamlessly into existing generative model frameworks without necessitating additional modules or architectural modifications. PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency. Experiments reveal that PaDeLLM-NER significantly increases inference speed, running 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese. At the same time, it maintains prediction quality, with performance on par with the state of the art across various datasets.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper aims to reduce the generation latency of large language models (LLMs) on the named entity recognition (NER) task. Existing LLM-based approaches decode sequentially: all labels and entity mentions are generated autoregressively in one output, which significantly increases the sequence length and therefore the latency. To address this, the authors propose **Parallel Decoding in LLM for NER (PaDeLLM-NER)**, a method that integrates seamlessly into existing generative model frameworks without additional modules or architectural modifications. PaDeLLM-NER decodes all entity mentions simultaneously, thereby reducing generation latency.
### Main contributions
1. **Propose PaDeLLM-NER**: A new NER method that predicts all label-entity pairs in parallel, effectively reducing inference latency.
2. **Experimental verification**: Extensive experiments show that PaDeLLM-NER significantly improves inference efficiency: the average sequence length is reduced by approximately 87%, and inference runs 1.76 to 10.22 times faster than the traditional autoregressive method.
3. **Maintain or improve prediction quality**: Besides being faster, PaDeLLM-NER matches or exceeds the prediction quality of the traditional autoregressive method on multiple datasets, achieving performance comparable to existing state-of-the-art methods.
### Method overview
#### 3.1 Reconstruction of instruction fine-tuning
- **Input text segmentation**: A single unstructured text containing all label-entity pairs is split into multiple sequences. The output of each new sequence includes the number of mentions of the specified label (“entity type”) and the n-th mention of that label (marked by “<mention n>”).
- **Training objective**: The model is optimized with the cross-entropy loss. The loss is computed from the number of mentions onward and skips the “<mention n>” tokens, because these tokens are supplied during inference and do not need to be generated.
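
The reformulation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording (`text:` / `entity type:`), the function name `build_examples`, and the representation of each sequence as `(segment, contributes_to_loss)` pairs are all assumptions made for the example. Only the overall idea comes from the summary: one sequence per label-mention pair, with the loss covering the mention count and the mention text but not the prompt or the “<mention n>” marker.

```python
from collections import defaultdict


def build_examples(text, pairs, labels):
    """Split one annotated sample into per-mention training sequences.

    Each sequence is a list of (segment, contributes_to_loss) tuples.
    The prompt format is a hypothetical choice for this sketch.
    """
    by_label = defaultdict(list)
    for label, mention in pairs:
        by_label[label].append(mention)

    examples = []
    for label in labels:
        mentions = by_label[label]
        prompt = f"text: {text}\nentity type: {label}\n"
        count = str(len(mentions))
        if not mentions:
            # A label without mentions yields one sequence that only predicts the count.
            examples.append([(prompt, False), (count, True)])
            continue
        for n, mention in enumerate(mentions, start=1):
            examples.append([
                (prompt, False),                # prompt: excluded from the loss
                (count, True),                  # number of mentions: in the loss
                (f"\n<mention {n}> ", False),   # marker supplied at inference: excluded
                (mention, True),                # mention text: in the loss
            ])
    return examples


examples = build_examples(
    "Alice met Bob in Paris",
    [("PER", "Alice"), ("PER", "Bob"), ("LOC", "Paris")],
    ["PER", "LOC", "ORG"],
)
```

In a real training pipeline the boolean flags would be turned into a token-level loss mask after tokenization; that step is omitted here.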
#### 3.2 Inference of label-entity pairs
- **Two-step inference**:
  1. **Predict the number of mentions**: Given the label prompt, predict how many mentions of each label occur in the input.
  2. **Predict entity mentions**: Given the predicted counts, generate the entity mentions of each label. All label-entity pairs are decoded in parallel, either across multiple GPUs or through batch inference on a single GPU.
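
The two steps above can be sketched with batched decoding. This is a hedged illustration under stated assumptions: `generate_batch` is a hypothetical callable that takes a list of prompts and returns one completion per prompt, the prompt wording is invented for the example, and `fake_generate_batch` is a toy stand-in for a fine-tuned model so the sketch can run without one.

```python
def two_step_inference(text, labels, generate_batch):
    """Return (label, mention) pairs using two batched decoding rounds."""

    def base_prompt(label):
        # Hypothetical prompt format; not taken from the paper.
        return f"text: {text}\nentity type: {label}\n"

    # Step 1: one prompt per label; the model answers with a mention count.
    count_outputs = generate_batch([base_prompt(label) for label in labels])
    counts = {label: int(out.strip()) for label, out in zip(labels, count_outputs)}

    # Step 2: one prompt per (label, n); all prompts go into the same batch.
    jobs = [(label, n) for label in labels for n in range(1, counts[label] + 1)]
    if not jobs:
        return []
    prompts = [f"{base_prompt(label)}{counts[label]}\n<mention {n}> " for label, n in jobs]
    mentions = generate_batch(prompts)
    return [(label, mention.strip()) for (label, _), mention in zip(jobs, mentions)]


def fake_generate_batch(prompts):
    """Toy stand-in for a model: answers from a fixed annotation table."""
    gold = {"PER": ["Alice", "Bob"], "LOC": ["Paris"], "ORG": []}
    outputs = []
    for prompt in prompts:
        label = prompt.split("entity type: ")[1].split("\n")[0]
        if "<mention " in prompt:
            n = int(prompt.rsplit("<mention ", 1)[1].split(">")[0])
            outputs.append(gold[label][n - 1])
        else:
            outputs.append(str(len(gold[label])))
    return outputs


result = two_step_inference(
    "Alice met Bob in Paris", ["PER", "LOC", "ORG"], fake_generate_batch
)
```

Because each prompt in step 2 generates only a single mention, the longest sequence any one decoding call produces is much shorter than a full autoregressive output listing every pair, which is where the latency reduction comes from.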
#### 3.3 Deduplication of duplicate entities
- **Deduplication strategy**: Because each label is decoded independently, the same mention may be predicted under more than one label. Prediction probability is used to resolve such duplicates: the probability of each duplicated instance is computed, and only the instance with the highest probability is kept.
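
A minimal sketch of this strategy is shown below. It assumes each prediction already carries a probability score; how that score is derived from the model's token probabilities is not specified in this summary, so it is passed in as a plain number. The function name `deduplicate` and the choice to key on the mention string alone are simplifications made for illustration.

```python
def deduplicate(predictions):
    """Keep, for each mention string, the label with the highest probability.

    predictions: list of (label, mention, probability) tuples.
    Returns (label, mention) pairs in order of first appearance of each mention.
    """
    best = {}
    for label, mention, prob in predictions:
        # Replace the stored label only when a strictly higher probability appears.
        if mention not in best or prob > best[mention][1]:
            best[mention] = (label, prob)
    return [(label, mention) for mention, (label, _) in best.items()]


kept = deduplicate([
    ("PER", "Paris", 0.3),
    ("LOC", "Paris", 0.9),
    ("PER", "Alice", 0.8),
])
```

Here "Paris" was predicted as both `PER` and `LOC`, and only the higher-probability `LOC` prediction survives.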
### Experimental results
#### 4.1 Experimental setup
- **Datasets**: English and Chinese NER datasets, including CoNLL2003, ACE2005, GENIA, Weibo, and MSRA, among others.
- **Model setup**: Pre-trained Llama2-7b and Baichuan2-7b serve as base models.
- **Baseline methods**: Traditional autoregressive methods (such as AutoReg Aug and AutoReg Struct) and recent state-of-the-art methods (such as BINDER, GoLLIE, and DeepStruct).
#### 4.2 Main results
- **Latency evaluation**: PaDeLLM-NER significantly reduces inference latency, most notably on the Weibo dataset, where it runs 10.22 times faster than AutoReg Struct.
- **Prediction quality evaluation**: PaDeLLM-NER achieves the highest average F1 score (84.79) across multiple datasets and performs particularly well on the Weibo, Youku, and ACE2005 datasets.
### Conclusion
PaDeLLM-NER significantly reduces inference latency in the NER task through parallel decoding while maintaining or improving prediction quality. The method performs well on multiple datasets and offers a new solution for efficient inference in NER.