Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset

Arthur Amalvy, Vincent Labatut, Richard Dufour
2024-04-08
Abstract: While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the limited range of named entity recognition (NER) models when they are applied to long documents such as entire novels. NER methods based on pre-trained Transformer models (such as BERT) can only process a short window of text at a time, because the quadratic complexity of the attention mechanism restricts input length. As a result, they cannot make effective use of global, document-level context. This weakens entity disambiguation and, in turn, NER performance.

To solve this problem, the authors propose the following:

1. **Generate a synthetic context retrieval dataset**: No supervised data exists for this retrieval task, so supervised learning cannot be applied directly. The authors therefore use Alpaca, an instruction-tuned large language model (LLM), to generate a synthetic context retrieval training dataset. This dataset is used to train a neural context retriever that helps the NER model find relevant context.
2. **Train the neural context retriever**: Using the synthetic dataset, the authors train a neural context retriever based on a BERT model. Given an input text, the retriever finds context that is helpful for the NER task.
3. **Evaluation and comparison**: The authors run experiments on an English literary dataset consisting of the first chapters of 40 books, and compare the proposed method with multiple unsupervised retrieval baselines. The results show that the method outperforms the unsupervised baselines and, in some cases, even outperforms a re-ranker trained with manually annotated data.

### Main contributions

- A method for training a neural context retriever on a synthetic dataset, which addresses the limited range of NER models on long documents.
- Experiments showing that the method improves NER performance and, in some cases, matches or outperforms models trained with manually annotated data.
- Detailed experimental settings and result analysis that can serve as a reference for further research.

### Formula explanation

The paper does not involve complex formulas. Where a formula is needed, it is written in Markdown format. For example, the F1 score:

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

where:

- \( \text{Precision} \) is the precision rate
- \( \text{Recall} \) is the recall rate
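
To make the F1 formula concrete, here is a minimal Python sketch that computes precision, recall, and F1 over sets of entity spans. It assumes an exact-match convention in which a predicted entity counts as correct only if its start, end, and label all match a gold entity. That convention, the `Entity` type, the function name `precision_recall_f1`, and the toy data are illustrative assumptions and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, label).
# Exact-match scoring is an assumption made for this sketch.
Entity = Tuple[int, int, str]


def precision_recall_f1(
    gold: Set[Entity], predicted: Set[Entity]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match entity spans."""
    true_positives = len(gold & predicted)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three gold entities, four predictions, two exact matches.
gold = {(0, 1, "PER"), (5, 5, "LOC"), (9, 10, "PER")}
predicted = {(0, 1, "PER"), (5, 5, "PER"), (9, 10, "PER"), (12, 12, "LOC")}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case the prediction at position 5 has the right span but the wrong label, so it counts against both precision and recall. Better context retrieval aims to prevent this kind of disambiguation error.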