Abstract: While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
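
The document-level retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' retriever: the paper trains a BERT-based model, whereas the sketch below swaps in plain lexical (Jaccard) overlap so that it runs without any dependencies. The function names `tokenize`, `overlap_score`, `retrieve_context`, and `build_ner_input`, as well as the example sentences, are hypothetical.

```python
import re


def tokenize(text):
    """Return the set of lowercase word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_score(query, candidate):
    """Jaccard overlap between token sets.

    Assumption: this stands in for the trained neural retriever, which
    would assign a learned relevance score instead.
    """
    q, c = tokenize(query), tokenize(candidate)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


def retrieve_context(sentences, target_idx, k=1, score=overlap_score):
    """Return the indices of the k best-scoring sentences other than the target."""
    query = sentences[target_idx]
    candidates = [i for i in range(len(sentences)) if i != target_idx]
    # Rank by descending score; ties are broken by position in the document.
    ranked = sorted(candidates, key=lambda i: (-score(query, sentences[i]), i))
    return sorted(ranked[:k])  # restore document order


def build_ner_input(sentences, target_idx, k=1):
    """Combine the target sentence with its retrieved context, in document order."""
    keep = sorted(retrieve_context(sentences, target_idx, k) + [target_idx])
    return [sentences[i] for i in keep]


# Made-up example document (not taken from the paper's dataset).
document = [
    "Captain Mara Voss commanded the ship Albatross.",
    "The sea was calm that morning.",
    "Nobody spoke during breakfast.",
    "Voss walked the deck of the Albatross alone.",
]

# Sentence 3 mentions "Voss" and "Albatross" without saying what they are;
# sentence 0 is retrieved because it shares those mentions.
print(build_ner_input(document, target_idx=3, k=1))
```

An NER model would then tag the target sentence with the retrieved sentences prepended or appended as extra input, which is where a better retriever can help disambiguate mentions such as "Voss".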

What problem does this paper attempt to address?

This paper addresses the limited context range of named entity recognition (NER) models on long documents such as entire novels. NER methods based on pre-trained Transformer models (such as BERT) struggle with long documents: the quadratic complexity of the attention mechanism limits their input length, so they cannot make effective use of global, document-level context. This weakens entity disambiguation and hurts NER performance.

To solve this problem, the authors propose the following:

1. **Generate a synthetic context retrieval dataset**: Because no supervised data exists for this task, supervised learning cannot be applied directly. The authors therefore use Alpaca, an instruction-tuned large language model (LLM), to generate a synthetic context retrieval training dataset.
2. **Train the neural context retriever**: Using this synthetic dataset, the authors train a neural context retriever based on a BERT model. Given an input text, the retriever finds context elsewhere in the document that is helpful for the NER task.
3. **Evaluation and comparison**: The authors evaluate the method on an English literary dataset consisting of the first chapters of 40 books and compare it with several unsupervised retrieval baselines. The results show that the method outperforms these unsupervised baselines and, in some cases, even outperforms a re-ranker trained with manually annotated data.

### Main contributions

- A method for training a neural context retriever on a synthetic dataset, which addresses the limited context range of NER on long documents.
- Experiments showing that the method improves NER performance and can match or even outperform models trained with manually annotated data in some cases.
- Detailed experimental settings and result analysis that serve as a reference for further research.

### Formula explanation

The paper does not involve complex mathematical formulas. For reference, the F1 score is defined as:

\[ F1 = 2\times\frac{Precision\times Recall}{Precision + Recall} \]

where:

- \( Precision \) is the fraction of predicted entities that are correct
- \( Recall \) is the fraction of gold entities that are found
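
As a worked illustration of the F1 formula above, here is a minimal sketch that computes entity-level precision, recall, and F1. It assumes exact-match scoring over `(start, end, label)` spans; that protocol, the function names, and the example spans are assumptions for illustration, not details taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Score predicted entity spans against gold spans.

    Assumption: an entity counts as correct only if its start, end,
    and label all match a gold entity exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up spans: three gold entities, two predictions, one exact match.
gold = {(0, 1, "PER"), (5, 7, "LOC"), (9, 10, "PER")}
predicted = {(0, 1, "PER"), (5, 7, "ORG")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With one correct prediction out of two, precision is 1/2; with one of three gold entities found, recall is 1/3; the formula then gives F1 = 2 × (1/2 × 1/3) / (1/2 + 1/3) = 0.4.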

Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset
