Abstract: Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets faces issues due to inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain generalization. To address this, we present B2NERD, a cohesive and efficient dataset for Open NER, normalized from 54 existing English or Chinese datasets using a two-step approach. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly improves LLMs' generalization on Open NER. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.

What problem does this paper attempt to address?

This paper addresses the difficulty that large language models (LLMs) have with open named entity recognition (Open NER). Although existing LLMs have made remarkable progress, they still fall short in fine-grained entity classification and in generalizing across domains. The paper traces these shortcomings to two problems in the training data:

1. **Inconsistent entity definitions**: Different datasets define the same entity type differently, which confuses the model during training and inference. For example, some datasets distinguish locations such as "Times Square" from geopolitical entities such as "Paris", while others label both as "LOC".
2. **Data redundancy**: Most datasets over-annotate common entity types and contain few samples of long-tail types. This unbalanced distribution can cause the model to overfit on common entities, which hurts generalization.

To address these problems, the authors propose a two-step method for constructing an efficient and consistent Open NER dataset (B2NERD), and they train a model (B2NER) with stronger generalization ability on it. The two steps are:

1. **Standardization of entity definitions**: Resolve inconsistent entity definitions across datasets through automatic detection and expert review, producing a universal entity taxonomy of more than 400 entity types.
2. **Diversity-aware data pruning**: Reduce redundancy and improve generalization by selecting semantically diverse samples under each entity type.

Experimental results show that the B2NER model significantly outperforms GPT-4 and other existing methods on multiple benchmarks, especially in cross-domain generalization.
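The idea behind diversity-aware pruning can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a greedy farthest-first selection per entity type and uses Jaccard similarity over word sets as a crude stand-in for semantic similarity. The names `prune`, `tokens`, `jaccard`, and `budget_per_type` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased word set; an assumed, crude stand-in for a semantic embedding."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prune(samples, budget_per_type):
    """Greedily keep at most `budget_per_type` samples per entity type,
    preferring samples least similar to those already kept for that type.

    `samples` is a list of (text, entity_types) pairs.
    Returns the sorted indices of the kept samples.
    """
    # Group sample indices by every entity type they contain.
    by_type = defaultdict(list)
    for idx, (_, types) in enumerate(samples):
        for t in types:
            by_type[t].append(idx)

    token_sets = [tokens(text) for text, _ in samples]
    kept = set()
    for _, candidates in sorted(by_type.items()):
        chosen = [candidates[0]]  # seed with the first sample of this type
        remaining = candidates[1:]
        while remaining and len(chosen) < budget_per_type:
            # Farthest-first: pick the candidate whose most similar
            # already-chosen sample is the least similar overall.
            best = min(
                remaining,
                key=lambda i: max(
                    jaccard(token_sets[i], token_sets[j]) for j in chosen
                ),
            )
            chosen.append(best)
            remaining.remove(best)
        kept.update(chosen)
    return sorted(kept)


samples = [
    ("Paris is the capital of France", ["GPE"]),
    ("Paris is the capital city of France", ["GPE"]),
    ("Tokyo hosted the Olympic games", ["GPE"]),
    ("Times Square was crowded", ["LOC"]),
]
kept_indices = prune(samples, budget_per_type=2)
```

With a budget of two samples per type, the near-duplicate second sentence is dropped in favor of the more distinct third one, while the only "LOC" sample is retained. A realistic pipeline would likely replace the word-set similarity with sentence embeddings; that substitution is also an assumption here.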

Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

Incorporating Large Language Models into Named Entity Recognition: Opportunities and Challenges

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

EduNER: a Chinese Named Entity Recognition Dataset for Education Research

GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models

UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition

Towards Open-Domain Named Entity Recognition via Neural Correction Models

LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty

Towards Malay named entity recognition: an open-source dataset and a multi-task framework

VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition

An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition

DMNER: Biomedical Entity Recognition by Detection and Matching

GPT-NER: Named Entity Recognition via Large Language Models

LTNER: Large Language Model Tagging for Named Entity Recognition with Contextualized Entity Marking

Neural Correction Model for Open-Domain Named Entity Recognition

Enhanced Chinese Domain Named Entity Recognition: An Approach with Lexicon Boundary and Frequency Weight Features

NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data

LLM-DER: A Named Entity Recognition Method Based on Large Language Models for Chinese Coal Chemical Domain