Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

Yuming Yang,Wantong Zhao,Caishuang Huang,Junjie Ye,Xiao Wang,Huiyuan Zheng,Yang Nan,Yuran Wang,Xueying Xu,Kaixin Huang,Yunke Zhang,Tao Gui,Qi Zhang,Xuanjing Huang
2024-06-17
Abstract: Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets faces issues due to inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain generalization. To address this, we present B2NERD, a cohesive and efficient dataset for Open NER, normalized from 54 existing English or Chinese datasets using a two-step approach. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly improves LLMs' generalization on Open NER. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty that large language models (LLMs) have with Open Named Entity Recognition (Open NER), the task of identifying arbitrary entity types in arbitrary domains. Fine-tuning LLMs directly on existing NER datasets limits them to dataset-specific learning and hinders out-of-domain generalization, for two main reasons:

1. **Inconsistent entity definitions**: Different datasets define the same entity type differently, which confuses the model during training and inference. For example, some datasets distinguish locations such as "Times Square" from geopolitical entities such as "Paris", while other datasets label both as "LOC".
2. **Data redundancy**: Most datasets over-annotate common entity types and contain few samples of long-tail entity types. This imbalance can cause the model to overfit on common entities, which hurts generalization.

To address these challenges, the authors propose a two-step method for constructing a cohesive and efficient Open NER dataset (B2NERD), normalized from 54 existing English or Chinese datasets, and train models (B2NER) on it:

1. **Standardization of entity definitions**: Inconsistent entity definitions across datasets are found through automatic detection and expert review, then clarified with distinguishable label names, yielding a universal taxonomy of more than 400 entity types.
2. **Diversity-aware data pruning**: Redundancy is reduced by selecting fewer samples with greater category and semantic diversity under each entity type.

Experimental results show that B2NER models outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods on 3 out-of-domain benchmarks covering 15 datasets and 6 languages.
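
The diversity-aware pruning step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes token-overlap (Jaccard) similarity as a simple stand-in for a semantic similarity measure, and the names `prune_diverse`, `per_type_budget`, and `max_similarity`, as well as the toy samples, are hypothetical. The sketch keeps at most a fixed number of samples per entity type and skips any sample that is too similar to one already kept for that type.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased whitespace tokens as a set (a crude stand-in for an embedding)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prune_diverse(samples, per_type_budget, max_similarity=0.5):
    """Greedy per-type selection (hypothetical helper, not from the paper).

    samples: list of (text, entity_type) pairs.
    Keeps at most `per_type_budget` texts per entity type, skipping any text
    whose similarity to an already kept text of that type is >= max_similarity.
    """
    by_type = defaultdict(list)
    for text, entity_type in samples:
        by_type[entity_type].append(text)

    kept = []
    for entity_type, texts in by_type.items():
        selected = []
        for text in texts:
            if len(selected) >= per_type_budget:
                break
            candidate = tokens(text)
            if all(jaccard(candidate, tokens(s)) < max_similarity for s in selected):
                selected.append(text)
        kept.extend((text, entity_type) for text in selected)
    return kept


# Toy data: the second GPE sentence is a near-duplicate of the first.
samples = [
    ("Paris is the capital of France", "GPE"),
    ("Paris is the capital city of France", "GPE"),
    ("Berlin hosted the summit last week", "GPE"),
    ("Tokyo will host the games", "GPE"),
    ("Times Square was crowded", "LOC"),
]
kept = prune_diverse(samples, per_type_budget=2)
for text, entity_type in kept:
    print(entity_type, "|", text)
```

With a budget of two samples per type, the near-duplicate "Paris" sentence is skipped and the fourth GPE sentence is never reached, while the single LOC sample is retained. A budget applied per type is one simple way to limit over-represented common types while leaving long-tail types untouched; a real pipeline would likely replace the token-overlap measure with sentence embeddings.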