Abstract: When a digital library user searches for publications by author name, she often sees a mixture of publications by different authors who share that name. As digital libraries grow and more authors are involved, this author ambiguity problem is becoming critical. Author disambiguation (AD) typically addresses the problem by leveraging metadata such as coauthors, research topics, publication venues, and citation information, since more personal information such as contact details is often restricted or missing. In this paper, we study how to efficiently disambiguate author names given a continuous stream of newly published papers. To this end, we propose a “BatchAD+IncAD” framework for dynamic author disambiguation. First, we perform batch author disambiguation (BatchAD) to disambiguate all author names at a given time by grouping all records (each record is a paper paired with one of its author names) into disjoint clusters. This establishes a one-to-one mapping between the clusters and real-world authors. Then, for newly added papers, we periodically perform incremental author disambiguation (IncAD), which determines whether each new record should be assigned to an existing cluster or to a new cluster for an author not present in the previous data. Based on the new data, IncAD also tries to correct previous AD results. Our main contributions are: (1) We demonstrate with real data that a small number of new papers often share author names with a large portion of existing papers, so it is challenging for IncAD to effectively leverage previous AD results. (2) We propose a novel IncAD model that aggregates metadata from a cluster of records to estimate the author’s profile, such as her coauthor and keyword distributions, in order to predict how likely it is that a new record is “produced” by that author. (3) Using two labeled datasets and one large-scale raw dataset, we show that the proposed method is much more efficient than state-of-the-art methods while maintaining high accuracy.
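The incremental step described above can be illustrated with a minimal sketch. It is not the paper's model: it assumes each cluster profile is a pair of count tables (coauthors and keywords), scores a new record by its average add-alpha smoothed log-probability under those tables, and opens a new cluster when no score reaches a threshold. The names `Record`, `Cluster`, `assign`, and the values of `vocab_size`, `alpha`, and `threshold` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """One (paper, author-name) pair; the field names are hypothetical."""
    coauthors: list
    keywords: list


@dataclass
class Cluster:
    """Aggregated profile of the records currently attributed to one author."""
    coauthors: Counter = field(default_factory=Counter)
    keywords: Counter = field(default_factory=Counter)

    def add(self, record):
        # Fold the record's metadata into the author's profile.
        self.coauthors.update(record.coauthors)
        self.keywords.update(record.keywords)


def avg_log_prob(items, counts, vocab_size, alpha=1.0):
    """Mean log-probability of items under an add-alpha smoothed distribution."""
    if not items:
        return 0.0  # assumption: missing metadata contributes no evidence
    total = sum(counts.values()) + alpha * vocab_size
    return sum(math.log((counts[x] + alpha) / total) for x in items) / len(items)


def score(record, cluster, vocab_size):
    """How plausibly the record was 'produced' by the cluster's author."""
    return (avg_log_prob(record.coauthors, cluster.coauthors, vocab_size)
            + avg_log_prob(record.keywords, cluster.keywords, vocab_size))


def assign(record, clusters, vocab_size=100, threshold=-8.0):
    """Add the record to the best-scoring cluster, or open a new one.

    Returns the index of the cluster that received the record.
    vocab_size and threshold are illustrative values, not tuned ones.
    """
    best, best_score = None, -math.inf
    for i, cluster in enumerate(clusters):
        s = score(record, cluster, vocab_size)
        if s > best_score:
            best, best_score = i, s
    if best is None or best_score < threshold:
        clusters.append(Cluster())
        best = len(clusters) - 1
    clusters[best].add(record)
    return best
```

In this sketch, a record that shares coauthors and keywords with an existing profile joins that cluster, while a record with no overlap falls below the threshold and starts a new one. Correcting earlier AD results, which the abstract also attributes to IncAD, is not covered here.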
