Personal Names Popularity Estimation and its Application to Record Linkage

Ksenia Zhagorina, Pavel Braslavski, Vladimir Gusev
DOI: https://doi.org/10.48550/arXiv.1811.05361
2018-11-13
Databases
Abstract: This study addresses a simply stated problem: how to estimate the number of people bearing the same full name in a large population. Name popularity estimates can support personal name matching in databases and are of interest in many other domains. A distinctive feature of large collections of names is that they contain a large number of unique items, which is challenging for statistical modeling. We investigate a number of statistical techniques and also propose a simple yet effective method aimed at obtaining more accurate count estimates. In our experiments, we use a dataset containing about 20 million name occurrences that correspond to about 13 million real-world persons. We perform a thorough evaluation of the name count estimation methods and a record linkage experiment guided by name popularity estimates. The results suggest that theoretically informed approaches outperform simple heuristics and can be useful in a variety of applications.