DISC: Disambiguating homonyms using graph structural clustering
Ijaz Hussain,Sohail Asghar
DOI: https://doi.org/10.1177/0165551518761011
2018-03-05
Journal of Information Science
Abstract:Author name ambiguity degrades information retrieval, database integration, search results and, more importantly, correct attributions in bibliographic databases. Some unresolved issues include how to ascertain the actual number of authors, how to improve the performance and how to make the method more effective in terms of representative clustering metrics (average cluster purity, average author purity, K-metric, pairwise precision, pairwise recall, pairwise-F1, cluster precision, cluster recall and cluster-F1). It is a non-trivial task to disambiguate authors using only the implicit bibliographic information. An effective method ‘DISC’ is proposed that uses graph community detection algorithm, feature vectors and graph operations to disambiguate homonyms. The citation data set is pre-processed and ambiguous author blocks are formed. A co-authors graph is constructed using authors and their co-author’s relationships. A graph structural clustering ‘gSkeletonClu’ is applied to identify hubs, outliers and clusters of nodes in a co-author’s graph. Homonyms are resolved by splitting these clusters of nodes across the hub if their feature vector similarity is less than a predefined threshold. DISC utilises only co-authors and titles that are available in almost all bibliographic databases. With little modifications, DISC can also be used for entity disambiguation. To validate the DISC performance, experiments are performed on two Arnetminer data sets and compared with five previous unsupervised methods. Despite using limited bibliographic metadata, DISC achieves on average K-metric, pairwise-F1, and cluster-F1 of 92%, 84% and 74%, respectively, using Arnetminer-S and 86%, 80% and 57%, respectively, using Arnetminer-L. About 77.5% and 73.2% clusters are within the range (ground truth clusters ± 3) in Arnetminer-S and Arnetminer-L, respectively.
computer science, information systems,information science & library science