Abstract: Historical interpretation benefits from identifying analogies among famous people: Who are the Lincolns, Einsteins, Hitlers, and Mozarts? As a knowledge source that benefits many applications in language processing and knowledge representation, Wikipedia provides the information we need to make such comparisons. We investigate several approaches to converting the Wikipedia pages of approximately 600,000 historical figures into vector representations in order to quantify similarity. Wikipedia pages are also assigned to categories according to their content, and these categories serve as human-annotated labels. A rough similarity estimate could simply count the number of Wikipedia categories two pages share. However, such counting neither quantifies similarity well (e.g., is there a difference between pairs with the same number of shared categories?) nor distinguishes among categories (e.g., is "US Presidents" more important than "state lawyer" when defining similarity?). We use this count as an indicator to demonstrate the high-level agreement of our similarity detection algorithms. In particular, we investigate four different unsupervised approaches to representing the semantic associations of individuals: (1) TF-IDF, (2) a weighted average of distributed word embeddings, (3) LDA topic analysis, and (4) DeepWalk graph embedding from page links. All proved effective, but the DeepWalk embedding yielded an overall accuracy of 88.23% in our evaluation. Combining LDA and DeepWalk yielded even higher performance. Finally, we demonstrate that our similarity measurements can also be used to recognize the most descriptive Wikipedia categories for historical figures. We rank the descriptive level of Wikipedia categories according to their categorical coherence, and our ranking yields an overall agreement of 88.27% with human crowdsourced data.
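To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation, that sets the shared-category count against a cosine similarity over TF-IDF vectors (approach 1). The function names, the smoothed IDF formula, and the toy token lists and categories are all assumptions introduced for illustration only.

```python
import math
from collections import Counter


def shared_category_count(cats_a, cats_b):
    """Rough baseline: number of categories two pages have in common."""
    return len(set(cats_a) & set(cats_b))


def tfidf_vectors(docs):
    """Map each name to a sparse {term: weight} TF-IDF vector.

    `docs` maps a name to a list of tokens. The smoothed IDF used here,
    log((1 + n) / (1 + df)) + 1, is an assumed variant, not taken from the paper.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency: count each term once per doc
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors[name] = {
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data standing in for tokenized Wikipedia pages.
TOY_DOCS = {
    "Lincoln": ["president", "lawyer", "war", "union"],
    "Washington": ["president", "general", "war", "independence"],
    "Mozart": ["composer", "opera", "symphony", "vienna"],
}
# Hypothetical category assignments.
TOY_CATEGORIES = {
    "Lincoln": ["US Presidents", "Lawyers"],
    "Washington": ["US Presidents", "Generals"],
    "Mozart": ["Composers"],
}

if __name__ == "__main__":
    vecs = tfidf_vectors(TOY_DOCS)
    for a, b in [("Lincoln", "Washington"), ("Lincoln", "Mozart")]:
        print(a, b,
              shared_category_count(TOY_CATEGORIES[a], TOY_CATEGORIES[b]),
              round(cosine(vecs[a], vecs[b]), 3))
```

The category count yields only a coarse integer, whereas the vector similarity gives a graded score, which is the limitation of counting that the abstract describes.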

LlamaFur: Learning Latent Category Matrix to Find Unexpected Relations in Wikipedia

Cross-Lingual Entity Matching for Heterogeneous Online Wikis

Contextual Categorization Enhancement through LLMs Latent-Space

Exploiting Level-Wise Category Links for Semantic Relatedness Computing

Generalized Relation Learning with Semantic Correlation Awareness for Link Prediction

Link Prediction in Multi-relational Graphs using Additive Models

How Graph Structure and Label Dependencies Contribute to Node Classification in a Large Network of Documents

Predicting Unseen Links Using Learning-based Matrix Completion

Mining and Explaining Relationships in Wikipedia

Exploiting Wikipedia As External Knowledge For Document Clustering

Semantic Relationship Discovery with Wikipedia Structure

A Generalized Flow-Based Method for Analysis of Implicit Relationships on Wikipedia

Neural Cross-Lingual Entity Linking

LPFormer: An Adaptive Graph Transformer for Link Prediction

A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge Graphs

Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia

Vector-based similarity measurements for historical figures

Graph-Based Text Similarity Measurement by Exploiting Wikipedia As Background Knowledge

Discriminative Nonparametric Latent Feature Relational Models with Data Augmentation

Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval

Knowledge Transfer Across Multilingual Corpora Via Latent Topics