Domain and Range Aware Synthetic Negatives Generation for Knowledge Graph Embedding Models

Alberto Bernardi, Luca Costabello
2024-11-22
Abstract: Knowledge Graph Embedding models, representing entities and edges in a low-dimensional space, have been extremely successful at solving tasks related to completing and exploring Knowledge Graphs (KGs). One of the key aspects of training most of these models is teaching them to discriminate between true statements (positives) and false ones (negatives). However, the way in which negatives can be defined is not trivial, as facts missing from the KG are not necessarily false and a set of ground truth negatives is hardly ever given. This makes synthetic negative generation a necessity. Different generation strategies can heavily affect the quality of the embeddings, making it a primary aspect to consider. We revamp a strategy that generates corruptions during training respecting the domain and range of relations, we extend its capabilities, and we show that our methods bring substantial improvement (+10% MRR) for standard benchmark datasets and over +150% MRR for a larger ontology-backed dataset.
Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses a key problem in training Knowledge Graph Embedding (KGE) models: **how to generate high-quality synthetic negatives**. KGE models complete and explore Knowledge Graphs (KGs) by learning low-dimensional representations of entities and relations. A core part of training is teaching the model to distinguish true statements (positives) from false ones (negatives). Generating suitable negatives is hard for two reasons:

1. **Missing facts are not necessarily false**: a fact absent from the KG may simply be unrecorded.
2. **No ground-truth negatives**: a labeled set of false statements is hardly ever provided.

These constraints make **synthetic negative generation** a necessary part of training. Because the generation strategy heavily affects embedding quality, choosing it is a primary design decision.

To address this, the paper revamps and extends a **domain- and range-aware synthetic negative generation strategy**. Corruptions are generated during training so that they respect the domain and range of each relation, which improves the quality and diversity of the negatives. The reported experiments show an improvement of about +10% in Mean Reciprocal Rank (MRR) on standard benchmark datasets and of more than +150% MRR on a larger ontology-backed dataset.

### Specific improvement points

- **Combined with random uniform sampling**: To avoid the repeated sampling caused by classes with too few instances, the authors combine domain- and range-based negative generation with uniform random sampling, which keeps the negatives diverse and effective.
- **Applicable to different types of KGs**: The method works on standard benchmark datasets and is particularly suited to biological datasets with a clear ontology structure, such as Hetionet. In these datasets, ontology-defined classes yield more meaningful and diverse negatives.
- **Low computational overhead**: Compared with more complex negative generation methods, the proposed method adds very little computational overhead, so training stays efficient while performance improves.

In conclusion, the paper improves the performance of KGE models by improving the negative generation strategy, and it provides a useful reference for subsequent research.
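The idea of mixing domain- and range-aware corruptions with uniform random corruptions can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that the domain and range of a relation are inferred from the subjects and objects observed with it in the training triples, and the names `build_domain_range`, `sample_negative`, and `typed_ratio`, the filtering of known positives, and the toy triples are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def build_domain_range(triples):
    """Collect, per relation, the observed subjects (domain) and objects (range).

    Assumption: domain and range are inferred from the training triples,
    not read from an ontology.
    """
    domain, range_ = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        domain[p].add(s)
        range_[p].add(o)
    # Sorted lists keep sampling reproducible for a fixed seed.
    return ({p: sorted(v) for p, v in domain.items()},
            {p: sorted(v) for p, v in range_.items()})


def sample_negative(triple, entities, domain, range_, known, rng,
                    typed_ratio=0.5, max_tries=50):
    """Corrupt the subject or the object of `triple`.

    With probability `typed_ratio` (hypothetical parameter) the replacement
    is drawn from the relation's domain or range; otherwise it is drawn
    uniformly from all entities. Candidates that are known positives are
    rejected. Returns None if no candidate is found within `max_tries`.
    """
    s, p, o = triple
    for _ in range(max_tries):
        corrupt_subject = rng.random() < 0.5
        if rng.random() < typed_ratio:
            pool = domain[p] if corrupt_subject else range_[p]
        else:
            pool = entities
        e = rng.choice(pool)
        candidate = (e, p, o) if corrupt_subject else (s, p, e)
        if candidate not in known:
            return candidate
    return None


# Toy knowledge graph (illustrative data only).
triples = [
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "fever"),
    ("BRCA1", "associated_with", "breast_cancer"),
    ("TP53", "associated_with", "lung_cancer"),
]
entities = sorted({e for s, _, o in triples for e in (s, o)})
domain, range_ = build_domain_range(triples)
known = set(triples)

rng = random.Random(0)
negatives = [sample_negative(t, entities, domain, range_, known, rng)
             for t in triples]
for pos, neg in zip(triples, negatives):
    print(pos, "->", neg)
```

Setting `typed_ratio=1.0` yields purely domain- and range-aware corruptions, which can repeat often when a relation has few observed subjects or objects. Lowering it mixes in uniform corruptions, which is one simple way to restore diversity in that case.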