D2.4.2 Approximate Activation Spreading Executive Summary

Ivan Peikov, Onto, M. Grinberg, Vladimir Haltakov, H. Stefanov, A. Kiryakov, Damyan Ognyanoff, Ruslan Velkov
Abstract: This deliverable describes two new approaches to the task of selecting ('priming') the nodes relevant to a query. Such a reduction of the queried graph can be viewed as a query optimization technique, especially useful for sufficiently large datasets. Alternatively, 'priming' provides the means of selection needed by a LarKC selection plug-in. The new approaches can complement or replace the standard Spreading Activation (SA) method DualRDF, presented in D2.4.1 [1]. DualRDF is relatively slow because it multiplies a very large matrix by a vector (the matrix dimension equals the number of nodes in the dataset and can be of the order of millions). This document proposes two approximate SA methods that are computationally much more efficient. Both methods extract a numerical connectivity matrix for the nodes from the statements in the dataset, store it in an optimized format, and perform the SA operations on that representation. The first method, Cluster based SA (CbSA), rests on the assumption that meaningful clusterings (i.e. sets of clusters) of the nodes in the dataset can be found. The current deliverable presents a formal framework for this approach, which will facilitate future analysis, implementation, and testing. The second method, Node Selection based SA (NSbSA), takes advantage of the sparsity of the nodes' connectivity matrix and of existing formats for the compact representation of sparse matrices. In the course of the work, several computational explorations were carried out. They compared various programming languages and hardware, in combinations of Matlab, Java, C, and GPU-based (NVIDIA CUDA, [3]) computations, as well as different sparse matrix storage formats. The Linked Data Semantic Repository (LDSR, [4]), which contains about 100 million nodes and about 850 million statements, was used as a testbed for the NSbSA approach.
The numerical evaluation indicates that NSbSA is quite promising for real-time applications when used with a CUDA device and optimized data storage. When applied to the LDSR dataset, NSbSA activated 65×10^6 nodes in 15 iterations in about 5 seconds (in comparison, in similar settings DualRDF selected about 75 thousand entities in 34 seconds). Possible combinations of DualRDF, which implements exact SA, with the approximate but more efficient CbSA and NSbSA methods are also discussed.
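The role of sparse storage in spreading activation can be illustrated with a small sketch. It is not the NSbSA implementation from the deliverable; it only assumes that one SA iteration is a matrix-vector product with the connectivity matrix, followed by a threshold-based selection. The function names (`build_csr`, `spmv`, `spread`), the update rule, and the `decay` and `threshold` parameters are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_csr(num_nodes, edges):
    """Build a CSR (compressed sparse row) matrix from (src, dst, weight) triples.

    Row i stores the incoming links of node i, so (W a)[i] is the
    activation that node i receives from its neighbours.
    """
    rows = defaultdict(list)
    for src, dst, weight in edges:
        rows[dst].append((src, weight))
    indptr, indices, data = [0], [], []
    for i in range(num_nodes):
        for src, weight in sorted(rows[i]):
            indices.append(src)
            data.append(weight)
        indptr.append(len(indices))
    return indptr, indices, data


def spmv(csr, vec):
    """Sparse matrix-vector product; the work is proportional to the non-zeros."""
    indptr, indices, data = csr
    return [
        sum(data[k] * vec[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        for i in range(len(indptr) - 1)
    ]


def spread(csr, seeds, iterations=3, decay=0.5, threshold=0.1):
    """Assumed SA loop: a <- max(a, decay * W a); return nodes at or above threshold."""
    n = len(csr[0]) - 1
    activation = [1.0 if i in seeds else 0.0 for i in range(n)]
    for _ in range(iterations):
        received = spmv(csr, activation)
        activation = [max(a, decay * r) for a, r in zip(activation, received)]
    return {i for i, a in enumerate(activation) if a >= threshold}


# Toy graph: a chain 0 -> 1 -> 2 -> 3 plus an isolated node 4.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
csr = build_csr(5, edges)
primed = spread(csr, seeds={0}, iterations=3, decay=0.5, threshold=0.2)
print(primed)  # {0, 1, 2}
```

On this toy chain the activation halves at every hop, so node 3 (activation 0.125) falls below the threshold and node 4 is never reached. The three CSR arrays hold only the three non-zero entries instead of a dense 5×5 matrix, which is the property that makes such products feasible on graphs with millions of nodes.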