Quan Li, Kristanto Sean Njotoprawiro, Hammad Haleem, Qiaoan Chen, Chris Yi, Xiaojuan Ma
Abstract: Constructing latent vector representation for nodes in a network through embedding models has shown its practicality in many graph analysis applications, such as node classification, clustering, and link prediction. However, despite the high efficiency and accuracy of learning an embedding model, people have little clue of what information about the original network is preserved in the embedding vectors. The abstractness of low-dimensional vector representation, stochastic nature of the construction process, and non-transparent hyper-parameters all obscure understanding of network embedding results. Visualization techniques have been introduced to facilitate embedding vector inspection, usually by projecting the embedding space to a two-dimensional display. Although the existing visualization methods allow simple examination of the structure of embedding space, they cannot support in-depth exploration of the embedding vectors. In this paper, we design an exploratory visual analytics system that supports the comparative visual interpretation of embedding vectors at the cluster, instance, and structural levels. To be more specific, it facilitates comparison of what and how node metrics are preserved across different embedding models and investigation of relationships between node metrics and selected embedding vectors. Several case studies confirm the efficacy of our system. Experts' feedback suggests that our approach indeed helps them better embrace the understanding of network embedding models.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the lack of transparency and interpretability of network embedding results. Although network embedding techniques perform well in many graph analysis tasks, such as node classification, clustering, and link prediction, their internal structures and retained information are not transparent to users. Specifically, the paper focuses on the following aspects:
1. **Abstract Representation**: Typical embedding vectors have no clear meaning, and it is difficult to directly compare vectors in different embedding spaces, thus affecting the evaluation and utilization of embedding models.
2. **Inefficient Exploration**: Because the construction process is stochastic and the hyper-parameters are opaque, users must check embedding results manually by trial and error, with no intuitive interaction mechanism to guide them.
3. **Shallow Analysis**: Although existing visualization tools can show the global geometric distribution and local neighbor relationships, they are limited in fine-grained analysis (such as studying and comparing the abilities of different embedding models to retain semantic and structural information).
To solve these problems, the paper proposes an interactive visual analysis system named **EmbeddingVis**, aiming to help machine-learning practitioners understand and compare different embedding models. This system supports fine-grained analysis at three levels (cluster, instance, and structure) and provides rich real-time interaction functions, enabling users to inspect embedding results more effectively.
### Main Contributions
1. **Identifying Node Metric Correlations**: Through regression analysis, identify the node metric correlations between the graph space and the embedding space, and propose using the "average distance vector" to describe structural features.
2. **Enhanced Interactive Visualization**: Develop suitable interactive visualization methods to support fine-grained analysis of embedding vectors from three perspectives: cluster, instance, and structure.
3. **Practical Application Verification**: Verify the effectiveness of the system through cooperation with machine-learning practitioners and multiple case studies.
### Method Overview
1. **Regression Analysis**: Use a regression model to fit the relationship between the graph space and the embedding space, then calculate the importance of each node metric to see how well the embedding vectors retain it.
2. **Node Metrics**: Adopt multiple node metrics, including Degree, Betweenness, Leverage Centrality, K-nearest Neighbor (KNN), Closeness, PageRank, Within Module Degree, etc.
3. **Experimental Setup**: Evaluate the regression method on multiple real-world datasets, including the csphd, citeseer, wiki, and email datasets. Use the same parameter settings across the compared embedding models: DeepWalk, node2vec, and struc2vec.
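As a rough illustration of the node metrics listed above, the following sketch computes three of them (Degree, Closeness, and Leverage Centrality) on a toy star graph using only the Python standard library. The graph, the function names, and the exact normalizations are assumptions made for illustration; this summary does not describe how the paper itself computes the metrics.

```python
from collections import deque


def degree(adj):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness(adj):
    """Closeness = (reachable nodes) / (sum of shortest-path distances), via BFS."""
    result = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[src] = (len(dist) - 1) / total if total else 0.0
    return result


def leverage(adj):
    """Leverage centrality: mean of (k_v - k_w) / (k_v + k_w) over neighbors w of v."""
    deg = degree(adj)
    result = {}
    for v, nbrs in adj.items():
        if not nbrs:
            result[v] = 0.0  # isolated node: no neighbors to compare against
        else:
            result[v] = sum((deg[v] - deg[w]) / (deg[v] + deg[w]) for w in nbrs) / deg[v]
    return result


# Hypothetical toy graph: node 0 is the hub of a star with three leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

deg = degree(star)
clo = closeness(star)
lev = leverage(star)

for node in star:
    print(node, deg[node], round(clo[node], 3), round(lev[node], 3))
```

Each node ends up with a small vector of metric values; vectors like these form the "graph space" side of the regression described in the list above. For real datasets, a graph library would normally replace these hand-written functions.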
### Experimental Results
The experiments show how well different embedding models retain node metrics. For example, DeepWalk and node2vec mainly retain Within Module Degree, while struc2vec mainly retains Degree. These results are verified by the feature importance analysis of the decision-tree regression model.
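To make the idea of metric importance concrete, here is a minimal sketch using a simplified stand-in for the decision-tree regression: a single-split regression stump per metric, scored by the fraction of embedding variance it removes. It assumes the node metrics are the inputs and the embedding coordinates are the targets, which this summary implies but does not state outright. The data are synthetic, and all names (`sse`, `stump_importance`, `metrics`, `embeddings`) are hypothetical.

```python
import random


def sse(targets):
    """Sum of squared errors of multi-dimensional targets around their mean."""
    if not targets:
        return 0.0
    total = 0.0
    for d in range(len(targets[0])):
        column = [t[d] for t in targets]
        mean = sum(column) / len(column)
        total += sum((v - mean) ** 2 for v in column)
    return total


def stump_importance(metric_values, embeddings):
    """Largest fraction of embedding variance removed by one threshold on a metric."""
    base = sse(embeddings)
    if base == 0.0:
        return 0.0
    order = sorted(range(len(metric_values)), key=lambda i: metric_values[i])
    best = 0.0
    for cut in range(1, len(order)):
        if metric_values[order[cut - 1]] == metric_values[order[cut]]:
            continue  # equal values cannot be separated by a threshold
        left = [embeddings[i] for i in order[:cut]]
        right = [embeddings[i] for i in order[cut:]]
        best = max(best, (base - sse(left) - sse(right)) / base)
    return best


# Synthetic data (an assumption, not from the paper): the first embedding
# coordinate depends on degree only, and "pagerank" is unrelated noise.
rng = random.Random(0)
n = 60
metrics = {
    "degree": [rng.randint(1, 10) for _ in range(n)],
    "pagerank": [rng.random() for _ in range(n)],
}
embeddings = [
    [(1.0 if k > 5 else -1.0) + rng.gauss(0, 0.05), rng.gauss(0, 0.05)]
    for k in metrics["degree"]
]

importance = {name: stump_importance(vals, embeddings) for name, vals in metrics.items()}
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

A full decision-tree regressor repeats this kind of split recursively and accumulates the variance reduction per feature. The stump shows only the first level, which is enough to see why a metric the embedding encodes gets a high score.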
In conclusion, the **EmbeddingVis** system addresses the transparency and interpretability problems of network embedding results, giving machine-learning practitioners a tool for understanding and comparing network embedding models.