EmbeddingTree: Hierarchical Exploration of Entity Features in Embedding

Yan Zheng, Junpeng Wang, Chin-Chia Michael Yeh, Yujie Fan, Huiyuan Chen, Liang Wang, Wei Zhang
DOI: https://doi.org/10.1109/PacificVis56936.2023.00032
2023-08-03
Abstract: Embedding learning transforms discrete data entities into continuous numerical representations, encoding features/properties of the entities. Despite the outstanding performance reported from different embedding learning algorithms, few efforts were devoted to structurally interpreting how features are encoded in the learned embedding space. This work proposes EmbeddingTree, a hierarchical embedding exploration algorithm that relates the semantics of entity features with the less-interpretable embedding vectors. An interactive visualization tool is also developed based on EmbeddingTree to explore high-dimensional embeddings. The tool helps users discover nuance features of data entities, perform feature denoising/injecting in embedding training, and generate embeddings for unseen entities. We demonstrate the efficacy of EmbeddingTree and our visualization tool through embeddings generated for industry-scale merchant data and the public 30Music listening/playlists dataset.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

### Research Background and Objectives

- **Background**: Embedding learning transforms discrete data entities into continuous numerical representations that encode the entities' features or attributes. Although embedding learning algorithms perform well across many tasks, relatively little work explains how they encode features in the learned embedding space.
- **Objective**: The paper proposes `EmbeddingTree`, a hierarchical embedding exploration algorithm, together with an interactive visualization tool, to make embedding representations more interpretable. Specifically, the method relates the semantics of entity features to the corresponding, harder-to-interpret embedding vectors.

### Problems Addressed

1. **Structurally explaining feature encoding in embeddings**: `EmbeddingTree` gives a structural account of how features are encoded in the learned embedding space.
2. **Improving the interpretability of embeddings**: The interactive visualization tool helps users discover subtle features of data entities, perform feature denoising/injection during embedding training, and generate embeddings for unseen data entities.
3. **Hierarchical feature exploration**: When feature importance varies within a dataset, the features are arranged in a nested structure that users explore from top to bottom.
4. **Handling feature and embedding inconsistency**: The constructed `EmbeddingTree` exposes the hierarchical importance of entity features in the embeddings, which helps reveal potential inconsistencies between features and embeddings.

### Main Contributions

1. **Proposed an algorithm based on Gaussian Mixture Model (GMM)** to extract feature hierarchies from high-dimensional embeddings.
2. **Developed a visual analysis tool** to help users effectively explore embedding data based on the extracted hierarchies.
3. **Case studies**: Demonstrated the effectiveness of the `EmbeddingTree` algorithm and its visualization tool on merchant embeddings from credit card transaction data and on user/track embeddings from the public 30Music dataset.
4. **Application scenarios**: For example, for a new merchant with limited historical information, `EmbeddingTree` can quickly find the most similar merchant groups and use them to initialize the merchant's embedding.

In summary, the paper's main goal is to improve the interpretability of embedding representations, especially in scenarios with clear feature hierarchies.
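The summary above does not spell out how the GMM-based hierarchy extraction works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each candidate feature is scored by fitting one diagonal Gaussian per feature value and summing the log-likelihood of the embeddings, that the best-scoring feature becomes the split at the current tree level, and that the procedure then recurses into each group. The function names (`split_log_likelihood`, `build_tree`), the stopping parameters, and the synthetic `category`/`region` features are all hypothetical.

```python
import math
import random
from collections import defaultdict


def split_log_likelihood(embeddings, labels, eps=1e-6):
    """Score a feature (assumed criterion): log-likelihood of the embeddings
    when each feature value gets its own diagonal Gaussian, weighted by the
    share of entities that carry that value."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    n, total = len(embeddings), 0.0
    for members in groups.values():
        total += len(members) * math.log(len(members) / n)  # mixture weights
        for dim in zip(*members):  # one embedding dimension at a time
            mean = sum(dim) / len(dim)
            var = sum((x - mean) ** 2 for x in dim) / len(dim) + eps
            total += sum(
                -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
                for x in dim
            )
    return total


def build_tree(embeddings, features, min_size=20, max_depth=3):
    """Recursively split on the feature whose per-value Gaussians fit best.
    `features` maps a feature name to one categorical value per entity."""
    candidates = [name for name, vals in features.items() if len(set(vals)) > 1]
    if len(embeddings) < min_size or max_depth == 0 or not candidates:
        return {"size": len(embeddings)}  # leaf node
    best = max(
        candidates,
        key=lambda name: split_log_likelihood(embeddings, features[name]),
    )
    children = {}
    for value in sorted(set(features[best])):
        keep = [i for i, v in enumerate(features[best]) if v == value]
        sub_emb = [embeddings[i] for i in keep]
        sub_feat = {
            name: [vals[i] for i in keep]
            for name, vals in features.items()
            if name != best
        }
        children[value] = build_tree(sub_emb, sub_feat, min_size, max_depth - 1)
    return {"size": len(embeddings), "split_on": best, "children": children}


# Synthetic data (made up for illustration): `category` shifts the embedding
# mean, while `region` is unrelated to the embedding.
random.seed(0)
category = [random.choice(["grocery", "travel"]) for _ in range(400)]
region = [random.choice(["east", "west"]) for _ in range(400)]
embeddings = [
    [random.gauss(5.0 if c == "travel" else 0.0, 1.0) for _ in range(8)]
    for c in category
]
tree = build_tree(embeddings, {"category": category, "region": region}, max_depth=2)
print(tree["split_on"], {v: child["size"] for v, child in tree["children"].items()})
```

In this sketch the raw log-likelihood favors features with many distinct values, because smaller groups always fit more tightly. A practical version would likely add a complexity penalty such as BIC and use full-covariance components; both choices are left out here to keep the example short.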