Order preserving hierarchical agglomerative clustering

Daniel Bakkelund
DOI: https://doi.org/10.1007/s10994-021-06125-0
2021-09-09
Abstract: Partial orders and directed acyclic graphs are commonly recurring data structures that arise naturally in numerous domains and applications and are used to represent ordered relations between entities in the domains. Examples are task dependencies in a project plan, transaction order in distributed ledgers and execution sequences of tasks in computer programs, just to mention a few. We study the problem of order preserving hierarchical clustering of this kind of ordered data. That is, if we have $a < b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, then we shall have $[a] < [b]$ in the produced clustering. The clustering is similarity based and uses standard linkage functions, such as single- and complete linkage, and is an extension of classical hierarchical clustering. To achieve this, we define the output from running classical hierarchical clustering on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the p-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting. A reference implementation is employed for experiments on both synthetic random data and real world data from a database of machine parts. When compared to existing methods, the experiments show that our method excels both in cluster quality and order preservation.
Machine Learning
What problem does this paper attempt to address?
The paper addresses how to perform hierarchical clustering on data with a partially ordered structure while preserving the original order relation. Specifically, if \(a < b\) holds in the original data, the corresponding clusters \([a]\) and \([b]\) must satisfy \([a] < [b]\) after clustering. Order-preserving hierarchical clustering of this kind applies to, for example, task dependencies in project plans, transaction order in distributed ledgers, and task execution sequences in computer programs.

### Overview of the Main Problem

Given a set \(X\) equipped with a strict partial order (a strict poset) and a dissimilarity measure, the goal is to produce a hierarchical clustering that is optimal with respect to that measure and never violates the order relation on \(X\).

### Key Points of the Solution

1. **Order-Preserving Hierarchical Clustering**: The paper extends classical similarity-based hierarchical agglomerative clustering, with standard linkage functions such as single and complete linkage, so that the order relation is preserved: whenever \(a < b\) in \(X\), the clusters satisfy \([a] < [b]\).
2. **Partial Dendrograms**: Running classical hierarchical clustering on strictly ordered data produces "partial dendrograms": sub-trees of classical dendrograms with several connected components. The paper embeds partial dendrograms over a set into the family of ultrametrics over the same set, and defines an optimal hierarchical clustering as the partial dendrogram whose ultrametric is closest to the original dissimilarity measure.
3. **Optimization Method**: The best clustering is selected by minimizing the \(p\)-norm of the difference between the original dissimilarity measure and the ultrametric. The method therefore combines classical hierarchical clustering with ultrametric fitting.
4. **Practical Applications**: A reference implementation is evaluated on synthetic random data and on real-world data from a database of machine parts. Compared with existing methods, the reported experiments show better cluster quality and better order preservation.

### Formula Presentation

- **Objective Function of Ultrametric Fitting**:
  \[ \min_{u \in U(X)} \|D - u\|_p \]
  where \(D\) is the original dissimilarity matrix, \(u\) is an ultrametric in \(U(X)\), and \(\|\cdot\|_p\) denotes the \(p\)-norm.
- **Embedding of Partial Dendrograms**:
  \[ \phi:\mathcal{P}(X)\to U(X) \]
  where \(\mathcal{P}(X)\) is the set of partial dendrograms over \(X\), \(U(X)\) is the set of ultrametrics over \(X\), and \(\phi\) is the embedding.

Together, these constructions give a way to cluster hierarchically while keeping the original order relation of the data intact.
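The order-preservation requirement \(a < b \Rightarrow [a] < [b]\) can be illustrated with a small check. The sketch below is not the paper's reference implementation. It assumes a simplified reading of the requirement: the relation induced on clusters must contain no cycle, and no two comparable elements may share a cluster. The names `induced_cluster_edges` and `is_order_preserving` are hypothetical.

```python
from collections import defaultdict


def induced_cluster_edges(order_pairs, cluster_of):
    """Map each relation a < b to an edge [a] -> [b] between cluster labels."""
    return {(cluster_of[a], cluster_of[b]) for a, b in order_pairs}


def is_order_preserving(order_pairs, cluster_of):
    """Return True if the induced relation on clusters has no cycle.

    Assumption (simplified reading, not taken from the paper's text):
    a clustering preserves the order when the induced cluster graph is
    acyclic. A self-loop means two comparable elements share a cluster,
    which also fails the check.
    """
    edges = induced_cluster_edges(order_pairs, cluster_of)
    successors = defaultdict(set)
    indegree = {c: 0 for c in set(cluster_of.values())}
    for src, dst in edges:
        if src == dst:
            return False
        successors[src].add(dst)
        indegree[dst] += 1

    # Kahn's algorithm: repeatedly remove clusters with no incoming edge.
    # If every cluster can be removed, the graph is acyclic.
    ready = [c for c, deg in indegree.items() if deg == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for nxt in successors[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed == len(indegree)


# Strict order on {a, b, c, d}: a < b and c < d.
order = [("a", "b"), ("c", "d")]
print(is_order_preserving(order, {"a": 0, "c": 0, "b": 1, "d": 1}))
print(is_order_preserving(order, {"a": 0, "d": 0, "b": 1, "c": 1}))
```

In the first clustering every relation points from cluster 0 to cluster 1, so the order survives. In the second, `a < b` and `c < d` induce edges in both directions between the two clusters, so no consistent order on clusters exists.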
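The fitting objective \(\|D - u\|_p\) can be evaluated for a concrete candidate ultrametric, as sketched below. The sketch makes several assumptions: it ignores the partial order and the embedding \(\phi\) entirely, it uses the classical single-linkage (subdominant) ultrametric as the candidate \(u\), and it takes the \(p\)-norm over unordered pairs \(i < j\). The subdominant ultrametric is only one element of \(U(X)\) and is generally not the minimizer of the objective. All function names are hypothetical.

```python
from itertools import combinations, product


def subdominant_ultrametric(d):
    """Minimax-path closure of a symmetric dissimilarity matrix.

    u[i][j] is the smallest possible "largest step" over all paths from
    i to j, which equals the single-linkage merge height of i and j.
    """
    n = len(d)
    u = [row[:] for row in d]
    # product() varies its last index fastest, so k is the outermost loop,
    # as the Floyd-Warshall style update requires.
    for k, i, j in product(range(n), repeat=3):
        u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def is_ultrametric(u, tol=1e-12):
    """Check the strong triangle inequality u(i,j) <= max(u(i,k), u(k,j))."""
    n = len(u)
    return all(
        u[i][j] <= max(u[i][k], u[k][j]) + tol
        for i, j, k in product(range(n), repeat=3)
    )


def p_norm_distance(d, u, p=2):
    """p-norm of D - u over unordered pairs i < j (an assumed convention)."""
    n = len(d)
    total = sum(abs(d[i][j] - u[i][j]) ** p for i, j in combinations(range(n), 2))
    return total ** (1 / p)


D = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
u = subdominant_ultrametric(D)
print(u)                      # [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
print(is_ultrametric(D))      # False: 4 > max(1, 2)
print(is_ultrametric(u))      # True
print(p_norm_distance(D, u))  # 2.0
```

Only the pair \((0, 2)\) changes, from 4 to 2, so the fitting error is 2 for any \(p\). In the paper's setting the candidates are restricted to ultrametrics that correspond to order-preserving partial dendrograms, which this sketch does not model.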