Abstract: Partial orders and directed acyclic graphs are commonly recurring data structures that arise naturally in numerous domains and applications, where they represent ordered relations between entities. Examples include task dependencies in a project plan, transaction order in distributed ledgers, and execution sequences of tasks in computer programs. We study the problem of order-preserving hierarchical clustering of this kind of ordered data. That is, if we have $a < b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, then we shall have $[a] < [b]$ in the produced clustering. The clustering is similarity based, uses standard linkage functions such as single and complete linkage, and is an extension of classical hierarchical clustering. To achieve this, we define the output from running classical hierarchical clustering on strictly ordered data to be partial dendrograms: sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the $p$-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting. A reference implementation is employed for experiments on both synthetic random data and real-world data from a database of machine parts. When compared to existing methods, the experiments show that our method excels both in cluster quality and order preservation.
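
The order-preservation requirement stated in the abstract can be illustrated with a small check. The sketch below is not the paper's algorithm; it assumes one plausible reading of the requirement: the relation induced on clusters (cluster $A$ precedes cluster $B$ whenever some $a \in A$ and $b \in B$ satisfy $a < b$) must contain no cycles, and no two comparable elements may share a cluster. The names `preserves_order`, `relations`, and `cluster_of` are hypothetical.

```python
from collections import defaultdict


def preserves_order(relations, cluster_of):
    """Check whether a flat clustering respects a strict partial order.

    relations  -- iterable of pairs (a, b), each meaning a < b
    cluster_of -- dict mapping every element to its cluster label
    (Both names are illustrative assumptions, not taken from the paper.)
    """
    succ = defaultdict(set)            # induced edges between clusters
    clusters = set(cluster_of.values())
    for a, b in relations:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            return False               # comparable elements were merged
        succ[ca].add(cb)

    # Kahn's algorithm: the induced relation is acyclic exactly when
    # every cluster can be removed in topological order.
    indegree = {c: 0 for c in clusters}
    for targets in list(succ.values()):
        for d in targets:
            indegree[d] += 1
    ready = [c for c in clusters if indegree[c] == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for d in succ[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return removed == len(clusters)


# Two independent chains: a < b and c < d.
relations = [("a", "b"), ("c", "d")]
good = {"a": 0, "c": 0, "b": 1, "d": 1}   # cluster 0 precedes cluster 1
bad = {"a": 0, "d": 0, "b": 1, "c": 1}    # 0 precedes 1 and 1 precedes 0

print(preserves_order(relations, good))   # True
print(preserves_order(relations, bad))    # False
```

In the `bad` clustering each merge looks harmless on its own, since the merged elements are incomparable, but together the merges create a cycle between the two clusters. This is the kind of violation an order-preserving method has to rule out.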

What problem does this paper attempt to address?

The paper addresses how to perform hierarchical clustering on partially ordered data without violating the original order relation. If $a < b$ holds in the input, the corresponding clusters $[a]$ and $[b]$ must satisfy $[a] < [b]$ in the output. Typical examples of such data are task dependencies in project plans, transaction order in distributed ledgers, and execution sequences of tasks in computer programs.

### Overview of the Main Problem

Given a set $X$ equipped with a strict partial order (a strict poset) and a dissimilarity measure on $X$, the goal is to produce a hierarchical clustering that is optimal with respect to the dissimilarity measure while preserving the order relation.

### Key Points of the Solution

1. **Order-Preserving Hierarchical Clustering**: The method extends classical similarity-based hierarchical clustering with standard linkage functions, such as single and complete linkage, so that the resulting clusters respect the strict partial order on $X$.
2. **Partial Dendrograms**: The output of running classical hierarchical clustering on strictly ordered data is defined to be a partial dendrogram: a sub-tree of a classical dendrogram with several connected components.
3. **Optimization Method**: Partial dendrograms over $X$ are embedded into the family of ultrametrics over $X$. The optimal hierarchical clustering is the partial dendrogram whose ultrametric is closest to the original dissimilarity measure in the $p$-norm. The method is therefore a combination of classical hierarchical clustering and ultrametric fitting.
4. **Experimental Evaluation**: A reference implementation is tested on synthetic random data and on real-world data from a database of machine parts. The paper reports that the method outperforms existing methods in both cluster quality and order preservation.

### Formula Presentation

- **Objective Function of Ultrametric Fitting**:
  \[ \min_{u \in U(X)} \|D - u\|_p \]
  where $D$ is the original dissimilarity measure on $X$, $u$ is an ultrametric on $X$, and $\|\cdot\|_p$ is the $p$-norm.
- **Embedding of Partial Dendrograms**:
  \[ \phi : \mathcal{P}(X) \to U(X) \]
  where $\mathcal{P}(X)$ is the set of partial dendrograms over $X$, $U(X)$ is the set of ultrametrics over $X$, and $\phi$ is the embedding. The optimal clustering is the partial dendrogram whose image under $\phi$ is closest to $D$.

Together, these components yield an order-preserving extension of classical hierarchical clustering.
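
The ultrametric fitting objective above can be made concrete with a small numerical sketch. It is an illustration under stated assumptions and not the paper's reference implementation: the candidate ultrametrics are ordinary full-dendrogram ultrametrics rather than images of partial dendrograms under $\phi$, the $p$-norm is taken over the upper-triangular entries of the matrices, and the function names and the example matrix `D` are made up for this example.

```python
import numpy as np


def is_ultrametric(u, tol=1e-12):
    """Check the strong triangle inequality u[i,j] <= max(u[i,k], u[k,j])."""
    n = len(u)
    return all(
        u[i, j] <= max(u[i, k], u[k, j]) + tol
        for i in range(n) for j in range(n) for k in range(n)
    )


def single_linkage_ultrametric(d):
    """Largest ultrametric below d, via minimax paths (Floyd-Warshall style)."""
    u = np.asarray(d, dtype=float).copy()
    for k in range(len(u)):
        # Routing through k costs the larger of the two legs.
        u = np.minimum(u, np.maximum(u[:, [k]], u[[k], :]))
    return u


def fit_error(d, u, p=2):
    """p-norm of D - u over the upper triangle (each pair counted once)."""
    iu = np.triu_indices(len(d), k=1)
    return np.linalg.norm(np.asarray(d, dtype=float)[iu] - u[iu], ord=p)


# Illustrative dissimilarity matrix on three points (not from the paper).
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])

# Hypothetical candidates: the single-linkage ultrametric and a hand-built
# one that joins {0, 1} with point 2 at the average height 3.5.
candidates = {
    "single": single_linkage_ultrametric(D),
    "average": np.array([[0.0, 1.0, 3.5],
                         [1.0, 0.0, 3.5],
                         [3.5, 3.5, 0.0]]),
}

best = min(candidates, key=lambda name: fit_error(D, candidates[name], p=2))
print(best)  # average
```

Here `D` itself is not an ultrametric, because the pair (0, 2) at distance 4 exceeds the larger of the other two sides. Choosing the candidate with the smallest `fit_error` mirrors the selection rule described above, with the caveat that in the paper the candidates are partial dendrograms.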

Order preserving hierarchical agglomerative clustering

An objective function for order preserving hierarchical clustering

Data Aggregation for Hierarchical Clustering

Scalable Hierarchical Agglomerative Clustering

Data Structures & Algorithms for Exact Inference in Hierarchical Clustering

Improved Hierarchical Clustering on Massive Datasets with Broad Guarantees

OPHCLUS: an order-preserving based hierarchical clustering algorithm

Hierarchical Clustering: Objective Functions and Algorithms

Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs

Fair Algorithms for Hierarchical Agglomerative Clustering

The Price of Hierarchical Clustering

A Novel Hierarchical Clustering Approach Based on Universal Gravitation

On The Equivalence of Tries and Dendrograms - Efficient Hierarchical Clustering of Traffic Data

Mining Arbitrary Shaped Clusters and Outputting a High Quality Dendrogram

Robust Hierarchical Clustering for Directed Networks: An Axiomatic Approach

Online Hierarchical Clustering Approximations

Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities

The Impact of Isolation Kernel on Agglomerative Hierarchical Clustering Algorithms

Supervised Hierarchical Clustering with Exponential Linkage

Hierarchical Overlapping Clustering of Network Data Using Cut Metrics

mdendro: An R package for extended agglomerative hierarchical clustering