Selecting Optimal Trace Clustering Pipelines with AutoML

Sylvio Barbon Jr, Paolo Ceravolo, Ernesto Damiani, Gabriel Marques Tavares
DOI: https://doi.org/10.48550/arXiv.2109.00635
2021-09-02
Abstract: Trace clustering has been extensively used to preprocess event logs. By grouping similar behavior, these techniques guide the identification of sub-logs, producing more understandable models and conformance analytics. Nevertheless, little attention has been paid to the relationship between event log properties and clustering quality. In this work, we propose an Automatic Machine Learning (AutoML) framework to recommend the most suitable pipeline for trace clustering given an event log, which encompasses the encoding method, clustering algorithm, and its hyperparameters. Our experiments were conducted using a thousand event logs, four encoding techniques, and three clustering methods. Results indicate that our framework sheds light on the trace clustering problem and can assist users in choosing the best pipeline considering their scenario.
Machine Learning, Information Retrieval, Software Engineering
What problem does this paper attempt to address?
The paper addresses how to automatically recommend the most suitable trace clustering pipeline from the characteristics of an event log, so as to improve the quality of trace clustering. Specifically, it targets three choices:

1. **Selecting an appropriate encoding method**: Different encoding methods transform traces into different feature vectors or similarity measures, which affects the clustering results.
2. **Selecting an appropriate clustering algorithm**: Different clustering algorithms rely on different heuristics and suit different types of data and scenarios.
3. **Optimizing the hyperparameters of the clustering algorithm**: Even the same clustering algorithm produces different results under different hyperparameter settings.

To achieve this, the authors propose a framework based on AutoML (Automated Machine Learning) that uses meta-learning to recommend the best trace clustering pipeline. The steps are as follows:

- **Meta-feature extraction**: Extract meta-features from event logs that describe the characteristics of each log.
- **Meta-target definition**: Define a set of candidate combinations of encoding method, clustering algorithm, and hyperparameters, then evaluate and rank them with quality metrics.
- **Meta-database construction**: Combine meta-features and meta-targets into a meta-dataset.
- **Meta-learning**: Train a meta-model on the meta-dataset so that it can recommend an encoding method, clustering algorithm, and hyperparameter combination for a new event log.

### Formula presentation

The paper uses several formulas to evaluate clustering quality. Two important ones are:

1. **Silhouette Coefficient**:
   \[ s=\frac{b - a}{\max(a, b)} \]
   where \(a\) is the average distance from a sample to the other points in the same cluster, and \(b\) is the average distance from the sample to all points in the nearest other cluster. The Silhouette Coefficient ranges over \([-1, 1]\), where \(1\) represents the best clustering, \(0\) represents overlapping clusters, and \(-1\) represents the worst clustering.
2. **Variant Score**:
   \[ v = \sum_{C_i\in C}\frac{\# \text{variants}- 1}{\# \text{traces}} \]
   where \(\# \text{variants}\) is the number of unique traces in cluster \(C_i\), and \(\# \text{traces}\) is the total number of traces in the event log. The optimal value of the Variant Score is \(0\), reached when every cluster contains a single variant, which indicates clear separation of variants across clusters.

By ranking candidate pipelines with these metrics and learning from the resulting meta-dataset, the proposed method recommends a trace clustering pipeline for a given event log, with the aim of improving clustering quality and interpretability.
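
The two quality metrics above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `silhouette_sample` and `variant_score`, the use of Euclidean distance, the handling of singleton clusters, and the toy data are all assumptions made for the example. Traces are assumed to be sequences of activity labels, and `points` are assumed to be already-encoded trace vectors.

```python
from collections import defaultdict
from math import dist
from statistics import mean


def silhouette_sample(index, points, labels):
    """Silhouette s = (b - a) / max(a, b) for the sample at `index`.

    Assumes Euclidean distance between encoded traces (an assumption
    of this sketch; other distances could be substituted).
    """
    own = labels[index]
    # Distances from the sample to every other point, grouped by cluster.
    by_cluster = defaultdict(list)
    for j, (point, label) in enumerate(zip(points, labels)):
        if j != index:
            by_cluster[label].append(dist(points[index], point))

    other_means = [mean(d) for lab, d in by_cluster.items() if lab != own]
    if not other_means:
        raise ValueError("silhouette needs at least two clusters")
    if not by_cluster.get(own):
        return 0.0  # convention chosen here for a singleton cluster

    a = mean(by_cluster[own])  # mean distance within the sample's cluster
    b = min(other_means)       # mean distance to the nearest other cluster
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def variant_score(traces, labels):
    """Sum over clusters of (#variants - 1) / #traces; 0 is optimal."""
    variants = defaultdict(set)
    for trace, label in zip(traces, labels):
        variants[label].add(tuple(trace))
    return sum(len(v) - 1 for v in variants.values()) / len(traces)


# Toy event log: four traces assigned to two clusters.
traces = [("a", "b", "c"), ("a", "b", "c"), ("a", "c", "b"), ("a", "d")]
trace_labels = [0, 0, 0, 1]
print(variant_score(traces, trace_labels))  # 0.25

# Toy encoded traces: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
point_labels = [0, 0, 1, 1]
print(mean(silhouette_sample(i, points, point_labels) for i in range(4)))
```

In the toy log, cluster 0 holds two variants and cluster 1 holds one, so the Variant Score is (1 + 0) / 4. A candidate pipeline that placed each variant in its own cluster would score 0.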