Abstract: Trace clustering has been extensively used to preprocess event logs. By grouping similar behavior, these techniques guide the identification of sub-logs, producing more understandable models and conformance analytics. Nevertheless, little attention has been paid to the relationship between event log properties and clustering quality. In this work, we propose an Automatic Machine Learning (AutoML) framework to recommend the most suitable pipeline for trace clustering given an event log, which encompasses the encoding method, clustering algorithm, and its hyperparameters. Our experiments were conducted using a thousand event logs, four encoding techniques, and three clustering methods. Results indicate that our framework sheds light on the trace clustering problem and can assist users in choosing the best pipeline considering their scenario.

What problem does this paper attempt to address?

This paper asks how to automatically recommend the most suitable trace clustering pipeline for a given event log, based on the characteristics of that log, so as to improve clustering quality. Specifically, it addresses three choices:

1. **Selecting an appropriate encoding method**: Different encoding methods transform traces into different feature vectors or similarity measures, which affects the clustering results.
2. **Selecting an appropriate clustering algorithm**: Different clustering algorithms rely on different heuristics and suit different types of data and scenarios.
3. **Optimizing the hyperparameters of the clustering algorithm**: Even the same clustering algorithm produces different results under different hyperparameter settings.

To this end, the authors propose a framework based on AutoML (Automated Machine Learning) that uses meta-learning to recommend the best trace clustering pipeline. The steps are as follows:

- **Meta-feature extraction**: Extract meta-features from event logs that describe the characteristics of each log.
- **Meta-target definition**: Define a set of candidate combinations of encoding method, clustering algorithm, and hyperparameters, then evaluate and rank them using quality indicators.
- **Meta-database construction**: Combine meta-features and meta-targets into a meta-dataset.
- **Meta-learning**: Train a meta-model on the meta-dataset so that it can recommend the optimal encoding method, clustering algorithm, and hyperparameter combination for a new event log.

### Formula presentation

The paper uses several key formulas to evaluate clustering quality. Two important ones are:

1. **Silhouette Coefficient**:
   \[ s=\frac{b - a}{\max(a, b)} \]
   where \(a\) is the average distance from a sample to the other points in the same cluster, and \(b\) is the average distance from the sample to all points in the nearest other cluster. The Silhouette Coefficient ranges over \([-1, 1]\), where \(1\) represents the best clustering, \(0\) represents overlapping clusters, and \(-1\) represents the worst.
2. **Variant Score**:
   \[ v = \sum_{C_i\in C}\frac{\# \text{variants}- 1}{\# \text{traces}} \]
   where \(\# \text{variants}\) is the number of unique traces (variants) in cluster \(C_i\), and \(\# \text{traces}\) is the total number of traces in the event log. The optimal Variant Score is \(0\), reached when every cluster contains exactly one variant, which indicates a clear separation of variants across clusters.

With these measures and the steps above, the proposed method recommends a trace clustering pipeline for a given event log, with the aim of improving clustering quality and interpretability.
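
To make the two quality measures above concrete, the following is a minimal sketch that computes them in plain Python. It is not the authors' implementation: the function names `silhouette_sample` and `variant_score`, the use of Euclidean distance on already-encoded traces, and the convention of returning `0.0` for degenerate cases are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_sample(i, points, labels):
    """Silhouette s = (b - a) / max(a, b) for the sample at index i.

    `points` are encoded traces (numeric vectors); `labels` are cluster ids.
    """
    # Collect distances from sample i to every other sample, per cluster.
    by_cluster = defaultdict(list)
    for j, (point, label) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster[label].append(dist(points[i], point))

    own = by_cluster.pop(labels[i], [])
    if not own or not by_cluster:
        # Assumption: singleton clusters or a single cluster score 0.
        return 0.0

    a = sum(own) / len(own)  # mean distance within the same cluster
    b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


def variant_score(traces, labels):
    """Sum over clusters of (#variants in cluster - 1) / #traces in the log."""
    total = len(traces)
    variants = defaultdict(set)
    for trace, label in zip(traces, labels):
        variants[label].add(tuple(trace))  # a variant is a unique activity sequence
    return sum((len(v) - 1) / total for v in variants.values())


# Toy event log: five traces, three variants, two clusters (hypothetical data).
traces = [("a", "b", "c"), ("a", "b", "c"),
          ("a", "c", "b"), ("a", "c", "b"), ("a", "b", "d")]
trace_labels = [0, 0, 1, 1, 1]

# Toy encoded traces for the silhouette: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
point_labels = [0, 0, 1, 1]

v = variant_score(traces, trace_labels)
s_mean = sum(silhouette_sample(i, points, point_labels)
             for i in range(len(points))) / len(points)
```

In the toy log, cluster 0 holds a single variant and contributes 0, while cluster 1 holds two variants and contributes (2 - 1) / 5, so the Variant Score is 0.2. The mean silhouette is close to 1 because the two groups of points are far apart relative to their internal spread.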

Selecting Optimal Trace Clustering Pipelines with AutoML

Problem-oriented AutoML in Clustering

CLAMS: A System for Zero-Shot Model Selection for Clustering

A Survey on AutoML Methods and Systems for Clustering

Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities

From Point-wise to Group-wise: A Fast and Accurate Microservice Trace Anomaly Detection Approach

SKTR: Trace Recovery from Stochastically Known Logs

XAutoML: A Visual Analytics Tool for Understanding and Validating Automated Machine Learning

STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison

Data Pipeline Training: Integrating AutoML to Optimize the Data Flow of Machine Learning Models

Explainable AI-Based Ensemble Clustering for Load Profiling and Demand Response

Automated machine learning with dynamic ensemble selection

Structural Feature Selection for Event Logs

Trace Encoding in Process Mining: a survey and benchmarking

Tracing and Visualizing Human-ML/AI Collaborative Processes through Artifacts of Data Work

FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters

AutoEn: An AutoML method based on ensembles of predefined Machine Learning pipelines for supervised Traffic Forecasting

Selection and Application of Machine Learning- Algorithms in Production Quality

Traceable Automatic Feature Transformation via Cascading Actor-Critic Agents

TRACE: Transformer-based user Representations from Attributed Clickstream Event sequences