An algorithm for clustering with confidence-based must-link and cannot-link constraints

Philipp Baumann, Dorit S. Hochbaum
DOI: https://doi.org/10.1287/ijoc.2023.0419
2024-10-18
Abstract: We study here the semi-supervised $k$-clustering problem where information is available on whether pairs of objects are in the same or in different clusters. This information is either available with certainty or with a limited level of confidence. We introduce the PCCC (Pairwise-Confidence-Constraints-Clustering) algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm uses integer programming for the assignment of objects which allows to include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art in which all pairwise constraints are either considered hard, or all are considered soft. We developed an enhanced multi-start approach and a model-size reduction technique for the integer program that contributes to the effectiveness and the efficiency of the algorithm. Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms the state-of-the-art approaches on instances with all hard or all soft constraints both in terms of runtime and various metrics of solution quality. The code of the PCCC algorithm is publicly available on GitHub.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the **semi-supervised clustering problem**, in which additional information is available on whether pairs of objects belong to the same cluster or to different clusters. This information may be certain (hard constraints) or may hold only with a limited level of confidence (soft constraints). The paper introduces the PCCC (Pairwise-Confidence-Constraints-Clustering) algorithm, which handles clustering problems that contain both hard and soft constraints.

### Problem Background

In many practical applications, such as facility location, genomics, image segmentation, and text analysis, clustering has to take prior knowledge into account. This prior knowledge is usually given as pairwise constraints: some pairs of objects must be in the same cluster (must-link constraints), while other pairs must not be in the same cluster (cannot-link constraints). Each constraint is either known with certainty or comes with a degree of confidence.

### Limitations of Existing Methods

Existing constrained clustering algorithms fall roughly into three categories:

1. **Exact algorithms**: These use constraint programming, column generation, semidefinite programming, or integer programming. They guarantee that all hard constraints are satisfied, but their computational cost is high and they can only solve small instances.
2. **Center-based heuristic algorithms**: These account for pairwise constraints by assigning objects to clusters sequentially. They are fast, but struggle to find high-quality solutions when the number of constraints is large.
3. **Metaheuristic algorithms**: These improve solutions by randomly modifying the assignment vector. When many constraints exist, such random modifications are likely to increase the number of constraint violations.
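To make the distinction between hard and soft pairwise constraints concrete, the following is a minimal sketch of how a given cluster assignment could be checked against a mixed set of constraints. The `PairConstraint` type, the `evaluate` function, the convention that a confidence of 1.0 marks a hard constraint, and the choice of using the confidence value itself as the violation penalty are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairConstraint:
    """Hypothetical container for one pairwise constraint."""
    i: int
    j: int
    must_link: bool    # True: same cluster; False: different clusters
    confidence: float  # assumption: 1.0 means hard, below 1.0 means soft


def evaluate(labels, constraints):
    """Return (hard_violations, soft_penalty) for a cluster assignment.

    Assumption: a violated soft constraint costs its confidence value.
    """
    hard_violations = 0
    soft_penalty = 0.0
    for c in constraints:
        same_cluster = labels[c.i] == labels[c.j]
        # A must-link is violated when the pair is split,
        # a cannot-link when the pair shares a cluster.
        violated = same_cluster != c.must_link
        if not violated:
            continue
        if c.confidence >= 1.0:
            hard_violations += 1
        else:
            soft_penalty += c.confidence
    return hard_violations, soft_penalty


labels = [0, 0, 1, 1]
constraints = [
    PairConstraint(0, 1, must_link=True, confidence=1.0),   # satisfied
    PairConstraint(1, 2, must_link=False, confidence=1.0),  # satisfied
    PairConstraint(0, 2, must_link=True, confidence=0.5),   # soft, violated
    PairConstraint(2, 3, must_link=False, confidence=1.0),  # hard, violated
]
print(evaluate(labels, constraints))
```

An assignment is only acceptable in the hard-constraint sense if the first returned value is zero; the second value is the quantity a soft-constraint method would trade off against clustering quality.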
### Advantages of the PCCC Algorithm

The main innovations of the PCCC algorithm are as follows:

- **Handling hard and soft constraints simultaneously**: Existing algorithms treat all pairwise constraints as hard or all as soft, whereas the PCCC algorithm can handle both types in the same instance.
- **Using integer programming for object assignment**: Instead of assigning objects sequentially, the PCCC algorithm solves an integer program in the assignment step. Relationships can be included as hard constraints that are guaranteed to be satisfied, or as soft constraints that can be violated subject to a penalty.
- **Enhanced multi-start approach**: The algorithm specifically modifies a converged solution and continues iterating from it, which improves effectiveness.
- **Model-size reduction technique**: In a preprocessing step, objects connected by hard constraints are contracted, which reduces the size of the integer program and improves efficiency.

### Summary

The goal of the paper is a clustering algorithm that handles hard and soft constraints effectively on large datasets. According to the abstract, the PCCC algorithm scales to instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints, and it outperforms state-of-the-art approaches on instances with all hard or all soft constraints in terms of both runtime and several metrics of solution quality.
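The model-size reduction idea, contracting objects that hard constraints force into the same cluster, can be illustrated with a union-find structure. This is only a sketch under assumptions: it assumes the contracted constraints are hard must-link constraints, and the function name `contract_must_links` and the way cannot-link pairs are lifted to the contracted groups are hypothetical rather than the paper's implementation.

```python
def contract_must_links(n, must_links, cannot_links):
    """Merge objects joined by hard must-link constraints into groups.

    Returns (group, lifted): group[x] is the group id of object x, and
    lifted is the deduplicated list of cannot-link pairs between groups.
    """
    parent = list(range(n))

    def find(x):
        # Find the representative of x with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_links:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so group ids are deterministic.
            parent[max(ri, rj)] = min(ri, rj)

    roots = sorted({find(x) for x in range(n)})
    group_of_root = {r: g for g, r in enumerate(roots)}
    group = [group_of_root[find(x)] for x in range(n)]

    lifted = set()
    for i, j in cannot_links:
        gi, gj = group[i], group[j]
        if gi == gj:
            # The hard constraints contradict each other.
            raise ValueError(f"cannot-link ({i}, {j}) conflicts with must-links")
        lifted.add((min(gi, gj), max(gi, gj)))
    return group, sorted(lifted)


group, lifted = contract_must_links(
    n=5,
    must_links=[(0, 1), (1, 2)],
    cannot_links=[(2, 3), (0, 3), (3, 4)],
)
print(group)
print(lifted)
```

In this toy instance, five objects shrink to three groups and three cannot-link pairs shrink to two, so an assignment model built on groups rather than objects would need fewer variables and constraints.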