Faisal N. Abu-Khzam, Lucas Isenmann, Sergio Thoumi
Abstract: Correlation clustering seeks a partition of the vertex set of a given graph/network into groups of closely related, or just close enough, vertices so that elements of different groups are not close to each other. The problem has been previously modeled and studied as a graph editing problem, namely Cluster Editing, which assumes that closely related data elements must be adjacent. As such, the main objective (of the Cluster Editing problem) is to turn clusters into cliques as a way to identify them. This is to be obtained via two main edge editing operations: additions and deletions. There are two problems with the Cluster Editing model that we seek to address in this paper. First, "closely" related does not necessarily mean "directly" related, so closeness should be measured by relatively short distance. As such, we seek to turn clusters into (sub)graphs of small diameter. Second, in real applications, a data element can belong to, or have roles in, multiple groups. In some cases, forbidding data elements from belonging to more than one cluster makes it hard to achieve any clustering via classical partition-based methods. We address this latter problem by allowing vertex cloning, also known as vertex splitting. Heuristic methods for the introduced problem are presented along with experimental results showing the effectiveness of the proposed model and algorithmic approach.
What problem does this paper attempt to address?
The paper addresses two limitations of the traditional Cluster Editing model of correlation clustering:
1. "Close" does not mean "directly related": The Cluster Editing model requires closely related data elements to be adjacent. In practice, two elements can be closely related without sharing an edge, as long as the distance between them is short. The authors therefore propose turning clusters into subgraphs of small diameter rather than strict cliques.
2. Data elements cannot belong to multiple clusters: In many real-world settings, a data element belongs to several clusters at once. For example, in a protein-protein interaction (PPI) network, a protein can have multiple biological functions. Traditional partition-based methods assign each data point to exactly one cluster, which in some cases makes an effective clustering hard to achieve. To address this, the authors allow vertex cloning, also known as vertex splitting: a vertex can be split into multiple copies, each belonging to a different cluster.
Specifically, the paper proposes a new model, **2-Club Cluster Edge Deletion with Vertex Splitting (2CCEDVS)**, together with heuristic algorithms for solving it. The model transforms the graph into a disjoint union of 2-clubs through edge deletion and vertex splitting operations, which allows clusters to overlap and relaxes the requirement that elements within a cluster be directly connected.
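The vertex splitting operation can be illustrated with a minimal Python sketch. It assumes the usual definition, in which a vertex \( v \) is replaced by two copies whose neighbourhoods together cover the original neighbourhood of \( v \); the summary above does not spell this out, so treat it as an assumption. The function name `split_vertex`, the adjacency-dict representation, and the copy labels `(v, 1)` and `(v, 2)` are hypothetical choices made for illustration and are not taken from the paper.

```python
def split_vertex(adj, v, left, right):
    """Return a new graph in which vertex v is replaced by two copies.

    adj   : dict mapping each vertex to the set of its neighbours
    left  : neighbours of v kept by the first copy, labelled (v, 1)
    right : neighbours of v kept by the second copy, labelled (v, 2)

    Assumption: left and right together must cover N(v); they may overlap.
    """
    left, right = set(left), set(right)
    if left | right != adj[v]:
        raise ValueError("left and right must together equal the neighbourhood of v")

    # Copy the graph without v and without any edge incident to v.
    new_adj = {u: set(nbrs) - {v} for u, nbrs in adj.items() if u != v}

    # Add the two copies and reconnect them to their assigned neighbours.
    v1, v2 = (v, 1), (v, 2)
    new_adj[v1] = set(left)
    new_adj[v2] = set(right)
    for u in left:
        new_adj[u].add(v1)
    for u in right:
        new_adj[u].add(v2)
    return new_adj


# Path a - b - c - d - e: splitting the middle vertex c separates the path
# into two shorter paths, a - b - (c, 1) and (c, 2) - d - e.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
halves = split_vertex(path, "c", left={"b"}, right={"d"})
```

In this example a single split turns a path of diameter 4 into two components of diameter 2, with `c` represented in both, which is the kind of overlap the model is meant to permit.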
### Formulas and Definitions
The paper relies on the following key concepts:
- **s-club**: A set of vertices \( S \) whose induced subgraph has diameter at most \( s \); that is, any two vertices of \( S \) are joined by a path of length at most \( s \) that stays inside \( S \).
  - For \( s = 2 \), a 2-club is a set in which any two vertices are either adjacent or share a common neighbor within the set.
- **Cluster Editing Problem**:
  - The goal is to transform a graph into a disjoint union of cliques by adding and deleting edges.
  - Formal definition: Given a graph \( G=(V, E) \) and a positive integer \( k \), can \( G \) be transformed into a disjoint union of cliques by at most \( k \) edge-editing operations (edge additions or deletions)?
- **2-Club Cluster Edge Deletion with Vertex Splitting (2CCEDVS)**:
  - The goal is to transform a graph into a disjoint union of 2-clubs using at most \( k \) edge-deletion and vertex-splitting operations.
  - Formal definition: Given a graph \( G=(V, E) \) and a positive integer \( k \), can \( G \) be transformed into a disjoint union of 2-clubs by at most \( k \) edge-deletion and vertex-splitting operations?
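The definitions above can be checked mechanically. The following Python sketch tests whether a vertex set is an s-club and whether a graph is already a disjoint union of s-clubs, which is the target structure of the problem. It is only a verification aid under stated assumptions, not the heuristic from the paper: the graph is assumed to be a dict mapping each vertex to a set of neighbours, and all function names are hypothetical.

```python
from collections import deque


def eccentricity_within(adj, source, allowed):
    """Largest BFS distance from source using only vertices in `allowed`.

    Returns None if some vertex of `allowed` cannot be reached.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(allowed):
        return None
    return max(dist.values())


def is_s_club(adj, subset, s=2):
    """True if the subgraph induced by `subset` has diameter at most s."""
    subset = set(subset)
    for v in subset:
        ecc = eccentricity_within(adj, v, subset)
        if ecc is None or ecc > s:
            return False
    return True


def components(adj):
    """Connected components of the graph, each as a set of vertices."""
    seen, comps = set(), []
    for v in adj:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def is_union_of_s_clubs(adj, s=2):
    """True if every connected component of the graph is an s-club."""
    return all(is_s_club(adj, comp, s) for comp in components(adj))


# A star with centre c and leaves x, y: the leaves are at distance 2 in the
# whole graph, but {x, y} alone is not a 2-club because the path uses c.
star = {"c": {"x", "y"}, "x": {"c"}, "y": {"c"}}
leaves_form_club = is_s_club(star, {"x", "y"})       # False
whole_star_is_club = is_s_club(star, {"c", "x", "y"})  # True
```

The star example shows why distances must be measured inside the induced subgraph: two vertices can be within distance \( s \) in \( G \) without their set being an s-club.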
### Conclusion
By combining 2-clubs with vertex splitting, the paper aims to provide a more flexible and more practical approach to correlation clustering. According to the reported experiments, the proposed 2CCEDVS model and heuristics are effective, in both clustering quality and efficiency, on synthetic data and on real-life biological network data.