Abstract: Clustering is a main task of data mining, and its purpose is to identify natural structures in a dataset. The results of cluster analysis depend not only on the nature of the data itself but also on a priori conditions, such as the clustering algorithm, the similarity/dissimilarity measure, and the parameters. For data without a clustering structure, clustering results need to be evaluated. For data with a clustering structure, the different results obtained under different algorithms and parameters also need to be further optimized through clustering validation. Moreover, clustering validation is vital to clustering applications, especially when external information is not available. It is applied in algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices were designed for numerical data and fail on categorical data. Categorical data is a popular data type whose attribute values are discrete and cannot be ordered. For categorical data, the existing measures have limitations in different application circumstances. In this paper, a new similarity measure, called CONC, is defined; it is based on the concentration ratio of every attribute value and evaluates the similarity of objects within a cluster. Similarly, a new dissimilarity measure, called DCRP, is defined; it is based on the discrepancy of characteristic attribute values and evaluates the dissimilarity between two clusters. A new internal clustering validation index, called CVC, which is based on CONC and DCRP, is proposed.
Compared to other indices, CVC has three characteristics: (1) it evaluates the compactness of a cluster based on information from the whole dataset, not only from that cluster; (2) it evaluates the separation between two clusters using several characteristic attribute values, so that clustering information is not lost and the negative effects of noise are eliminated; (3) it evaluates compactness and separation without being influenced by the number of objects. Furthermore, UCI benchmark datasets were used to compare the proposed index with other internal clustering validation indices (CU, CDCS, and IE). An external index, normalized mutual information (NMI), was used to evaluate the effectiveness of these internal indices. According to the experimental results, CVC is more effective than the other internal clustering validation indices. In addition, CVC, as an internal index, is more widely applicable than the external NMI index, because it can evaluate clustering results without external information.
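
For readers unfamiliar with the external index mentioned above, the following is a minimal sketch of NMI between a clustering result and ground-truth class labels. The abstract does not say which normalization the authors use; the geometric-mean form, I(U;V) / sqrt(H(U) H(V)), is assumed here, and the function names are illustrative rather than taken from the paper.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same objects."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI with geometric-mean normalization (an assumed variant)."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same objects")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


# Hypothetical toy labelings: cluster ids vs. known classes.
predicted = [0, 0, 1, 1, 2, 2]
truth = ["a", "a", "b", "b", "b", "c"]
score = nmi(predicted, truth)
```

Because NMI needs the `truth` labels, it cannot be computed when no external information is available, which is the situation internal indices such as CVC are meant to address.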

A New Separation Measure for Improving the Effectiveness of Validity Indices

On the Index of Cluster Validity

A Distance-based Separability Measure for Internal Cluster Validation

An Internal Cluster Validity Index Using a Distance-based Separability Measure

Volume and Surface Area Based Cluster Validity Index

Boundary Matching and Interior Connectivity-Based Cluster Validity Analysis

A New Validity Index Based on Intra-Cluster Variation and Inter-Cluster Overlap

A New Connectivity-Based Cluster Validity Index

New criteria for evaluating the validity of clustering

A GROUP OF NEW INDEXES OF CLUSTER VALIDITY

An Effective Partitional Clustering Algorithm Based on New Clustering Validity Index

An Unsupervised and Robust Validity Index for Clustering Analysis

Efficient synthetical clustering validity indexes for hierarchical clustering

Determining The Correct Number Of Clusters In The CT Image Segmentation

On the Use of Relative Validity Indices for Comparing Clustering Approaches

A New Cluster Validity Index Based on the Adjustment of Within-Cluster Distance

A New Internal Clustering Validation Index for Categorical Data Based on Concentration of Attribute Values

Clustering Validity Based on the Improved Hubert Γ Statistic and the Separation of Clusters

A comparative study of different cluster validity indexes

Particle Swarm Optimization Based Clustering: A Comparison of Different Cluster Validity Indices

A cluster validity index for fuzzy clustering