Abstract: Clustering is a widely used technique in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either continuous or categorical attributes. However, datasets with mixed types of attributes are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering data with both continuous and categorical attributes. This distance measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculation of this measure is memory efficient, as it depends only on the merging cluster pair and not on all the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm using our distance measure based on the framework of BIRCH. Similar to BIRCH, our algorithm first performs a pre-clustering step by scanning the entire dataset and storing the dense regions of data records in terms of summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from the ability to handle mixed types of attributes, our algorithm differs from BIRCH in two ways: we add a procedure that enables the algorithm to automatically determine the appropriate number of clusters, and we use a new strategy for assigning cluster membership to noisy data. For data with mixed types of attributes, our experimental results confirm that the algorithm not only generates better quality clusters than traditional k-means algorithms, but also exhibits good scalability properties and is able to identify the underlying number of clusters in the data correctly. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard of data mining model deployment.
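
The idea of a merge distance defined as a decrease in log-likelihood, computed only from the summary statistics of the two clusters involved, can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the model below is an assumption: continuous attributes are treated as independent Gaussians, categorical attributes as independent multinomials, and a small `var_floor` term is added to avoid taking the logarithm of zero. The names `Summary`, `summarize`, `merge`, `loglik`, and `merge_distance` are hypothetical and are not taken from the paper or from Clementine.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Summary:
    """Summary statistics of one cluster (assumed layout, for illustration)."""
    n: int          # number of records
    s: list         # per continuous attribute: sum of values
    ss: list        # per continuous attribute: sum of squared values
    counts: list    # per categorical attribute: Counter of category counts


def summarize(records, n_num):
    """Build a Summary from records whose first n_num fields are continuous."""
    n_cat = len(records[0]) - n_num
    s, ss = [0.0] * n_num, [0.0] * n_num
    counts = [Counter() for _ in range(n_cat)]
    for r in records:
        for k in range(n_num):
            s[k] += r[k]
            ss[k] += r[k] * r[k]
        for k in range(n_cat):
            counts[k][r[n_num + k]] += 1
    return Summary(len(records), s, ss, counts)


def merge(a, b):
    """Summary of the merged cluster: the statistics simply add up."""
    return Summary(
        a.n + b.n,
        [x + y for x, y in zip(a.s, b.s)],
        [x + y for x, y in zip(a.ss, b.ss)],
        [ca + cb for ca, cb in zip(a.counts, b.counts)],
    )


def loglik(c, var_floor=1e-6):
    """Maximized log-likelihood of a cluster, dropping terms linear in n.

    Assumption: independent Gaussian (continuous) and multinomial
    (categorical) attributes. Terms linear in n cancel in merge_distance.
    """
    ll = 0.0
    for s, ss in zip(c.s, c.ss):
        var = max(ss / c.n - (s / c.n) ** 2, 0.0)
        ll -= 0.5 * c.n * math.log(var + var_floor)
    for cnt in c.counts:
        ll += sum(m * math.log(m / c.n) for m in cnt.values())
    return ll


def merge_distance(a, b):
    """Decrease in log-likelihood caused by merging clusters a and b."""
    return loglik(a) + loglik(b) - loglik(merge(a, b))


if __name__ == "__main__":
    a = summarize([(1.0, "x"), (1.2, "x"), (0.8, "y")], n_num=1)
    b = summarize([(1.1, "x"), (0.9, "x"), (1.0, "y")], n_num=1)
    c = summarize([(9.0, "z"), (9.5, "z"), (8.5, "z")], n_num=1)
    print("similar clusters:   ", merge_distance(a, b))
    print("dissimilar clusters:", merge_distance(a, c))
```

In this sketch `merge_distance` reads only the two summaries being compared, which mirrors the memory-efficiency argument in the abstract: no other cluster and no raw record is needed once the summaries exist.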

A Hybrid Approach to Clustering in Very Large Databases

A Fast Algorithm for Density-Based Clustering in Large Database

A Novel Kernel Possibilistic Fuzzy C-Means Clustering Algorithm For Large Scale Data Sets

A boosted clustering algorithm for distributed homogeneous data mining

Using Visualization to Improve Clustering Analysis on Heterogeneous Information Network.

A robust and scalable clustering algorithm for mixed type attributes in large database environment.

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

Approaches for Scaling DBSCAN Algorithm to Large Spatial Databases

A Statistical Information-Based Clustering Approach in Distance Space

Combining Sampling Technique With DBSCAN Algorithm For Clustering Large Spatial Databases

Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging

Towards effective and efficient mining of arbitrary shaped clusters

Enhanced Locality Sensitive Clustering in High Dimensional Space

Fuzzy hierarchical clustering algorithm facing large databases

Sequential Combination Methods for Data Clustering Analysis

Partition Affinity Propagation for Clustering Large Scale of Data in Digital Library

A New Clustering Method Suitable for Large Scale Data

GriT-DBSCAN: A spatial clustering algorithm for very large databases

A novel hybridization approach to improve the critical distance clustering algorithm: Balancing speed and quality

Clustering Large Datasets by Merging K-Means Solutions

A Highly Scalable Clustering Scheme Using Boundary Information