Abstract: Cluster analysis classifies data elements into categories according to their similarity. It is a common task in data mining and is useful in various fields, including pattern recognition, machine learning, and information retrieval. Because the area has been studied extensively, many clustering methods have been proposed in the literature, and some of them focus on mining clusters with arbitrary shapes. However, for large-scale, multi-dimensional data there is still a need for an efficient and versatile clustering method that can identify arbitrarily shaped clusters embedded in a multi-dimensional space. In this paper, we propose a density-based clustering algorithm that adopts a divide-and-conquer strategy. To handle large-scale, multi-dimensional data, we first partition the data into grid cells; this partitioning remains efficient in large-scale cases where other algorithms often fail. Moreover, rather than requiring the grid cell width to be tuned manually, we present a way to determine it automatically. We then propose a flood-fill-like algorithm that identifies clusters with arbitrary shapes over these grid cells. Finally, extensive experiments on both synthetic and real-world databases show that the proposed algorithm efficiently finds accurate clusters in both low-dimensional and multi-dimensional databases.
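
The two steps named in the abstract (partitioning points into grid cells, then flood-filling over those cells) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function name `grid_flood_fill_cluster`, the `min_pts` density threshold, and the choice of treating every cell that shares a face, edge, or corner as a neighbour are all assumptions made for illustration. The abstract does not describe how the cell width is chosen automatically, so `cell_width` is simply passed in by the caller here.

```python
from collections import defaultdict, deque
from itertools import product


def grid_flood_fill_cluster(points, cell_width, min_pts=3):
    """Label points by flood-filling over dense grid cells; -1 marks noise.

    Hypothetical helper: cell_width and min_pts are illustrative parameters,
    not values or rules taken from the paper.
    """
    # Divide step: bucket point indices by integer grid-cell coordinates.
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(x // cell_width) for x in point)
        cells[key].append(idx)

    # Assumption: a cell counts as "dense" when it holds >= min_pts points.
    dense = {key for key, members in cells.items() if len(members) >= min_pts}

    dim = len(points[0]) if points else 0
    # Neighbour offsets: every adjacent cell (3**dim - 1 of them).
    # This grows exponentially with dim, so it only suits low dimensions.
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]

    labels = [-1] * len(points)
    visited = set()
    cluster_id = 0
    for start in dense:
        if start in visited:
            continue
        # Flood fill: breadth-first search over connected dense cells.
        queue = deque([start])
        visited.add(start)
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            for offset in offsets:
                neighbour = tuple(c + d for c, d in zip(cell, offset))
                if neighbour in dense and neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        cluster_id += 1
    return labels
```

Calling `grid_flood_fill_cluster(points, cell_width=1.0)` on a list of coordinate tuples returns one integer label per point. Points in sparse cells keep the label -1, and points in adjacent dense cells share a label, which is how a chain of cells can trace out a cluster of arbitrary shape. Cluster numbering depends on iteration order and carries no meaning beyond grouping.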
