Abstract: Clustering is one of the most effective methods for analyzing datasets that contain a large number of objects with numerous attributes. Clustering seeks to identify groups, or clusters, of similar objects. In low dimensional space, the similarity between objects is often evaluated by summing the differences across all of their attributes. High dimensional data, however, may contain irrelevant attributes which mask the existence of clusters. The discovery of groups of objects that are highly similar within some subsets of relevant attributes becomes an important but challenging task. This thesis focuses on various models and algorithms for this task. We first present a flexible clustering model, namely OP-Cluster (Order Preserving Cluster). Under this model, two objects are similar on a subset of attributes if the values of these two objects induce the same relative ordering of these attributes. The OP-Clustering algorithm has proven useful for identifying co-regulated genes in gene expression data. We also propose a semi-supervised approach to discover biologically meaningful OP-Clusters by incorporating existing gene function classifications into the clustering process. This semi-supervised algorithm yields only OP-Clusters that are significantly enriched by genes from specific functional categories. Real datasets are often noisy. We propose a noise-tolerant algorithm for mining frequently occurring itemsets, called approximate frequent itemsets (AFI). Both theoretical and experimental results demonstrate that our AFI mining algorithm achieves higher recoverability of real clusters than other existing itemset mining approaches. Pair-wise dissimilarities are often derived from the original data to reduce the complexity of high dimensional data. Traditional clustering algorithms that take pair-wise dissimilarities as input often generate disjoint clusters. It is well known that the classification model represented by disjoint clusters is inconsistent with many real classifications, such as gene function classifications. We develop a Poclustering algorithm, which generates overlapping clusters from pair-wise dissimilarities. We prove that by allowing overlapping clusters, Poclustering fully preserves the information of any dissimilarity matrix, while traditional partitioning algorithms may cause significant information loss.
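
The order-preserving notion of similarity described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's OP-Cluster algorithm; it only checks whether two objects induce the same relative ordering on a chosen subset of attributes. The function names `induced_order` and `same_order`, the example values, and the tie-breaking rule (ties resolved by position in the attribute list) are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def induced_order(values: Sequence[float], attrs: Sequence[int]) -> Tuple[int, ...]:
    """Return the attributes in `attrs` sorted by the object's value on each.

    Assumption: ties are broken by the attribute's position in `attrs`,
    because Python's sort is stable.
    """
    return tuple(sorted(attrs, key=lambda a: values[a]))


def same_order(x: Sequence[float], y: Sequence[float], attrs: Sequence[int]) -> bool:
    """True if objects x and y induce the same ordering of the attributes in `attrs`."""
    return induced_order(x, attrs) == induced_order(y, attrs)


# Hypothetical objects with four attributes each (indices 0..3).
x = [10, 40, 20, 90]
y = [1, 7, 3, 2]

# On attributes {0, 1, 2} both objects rank the attributes as 0 < 2 < 1.
similar_on_subset = same_order(x, y, [0, 1, 2])

# Adding attribute 3 breaks the agreement: it is largest for x but not for y.
similar_on_all = same_order(x, y, [0, 1, 2, 3])
```

In this example the two objects agree on the subset but not on the full attribute set, which mirrors the abstract's point that irrelevant attributes can mask clusters that exist in a subspace.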

New approaches for clustering high dimensional data
