Abstract: In this paper we propose a Deep Autoencoder MIxture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data-point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture of autoencoder network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

What problem does this paper attempt to address?

The paper addresses the clustering of high-dimensional data. Specifically, the authors propose Deep Autoencoder MIxture Clustering (DAMIC), a clustering algorithm built on a mixture of deep autoencoders. It aims to avoid the data collapse problem common to existing deep clustering algorithms without requiring an additional regularization term.

### Problem Background

Traditional clustering methods such as k-means perform poorly on high-dimensional data because distances become less informative in high-dimensional spaces. With the success of deep learning, many studies have used deep neural networks for clustering, for example by mapping data to a low-dimensional feature space with autoencoders, variational autoencoders, or generative adversarial networks (GANs) and then clustering in that space. However, these methods usually need a regularization term to prevent the data from collapsing to a single point, which complicates parameter tuning and may reduce clustering accuracy.

### Solutions Proposed in the Paper

The main innovations of the DAMIC algorithm are:

1. **Each cluster is represented by an autoencoder**: Unlike traditional methods, DAMIC represents each cluster with an autoencoder rather than a single centroid vector, which captures the structure of the data within the cluster more richly.
2. **No need for regularization terms**: DAMIC finds the clustering by minimizing the reconstruction error, which avoids the data collapse problem. As a result, no regularization term has to be tuned separately for each data set.
3. **Joint training mechanism**: DAMIC trains the clustering network and the autoencoders simultaneously, so that the data representation and the per-cluster models are learned together.
### Formula Summary

The core loss function of the DAMIC algorithm is:

\[ L(\theta_1,\ldots,\theta_k,\theta_c)=-\sum_{t = 1}^{n}\log\left(\sum_{i = 1}^{k}p(c_t = i|x_t;\theta_c)\exp(-d(x_t,f_i(x_t;\theta_i)))\right) \]

where:

- \(p(c_t = i|x_t;\theta_c)\) is the probability that the input \(x_t\) is assigned to the \(i\)-th cluster, computed as a softmax over the nonlinear representation \(h(x)\):

\[ p(c = i|x;\theta_c)=\frac{\exp(w_i h(x)+b_i)}{\sum_{j = 1}^{k}\exp(w_j h(x)+b_j)} \]

- \(d(x_t,f_i(x_t;\theta_i))\) is the reconstruction error of the \(i\)-th autoencoder \(f_i\) for \(x_t\), defined as:

\[ d(x_t,f_i(x_t;\theta_i))=\frac{1}{2}\|x_t - f_i(x_t;\theta_i)\|^2 \]

### Experimental Verification

The paper evaluates DAMIC on several standard data sets (MNIST, Fashion, 20NEWS and RCV1). The reported results show that DAMIC outperforms the compared methods on the NMI, ARI and ACC metrics.

### Conclusion

By using a mixture model built from autoencoders, the DAMIC algorithm avoids the data collapse problem without a regularization term, improves clustering performance, and applies to a range of high-dimensional clustering tasks, including image and text data.
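To make the loss in the Formula Summary concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It evaluates the loss for fixed models and performs no training. The two "autoencoders" are hand-written projections standing in for trained networks, and the gate functions `good_gate` and `bad_gate`, the helper names, and the toy points are all assumptions made for illustration. The inner sum is evaluated with the log-sum-exp trick for numerical stability, which is an implementation choice rather than something stated in the summary.

```python
import math

def log_softmax(logits):
    """Log of the softmax p(c = i | x), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def recon_error(x, x_hat):
    """d(x, x_hat) = 0.5 * ||x - x_hat||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))

def damic_loss(points, gate_logits, autoencoders):
    """-sum_t log( sum_i p(c_t = i | x_t) * exp(-d(x_t, f_i(x_t))) )."""
    total = 0.0
    for x in points:
        log_p = log_softmax(gate_logits(x))
        # log p_i - d_i per cluster, then log-sum-exp over clusters
        terms = [lp - recon_error(x, f(x)) for lp, f in zip(log_p, autoencoders)]
        m = max(terms)
        total -= m + math.log(sum(math.exp(t - m) for t in terms))
    return total

def assign_cluster(x, gate_logits):
    """Hard decision: index of the cluster with the largest gate logit."""
    logits = gate_logits(x)
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical stand-ins for trained autoencoders: each one reconstructs
# points lying along one coordinate axis and discards the other coordinate.
def keep_first(x):
    return [x[0], 0.0]

def keep_second(x):
    return [0.0, x[1]]

# Hypothetical gates producing the logits w_i h(x) + b_i.
def good_gate(x):
    return [5.0 * abs(x[0]), 5.0 * abs(x[1])]

def bad_gate(x):  # same logits with the two clusters swapped
    return [5.0 * abs(x[1]), 5.0 * abs(x[0])]

autoencoders = [keep_first, keep_second]
points = [[3.0, 0.1], [0.1, -2.0]]

good_loss = damic_loss(points, good_gate, autoencoders)
bad_loss = damic_loss(points, bad_gate, autoencoders)
labels = [assign_cluster(x, good_gate) for x in points]
print(labels, good_loss < bad_loss)
```

In this toy setting the loss is small when the gate routes each point to the autoencoder that reconstructs it well, and large when the routing is swapped. This is the signal that joint training of the gate and the autoencoders would exploit.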

Deep Clustering Based on a Mixture of Autoencoders
