Shlomo E. Chazan, Sharon Gannot, Jacob Goldberger
Abstract: In this paper we propose a Deep Autoencoder MIxture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data-point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture of autoencoder network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.
What problem does this paper attempt to address?
The paper addresses the clustering of high-dimensional data. Specifically, the authors propose Deep Autoencoder MIxture Clustering (DAMIC), a clustering algorithm based on a mixture of deep autoencoders. It aims to avoid the data collapse problem common in existing deep clustering algorithms without requiring an additional regularization term.
### Problem Background
Traditional clustering methods such as k-means perform poorly on high-dimensional data, because distances become less informative in high-dimensional spaces. Following the success of deep learning, many studies have used deep neural networks for clustering, for example by mapping the data to a low-dimensional feature space with autoencoders, variational autoencoders or generative adversarial networks (GANs) and then clustering in that space.
However, these methods usually need to introduce regularization terms to prevent data from collapsing into a single point, which increases the complexity of parameter tuning and may affect the accuracy of clustering.
### Solutions Proposed in the Paper
The main innovations of the DAMIC algorithm include:
1. **Each cluster is represented by an autoencoder**: Unlike centroid-based methods such as k-means, DAMIC represents each cluster with its own autoencoder rather than a single centroid vector, which gives a richer description of the structure within a cluster.
2. **No need for regularization terms**: DAMIC finds the clustering by minimizing the reconstruction error of the mixture-of-autoencoders network. Because the loss is measured on the reconstructed data points, no extra term is needed to keep the data from collapsing to a single point, and no regularization weight has to be tuned for each dataset.
3. **Joint training mechanism**: DAMIC trains the clustering network and the autoencoders simultaneously, so that the nonlinear data representation and the per-cluster autoencoders are learned together.
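The structure described in the list above can be illustrated with a minimal sketch in plain Python. It assumes that "selecting a cluster" means taking the most probable cluster under a softmax over the clustering network's output, which is one natural reading and not necessarily the authors' exact procedure. The names `assign_and_reconstruct`, `embed`, `weights`, `biases` and the toy autoencoders are hypothetical and stand in for trained networks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign_and_reconstruct(
    x: Vector,
    embed: Callable[[Vector], Vector],
    weights: Sequence[Vector],
    biases: Sequence[float],
    autoencoders: Sequence[Callable[[Vector], Vector]],
) -> Tuple[int, Vector]:
    """Pick the most probable cluster for x and reconstruct x with its autoencoder."""
    h = embed(x)  # representation produced by the clustering network
    scores = [
        sum(w_j * h_j for w_j, h_j in zip(w, h)) + b
        for w, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    # Assumption: a hard argmax decision over the cluster probabilities.
    cluster = max(range(len(probs)), key=lambda i: probs[i])
    return cluster, autoencoders[cluster](x)


# Toy stand-ins (hypothetical): each "autoencoder" keeps one coordinate only.
toy_autoencoders = [
    lambda x: [x[0], 0.0],
    lambda x: [0.0, x[1]],
]

cluster, x_hat = assign_and_reconstruct(
    x=[3.0, 0.5],
    embed=lambda x: x,                 # identity embedding, for illustration only
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, 0.0],
    autoencoders=toy_autoencoders,
)
print(cluster, x_hat)  # prints: 0 [3.0, 0.0]
```

In a real model, `embed` and the autoencoders would be deep networks trained jointly. The sketch only shows how the clustering network and the per-cluster autoencoders fit together for a single data point.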
### Formula Summary
The core loss function of the DAMIC algorithm is as follows:
\[
L(\theta_1,\ldots,\theta_k,\theta_c)=-\sum_{t = 1}^{n}\log\left(\sum_{i = 1}^{k}p(c_t = i|x_t;\theta_c)\exp(-d(x_t,f_i(x_t;\theta_i)))\right)
\]
where:
- \(p(c_t = i|x_t;\theta_c)\) is the probability that the input \(x_t\) is assigned to the \(i\)-th cluster. It is a softmax over the clustering network's representation \(h(x)\), with per-cluster weights \(w_i\) and biases \(b_i\):
\[
p(c = i|x;\theta_c)=\frac{\exp(w_i^{\top} h(x)+b_i)}{\sum_{j = 1}^{k}\exp(w_j^{\top} h(x)+b_j)}
\]
- \(d(x_t,f_i(x_t;\theta_i))\) is the reconstruction error of the \(i\)-th autoencoder for \(x_t\), which is defined as:
\[
d(x_t,f_i(x_t;\theta_i))=\frac{1}{2}\|x_t - f_i(x_t;\theta_i)\|^2
\]
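To make the loss concrete, here is a small plain-Python sketch that evaluates it for given clustering-network logits and autoencoder reconstructions. It is an illustration under stated assumptions, not the authors' implementation: the function names and the list-based data layout are hypothetical, and the log-sum-exp trick is added here only for numerical stability.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Log of the softmax probabilities, computed stably."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return [z - log_norm for z in logits]


def reconstruction_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """d(x, x_hat) = 0.5 * ||x - x_hat||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


def damic_loss(
    xs: Sequence[Sequence[float]],
    logits: Sequence[Sequence[float]],
    reconstructions: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Evaluate L = -sum_t log sum_i p(c_t = i | x_t) * exp(-d(x_t, f_i(x_t))).

    xs[t]                 : data point x_t
    logits[t][i]          : clustering-network score of cluster i for x_t
    reconstructions[t][i] : output f_i(x_t) of the i-th autoencoder
    """
    total = 0.0
    for x, z, recs in zip(xs, logits, reconstructions):
        log_p = log_softmax(z)
        # log(p_i * exp(-d_i)) = log p_i - d_i for each cluster i
        terms = [lp - reconstruction_error(x, r) for lp, r in zip(log_p, recs)]
        # log-sum-exp over the clusters
        top = max(terms)
        total -= top + math.log(sum(math.exp(t - top) for t in terms))
    return total


# One point, two clusters: cluster 0 reconstructs perfectly, cluster 1 does not.
loss = damic_loss(xs=[[1.0]], logits=[[0.0, 0.0]], reconstructions=[[[1.0], [0.0]]])
print(loss)
```

With equal logits both clusters have probability 0.5, so the value printed equals \(-\log(0.5 + 0.5\,e^{-0.5})\). The loss decreases as the clustering network puts more probability on the autoencoder that reconstructs the point well.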
### Experimental Verification
The paper evaluates DAMIC on standard image and text datasets (such as MNIST, Fashion-MNIST, 20NEWS and RCV1). The reported results show that DAMIC outperforms the compared methods on normalized mutual information (NMI), adjusted Rand index (ARI) and clustering accuracy (ACC).
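Of these metrics, ACC is the least self-explanatory, because cluster indices are arbitrary and must be matched to the ground-truth classes before counting correct assignments. The sketch below assumes the usual definition, the best accuracy over all one-to-one mappings between clusters and classes, and finds that mapping by brute force. The function name is hypothetical. Brute force is only practical for a small number of clusters, and implementations typically use the Hungarian algorithm for this matching.

```python
from itertools import permutations
from typing import Sequence


def clustering_accuracy(true_labels: Sequence[int], cluster_ids: Sequence[int]) -> float:
    """Best accuracy over all one-to-one mappings from clusters to classes.

    Labels and cluster ids are assumed to be integers starting at 0.
    """
    if len(true_labels) != len(cluster_ids) or len(true_labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    k = max(max(true_labels), max(cluster_ids)) + 1
    # counts[c][t] = number of points in cluster c whose true class is t
    counts = [[0] * k for _ in range(k)]
    for t, c in zip(true_labels, cluster_ids):
        counts[c][t] += 1
    # Try every assignment of clusters to classes (k! candidates).
    best = max(
        sum(counts[c][perm[c]] for c in range(k))
        for perm in permutations(range(k))
    )
    return best / len(true_labels)


# Clusters 0 and 1 are swapped relative to the classes, but the grouping is perfect.
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # prints: 1.0
```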
### Conclusion
By representing clusters with a mixture of autoencoders, DAMIC avoids the data collapse problem without a regularization term, improves clustering performance over the compared methods, and applies to different kinds of high-dimensional data such as images and text.