Deep clustering based on embedded auto-encoder

Xuan Huang, Zhenlong Hu, Lin Lin
DOI: https://doi.org/10.1007/s00500-021-05934-8
IF: 3.732
2021-06-18
Soft Computing
Abstract: Deep clustering is a new research direction that combines deep learning and clustering. It performs feature representation and cluster assignment simultaneously, and its clustering performance is significantly superior to that of traditional clustering algorithms. The auto-encoder is a neural network model that learns the hidden features of the input objects to achieve nonlinear dimensionality reduction. This paper proposes the embedded auto-encoder network model; specifically, an auto-encoder is embedded into the encoder unit and the decoder unit of the prototype auto-encoder, respectively. To cluster high-dimensional objects effectively, the encoder of the model first encodes the raw features of the input objects and obtains a cluster-friendly feature representation. Then, in the model training stage, adding smoothness constraints to the objective function of the encoder significantly improves the representation capability of the hidden-layer coding. Finally, the adaptive self-paced learning threshold is determined according to the median distance between each object and its corresponding centroid, and the fine-tuning samples of the model are selected automatically. Experimental results on multiple image datasets show that the model has fewer parameters and higher efficiency, and that its comprehensive clustering performance is significantly superior to that of state-of-the-art clustering methods.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper proposes a new deep clustering method, Embedded Auto-Encoder Cluster (EmAEC), aiming to address the effective clustering of high-dimensional data. Specifically:

1. **Challenges of High-Dimensional Data Clustering**:
   - Traditional clustering methods face the problem of "curse of dimensionality" when dealing with high-dimensional data, making it difficult to find meaningful clustering results.
   - The feature space of high-dimensional data grows exponentially, significantly increasing the complexity of the model.
2. **Combining Deep Learning with Clustering**:
   - Utilizing deep neural networks (DNN) to extract high-level features of input objects and learning effective feature representations through a large number of input objects, followed by clustering.
   - Deep clustering can significantly improve clustering performance, outperforming traditional clustering algorithms.
3. **Embedded Auto-Encoder Architecture**:
   - Proposing a novel embedded auto-encoder architecture that embeds the auto-encoder into the encoding and decoding units of the prototype auto-encoder.
   - This embedded auto-encoder can perform an encoding-decoding operation before the final encoding, achieving more effective dimensionality reduction and improving feature representation capability.
4. **Model Training Improvements**:
   - During the model training phase, applying smooth constraints to the hidden layer encoding to obtain smoother and more continuous intermediate layer manifolds, significantly enhancing the representation capability of the hidden layer encoding.
   - Introducing an adaptive self-paced learning strategy that automatically selects fine-tuning samples during the fine-tuning process and determines the threshold based on the distance between the sample and its corresponding centroid, preventing boundary samples from participating in training, thus achieving better convergence.
5. **Experimental Validation**:
   - Experimental results on multiple image datasets show that this model has fewer parameters, higher efficiency, and significantly better overall clustering performance compared to existing advanced clustering methods.

In summary, the main goal of this paper is to propose a deep clustering method capable of effectively handling high-dimensional data, utilizing embedded auto-encoders and adaptive self-paced learning strategies to significantly enhance clustering performance.
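
The adaptive self-paced selection step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract only says the threshold is determined "according to the median distance" between each object and its centroid, so the code below makes the simplest assumption, namely that the threshold equals the median of those distances and that samples at or below it are kept for fine-tuning. The function name `select_confident_samples` and the toy data are hypothetical.

```python
import numpy as np


def select_confident_samples(embeddings, centroids):
    """Pick fine-tuning samples by a median-distance threshold (assumed rule).

    embeddings: (n, d) array of hidden-layer codes.
    centroids:  (k, d) array of cluster centers.
    Returns a boolean mask of kept samples, the cluster labels, and the threshold.
    """
    # Euclidean distance from every sample to every centroid, shape (n, k).
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    # Each sample is assigned to its nearest centroid.
    labels = dists.argmin(axis=1)
    nearest = dists[np.arange(embeddings.shape[0]), labels]
    # Assumption: the threshold is exactly the median of these distances.
    threshold = float(np.median(nearest))
    # Samples farther than the threshold are treated as boundary samples
    # and excluded from this fine-tuning round.
    mask = nearest <= threshold
    return mask, labels, threshold


# Toy example: two clusters in 2-D, with one far-away sample per cluster.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
embeddings = np.array(
    [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [10.0, 14.0]]
)
mask, labels, threshold = select_confident_samples(embeddings, centroids)
print(mask, labels, threshold)
```

Because the threshold is recomputed from the current embeddings, it adapts as training tightens the clusters; a per-cluster median would be an equally plausible reading of the abstract.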