Improved fuzzy C-means clustering algorithm based on t-SNE for terahertz spectral recognition

Cancan Yi, Shuai Tuo, Shan Tu, Wentao Zhang
DOI: https://doi.org/10.1016/j.infrared.2021.103856
IF: 2.997
2021-01-01
Infrared Physics & Technology
Abstract: Terahertz (THz) waves, characterized by low energy, transient behavior and suitability for spectral analysis, are promising for material identification. The main dimensionality reduction methods used in terahertz-based material identification, including Principal Component Analysis (PCA), Locality Preserving Projections (LPP) and Locally Linear Embedding (LLE), are sensitive to the number of nearest-neighbor samples and neglect differences among classes, which makes the subsequent clustering model difficult to design or leads to incorrect clustering. t-distributed Stochastic Neighbor Embedding (t-SNE) models the sample distribution in the high-dimensional space as Gaussian and the coordinates in the low-dimensional space with a t-distribution, which pushes clusters that are far apart even farther apart and thus relieves the crowding among them. In addition, traditional Fuzzy C-means (FCM) clustering determines the initial cluster centers randomly, so it easily falls into a local optimum, which tends to cause misrecognition. To solve these problems, this paper proposes an improved FCM algorithm for terahertz spectral recognition. First, t-SNE is used for dimensionality reduction and for selecting the initial cluster centers, giving a more accurate clustering result. On this basis, classical FCM clustering is used to recognize different substances from their terahertz spectra. The algorithm not only relieves the crowding among classes during clustering, but also reflects the distances between them, which helps place appropriate cluster centers among the samples. To verify the reliability of the proposed method, terahertz time-domain spectroscopy is used to measure three genetically modified cotton seed varieties, lumianyan28, lumianyan29 and lumianyan36, obtaining their time-domain spectral data. Applied to these spectral data, the proposed method successfully distinguishes the three types of transgenic cotton seeds, with an overall accuracy of 0.9668. The results show that the proposed clustering method is promising for identifying materials from their terahertz spectra.
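The clustering stage described in the abstract can be illustrated with a minimal sketch of standard Fuzzy C-means in which the initial cluster centers are supplied by the caller rather than drawn at random. The abstract does not say how the centers are chosen from the t-SNE embedding, so that step is not reproduced here: the toy 2-D points below merely stand in for a t-SNE output, and the names `fcm`, `embedded` and `initial_centers`, as well as the fuzzifier `m = 2`, are assumptions made for illustration, not the authors' implementation.

```python
import math


def _memberships(points, centers, m):
    """Standard FCM membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp distances so a point that coincides with a center does not divide by zero.
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        rows.append([1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists])
    return rows


def fcm(points, centers, m=2.0, max_iter=100, tol=1e-6):
    """Fuzzy C-means starting from caller-supplied centers (hypothetical helper).

    Returns (centers, memberships); memberships[i][k] is the degree to which
    point i belongs to cluster k, and each row sums to 1.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(max_iter):
        u = _memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            # Center update: mean of the points weighted by u_ik^m.
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once no center moves noticeably
            break
    return centers, _memberships(points, centers, m)


# Toy 2-D points standing in for a t-SNE embedding of three well-separated classes.
embedded = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (10.0, 0.1), (10.1, -0.2), (9.8, 0.0),
]
# Assumption: one embedded sample per class serves as the initial center.
initial_centers = [embedded[0], embedded[3], embedded[6]]

centers, u = fcm(embedded, initial_centers)
# Harden the fuzzy memberships into class labels by taking the largest degree.
labels = [max(range(len(row)), key=row.__getitem__) for row in u]
print(labels)
```

Starting the iteration from centers that already sit inside each group is what removes the dependence on a random start; with random initialization the same update rules can settle into a poorer local optimum.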