Research on a Text Data Preprocessing Method Suitable for Clustering Algorithm

Chunlin Wang, Neng Yang, Wanjin Xu, Junjie Wang, Jianyong Sun, Xiaolin Chen
DOI: https://doi.org/10.1109/ispds56360.2022.9874172
2022-01-01
Abstract: In clustering, the features of a data set often have mixed attribute types, such as numerical and text, whose measurement scales are inconsistent. As a result, the distance between samples is easily dominated by the feature values of a single dimension, which degrades clustering performance, and algorithms designed for continuous data cannot handle discrete data. The method in this paper addresses these two problems in two steps. First, each feature attribute of the data set is analyzed: the type and the number of distinct values of each attribute are counted, and attributes that have no effect on the clustering result are deleted. Second, each text attribute with more than two distinct values is expanded into multiple new attributes, each of which takes only two values, encoded as 0 or 1. In this way all text and numeric attributes share a uniform metric. The method was used to preprocess the mushroom data set, which keeps the values in the data set in the same range, and a clustering algorithm was then used to classify it. In the experiment, the classification accuracy of the k-means++ algorithm improved from 70.9% to 89.2% compared with the LabelEncoder method. The method also applies to other algorithms. These results indicate that the method is effective.
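The two preprocessing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "attributes that have no effect" means attributes with a single distinct value, it handles text attributes only, and the function name `preprocess` and the toy rows are hypothetical.

```python
def preprocess(rows, columns):
    """Encode text attributes as 0/1 columns.

    rows: list of dicts mapping attribute name -> text value.
    Returns (new_column_names, encoded_rows).
    """
    # Step 1: count the distinct values of each attribute.
    distinct = {c: sorted({r[c] for r in rows}) for c in columns}

    names = []      # names of the output columns
    encoders = []   # (source attribute, value that maps to 1)
    for c in columns:
        vals = distinct[c]
        if len(vals) < 2:
            # Assumption: a constant attribute cannot affect clustering, so drop it.
            continue
        if len(vals) == 2:
            # Two-valued attribute: one 0/1 column is enough.
            names.append(c)
            encoders.append((c, vals[1]))
        else:
            # Step 2: more than two values, so expand into one 0/1 column per value.
            for v in vals:
                names.append(f"{c}={v}")
                encoders.append((c, v))

    encoded = [[1 if r[c] == v else 0 for c, v in encoders] for r in rows]
    return names, encoded


# Hypothetical toy rows, loosely styled after mushroom-like attributes.
rows = [
    {"cap": "bell", "veil": "partial", "bruises": "yes"},
    {"cap": "flat", "veil": "partial", "bruises": "no"},
    {"cap": "convex", "veil": "partial", "bruises": "yes"},
]
names, matrix = preprocess(rows, ["cap", "veil", "bruises"])
print(names)
print(matrix)
```

Every output column takes the value 0 or 1, so no single attribute dominates a Euclidean distance the way an arbitrary integer label code could.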