An enhanced Teaching-Learning-Based Optimization (TLBO) with Grey Wolf Optimizer (GWO) for text feature selection and clustering

Mahsa Azarshab, Mohammad Fathian, Babak Amiri
2024-02-19
Abstract: Text document clustering can play a vital role in organizing and handling the ever-increasing number of text documents. Uninformative and redundant features in large text documents reduce the effectiveness of the clustering algorithm. Feature selection (FS) is a well-known technique for removing these features. Since FS can be formulated as an optimization problem, various meta-heuristic algorithms have been employed to solve it. Teaching-Learning-Based Optimization (TLBO) is a novel meta-heuristic algorithm that benefits from a low number of parameters and fast convergence. A hybrid method can retain the advantages of TLBO while tackling its possible entrapment in local optima. By hybridizing TLBO with Grey Wolf Optimizer (GWO) and Genetic Algorithm (GA) operators, this paper proposes a filter-based FS algorithm (TLBO-GWO). Six benchmark datasets are selected, and TLBO-GWO is compared with three recently proposed FS algorithms that take similar approaches, as well as with the standard TLBO and GWO. The comparison is based on clustering evaluation measures, convergence behavior, and dimension reduction, and is validated using statistical tests. The results reveal that TLBO-GWO can significantly enhance the effectiveness of the text clustering technique (K-means).
Machine Learning, Neural and Evolutionary Computing, Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of uninformative and redundant features in text feature selection and clustering. Large text collections contain many such features, which reduce the effectiveness of clustering algorithms. The paper therefore proposes a hybrid feature selection algorithm (TLBO-GWO) based on Teaching-Learning-Based Optimization (TLBO) and Grey Wolf Optimizer (GWO), and applies K-means clustering to the reduced data to improve the effectiveness of text clustering.

### Main Issues of the Paper

1. **Text Feature Selection**: Text data contains a large number of non-informative and redundant features, which can degrade the performance of clustering algorithms.
2. **Improvement of Clustering Effectiveness**: Existing feature selection methods perform poorly on high-dimensional text data, so a more effective feature selection method is needed to improve clustering effectiveness.

### Solution

1. **Hybrid Algorithm (TLBO-GWO)**:
   - **TLBO**: Teaching-Learning-Based Optimization, which has the advantages of few parameters and fast convergence.
   - **GWO**: Grey Wolf Optimizer, which helps the search avoid entrapment in local optima.
   - **Genetic Operators**: Crossover and mutation operations from genetic algorithms, introduced to enhance the exploration and exploitation capabilities of the algorithm.
2. **Feature Selection Process**:
   - Preprocess each document: tokenization, stop-word removal, stemming, and term weighting.
   - Use the TLBO-GWO algorithm to select the most informative features for each document.
   - Merge the selected features of all documents to form a global feature subset.
3. **Clustering Process**:
   - Use the K-means clustering algorithm to cluster the updated text dataset.
   - Validate the effectiveness of the algorithm through clustering evaluation metrics, convergence behavior, and dimensionality reduction capability.
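
The feature selection process described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the stop-word list, the `crude_stem` helper, and the use of TF-IDF as the term-weighting scheme are all hypothetical choices, and `select_features` is a simple top-k placeholder standing in for the TLBO-GWO search, whose details this summary does not give.

```python
from __future__ import annotations

import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}


def crude_stem(token: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Term weighting as tf * log(N / df); TF-IDF is an assumed scheme."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def select_features(weights: dict[str, float], k: int) -> set[str]:
    """Placeholder selector: keep the k highest-weighted terms of a document.

    ASSUMPTION: this stands in for the TLBO-GWO search, which would instead
    optimize a binary feature mask against a filter-based objective.
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def global_subset(weighted_docs: list[dict[str, float]], k: int) -> list[str]:
    """Merge the per-document selections into one global feature subset."""
    selected: set[str] = set()
    for weights in weighted_docs:
        selected |= select_features(weights, k)
    return sorted(selected)


def project(weighted_docs: list[dict[str, float]],
            features: list[str]) -> list[list[float]]:
    """Build the reduced document-term matrix that would be passed to K-means."""
    return [[w.get(f, 0.0) for f in features] for w in weighted_docs]


if __name__ == "__main__":
    corpus = [
        "The wolves hunted in the forest",
        "Students learning from the teacher",
        "The teacher teaches students in class",
    ]
    weighted = tfidf([preprocess(d) for d in corpus])
    features = global_subset(weighted, k=2)
    print(features)
    print(project(weighted, features))
```

Swapping `select_features` for an optimizer-driven selector would leave the rest of the pipeline unchanged, which is why the selection step is isolated in its own function here.
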
### Experimental Validation

The paper selected six benchmark datasets and compared TLBO-GWO with three recently proposed feature selection algorithms as well as the standard TLBO and GWO. The results show that TLBO-GWO exhibits significant advantages in clustering effectiveness, convergence behavior, and dimensionality reduction.

### Conclusion

The paper proposes a new hybrid algorithm based on TLBO and GWO for text feature selection and clustering. Experimental results show that the algorithm can significantly improve the effectiveness of text clustering, especially on high-dimensional text data.