Abstract: Text classification, an important research area of text mining, can quickly and effectively extract valuable information, addressing the challenge of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications such as information filtering, information retrieval, public opinion monitoring, and library and information science, and few studies apply text classification methods to tourist attractions. In light of this, this paper constructs a corpus of tourist attraction description texts using web crawler technology. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, which improves on traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed representation is then combined with each of seven commonly used, well-performing classifiers (DT, SVM, LR, NB, MLP, RF, and KNN) to perform multi-class text classification over six subcategories of national A-level tourist attractions. The effectiveness and superiority of the algorithm are validated by comparing its overall performance, per-category performance, and model stability against several commonly used text representation methods. The results show that the proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset and even outperforms the high-performance BERT classification model currently favored by the industry, with accuracy (Acc), macro-F1, and micro-F1 values that are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
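Because the reported metrics include both macro- and micro-averaged F1 on an imbalanced dataset, it may help to see how the two averages differ. The following is a minimal sketch using the standard textbook definitions for single-label multi-class data; the abstract does not state the exact metric definitions used in the paper, and the labels `common` and `rare` and the toy predictions are purely hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so every sample counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, macro, micro


# Hypothetical imbalanced toy data: 8 "common" samples and 2 "rare" samples,
# with one "rare" sample misclassified as "common".
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]
acc, macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

In this toy case a single error on the rare class lowers macro-F1 to about 0.80 while micro-F1 stays at 0.90, which is why macro-F1 is the more sensitive indicator of how well rare categories are recognized.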
In addition, the conclusions drawn from the predicted values are consistent with those drawn from the true values, indicating that the algorithm is practical. The professional-domain text dataset used in this paper is challenging because of its complexity (uneven text length, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set. This work is a useful exploration of applied research on complex Chinese text datasets in specific fields and provides a useful reference for the vector representation and classification of text datasets with similar content.
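The general idea of weighting word embeddings by a term-importance score can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: the abstract does not give the formulas for the CRF and POS terms, so plain TF-IDF stands in for the TF-IDF-CRF-POS weight, the documents are assumed to be already tokenized, the 2-dimensional vectors in `toy_embeddings` are hypothetical stand-ins for trained Word2Vec vectors, and combining the vectors by weighted averaging is itself an assumption.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Plain TF-IDF per document: tf = count / doc length, idf = log(N / df).

    Stand-in for the paper's TF-IDF-CRF-POS weight, whose formula is not
    given in the abstract.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def weighted_doc_vector(doc_weights, embeddings, dim):
    """Weighted average of word vectors; terms with no embedding or zero weight are skipped."""
    vec = [0.0] * dim
    total = 0.0
    for term, w in doc_weights.items():
        if term in embeddings and w > 0:
            for i, x in enumerate(embeddings[term]):
                vec[i] += w * x
            total += w
    return [v / total for v in vec] if total else vec


# Hypothetical pre-tokenized descriptions and toy 2-D embeddings.
docs = [["lake", "mountain", "scenic"], ["museum", "history", "scenic"]]
toy_embeddings = {
    "lake": [1.0, 0.0],
    "mountain": [0.0, 1.0],
    "scenic": [1.0, 1.0],
    "museum": [-1.0, 0.0],
    "history": [0.0, -1.0],
}
doc_vectors = [
    weighted_doc_vector(w, toy_embeddings, dim=2) for w in tfidf_weights(docs)
]
```

Here the term `scenic` occurs in every document, so its IDF is zero and it contributes nothing to either vector; the resulting document vectors could then be passed to any of the seven classifiers named in the abstract.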

Parallel Text Categorization of Massive Text Based on Hadoop

Research on Parallelized Sentiment Classification Algorithms

Parallel Topic Model and Its Application on Document Clustering

Implementation of large-scale distributed information retrieval system

Fast text categorization based on collaborative work in the semantic and class spaces

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Parallel Non-negative Matrix Tri-Factorization for Text Data Co-clustering

Collaborative Work with Linear Classifier and Extreme Learning Machine for Fast Text Categorization

Evaluating Large Graph Processing in MapReduce Based on Message Passing

Parallel Image Texture Feature Extraction Under Hadoop Cloud Platform

Parallel Approach and Platform for Large-Scale WEB Data Extraction

Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark

Dimensionality Reduction With Category Information Fusion And Non-Negative Matrix Factorization For Text Categorization

Massive Image Data Management Using Hbase And Mapreduce

Parallel Sentiment Polarity Classification Method with Substring Feature Reduction

Content-Aware Partial Compression for Textual Big Data Analysis in Hadoop

Parallelization of Classification Algorithms Based on SparkR

A Study of Text Vectorization Method Combining Topic Model and Transfer Learning

Text classification algorithm of tourist attractions subcategories with modified TF-IDF and Word2Vec

Data Mining Algorithm for Cloud Network Information Based on Artificial Intelligence Decision Mechanism

Hierarchical Taxonomy Preparation for Text Categorization Using Consistent Bipartite Spectral Graph Copartitioning