Research on Methods of Parsing and Classification of Internet Super Large-scale Texts

Mengting Song, Hang Zheng, Zhen Tao, Jia Jiang, Bin Pan
DOI: https://doi.org/10.1088/1742-6596/1757/1/012121
2021-01-01
Journal of Physics: Conference Series
Abstract: Web crawlers are an important part of modern search engines. Data volumes have grown explosively, and mankind has entered a “big data era”. Wikipedia, for example, gathers knowledge from all over the world, records each day's news as it happens, and provides users with a good text search database [1]. Wikipedia's data updates exceed 50 GB per day. This project addresses the problems of data acquisition and data analysis: it downloads the latest Wikipedia data efficiently, parses the XML files, and then trains SVM and Naive Bayes models to classify the articles.
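The parse-then-classify pipeline the abstract describes might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `iter_pages`, `NaiveBayes`, and `SAMPLE` are hypothetical, the two-page sample and its labels are made up for the example, and only a Naive Bayes classifier is shown (the SVM step is omitted to keep the sketch within the standard library). It assumes the dump follows the MediaWiki XML export layout of `page`, `title`, and `text` elements.

```python
import io
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def iter_pages(source):
    """Yield (title, text) pairs from a MediaWiki-style XML export, one page at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Dump files put every tag in an XML namespace; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = child.tag.rsplit("}", 1)[-1]
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's content before reading the next one


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Python</title><revision><text>python is a programming language for software</text></revision></page>
  <page><title>Tiger</title><revision><text>the tiger is a large cat animal species</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
model = NaiveBayes().fit([text for _, text in pages], ["computing", "biology"])
print(model.predict("a new programming language"))
```

Reading the dump with `iterparse` rather than loading it whole is a design choice suited to very large files; a real dump would be opened from disk (and usually decompressed first) instead of from an in-memory sample.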