Study on Method of Web Content Mining for Non-XML Documents

Jianguo Chen, Hao Chen, Jie Guo
DOI: https://doi.org/10.1007/978-3-642-16339-5_31
2010-01-01
Abstract: Web content mining is an important means of collecting and analyzing information on the Internet. However, most web pages are non-XML documents, and how to extract useful information efficiently from massive numbers of web pages remains an interesting research topic. Based on an analysis of the features of web content mining, an XML-based web content mining method is proposed. First, the method identifies authoritative web pages using the HITS algorithm; it then cleans and extracts data with HTML Tidy, transforming the non-XML documents into structured XML documents; finally, it mines the XML documents using text clustering techniques. A scientific paper web site is chosen as a case study for web content extraction. Experimental results show that the proposed method works well and can extract web content efficiently and automatically.
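
The first step of the pipeline relies on HITS to rank pages by authority. As a rough illustration only (this is not the authors' implementation, and the function name `hits`, the dictionary-based link graph, and the fixed iteration count are all assumptions made for this sketch), the standard hub/authority iteration can be written in plain Python as follows:

```python
import math


def hits(graph, iterations=50):
    """Run the HITS hub/authority iteration.

    graph: dict mapping each page to the list of pages it links to.
    Returns (hub, auth) dicts of L2-normalized scores.
    """
    # Collect every page that appears as a source or a link target.
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in graph.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Hypothetical three-page link graph: "a" and "b" both link to "c".
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
    print(max(auth, key=auth.get))
```

In this toy graph the page with the highest authority score is the one that both other pages link to. In the pipeline the abstract describes, pages selected this way would then be cleaned, converted to XML, and clustered.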