The Technology of Extracting Content Information from Web Page Based on DOM Tree
Dingrong Yuan, Zhuoying Mo, Bing Xie, Yangcai Xie
DOI: https://doi.org/10.1007/978-3-642-20370-1_45
2011-01-01
Abstract: Web pages contain huge amounts of information, including both content information and other, useless information such as navigation, advertisements and Flash animation. To reduce the effort required of Web users, we established a technique to extract the content information from a Web page. First, we analyzed the semantics of Web documents with Google's V8 engine and parsed each document into a DOM tree. Then we traversed the DOM tree and pruned it according to the characteristics of the language used to write the Web page. Finally, we extracted the content information from the Web page. Theoretical analysis and experiments showed that the technique can simplify a Web page, present its content information to Web users, and supply clean data for application areas such as retrieval, KDD and DM on the Web.
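
The parse, prune and extract pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses Python's standard html.parser in place of the V8 engine, and the set of tags treated as noise (NOISE_TAGS) is an assumption made for illustration, since the abstract does not list the actual pruning rules. The names Node, TreeBuilder, prune, extract_text and extract_content are likewise hypothetical.

```python
from html.parser import HTMLParser

# Assumed noise tags; the abstract does not give the real pruning criteria.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside",
              "iframe", "object", "embed", "form"}
# HTML void elements never have a closing tag, so we must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM node: a tag name plus child nodes or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from an HTML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def prune(node):
    """Traverse the tree and drop every subtree rooted at a noise tag."""
    node.children = [c for c in node.children
                     if isinstance(c, str) or c.tag not in NOISE_TAGS]
    for child in node.children:
        if isinstance(child, Node):
            prune(child)


def extract_text(node):
    """Collect the remaining text in document order."""
    parts = [c if isinstance(c, str) else extract_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return extract_text(builder.root)


if __name__ == "__main__":
    page = ("<html><body><nav><a href='/'>Home</a></nav>"
            "<div><h1>Title</h1><p>Main text.</p></div>"
            "<script>var x = 1;</script></body></html>")
    print(extract_content(page))  # prints "Title Main text."
```

A fixed tag blacklist is the simplest possible pruning rule. A real system would likely also weigh features such as link density or text length per subtree, which this sketch does not attempt.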