Web News Extraction Via Tag Path Feature Fusion Using DS Theory

Gong-Qing Wu, Lei Li, Li, Xindong Wu
DOI: https://doi.org/10.1007/s11390-016-1655-1
IF: 1.871
2016-01-01
Journal of Computer Science and Technology
Abstract: The contents, layout styles, and parse structures of web news pages differ greatly from one page to another. In addition, the layout style and parse structure of a given web news page may change over time. For these reasons, designing features that extract well across massive and heterogeneous web news pages is a challenging problem. Our extensive case studies indicate a potential correlation between web content layouts and their tag paths. Inspired by this observation, we design a series of tag path extraction features for extracting web news. Because each feature has its own strengths, we fuse all of these features using DS (Dempster-Shafer) evidence theory and then design a content extraction method, CEDS. Experimental results on both the CleanEval datasets and web news pages selected randomly from well-known websites show that the F1-score of CEDS is 8.08% and 3.08% higher than that of the existing popular content extraction methods CETR and CEPR-TPR, respectively.
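As a rough illustration of the fusion step named in the abstract, the sketch below applies Dempster's rule of combination to two mass functions. It is not the CEDS method itself: the abstract does not specify the features or how their masses are computed, so the two-hypothesis frame (`content` vs. `noise`), the names `feature_a` and `feature_b`, and all mass values are assumptions made up for the example.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass;
    the masses in each function are expected to sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence accumulates on the intersection.
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            # Disjoint hypotheses contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalise by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


# Hypothetical frame of discernment for one text node.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass left uncommitted

# Hypothetical evidence from two tag path features (values are invented).
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine(feature_a, feature_b)
```

Because both hypothetical features lean towards `content`, the fused mass on `content` ends up higher than either feature assigns on its own, which is the intuition behind fusing features that each have their own strengths.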