Content Extraction Algorithm of HTML Pages Based on Optimized Weight

Qi Wu, Xing-shu Chen, Jun Tan
DOI: https://doi.org/10.3969/j.issn.1000-565X.2011.04.006
2011-01-01
Abstract: With the growing number of advertisements in HTML pages, extracting the main content accurately has become increasingly difficult. To address this problem, a content extraction algorithm for HTML pages based on optimized weights is proposed. First, the characteristics of content blocks in web pages are analyzed to obtain statistical features of their attributes. Then, because the features differ in importance, their weights and thresholds are optimized with the particle swarm optimization algorithm, which further improves the performance of the algorithm. Finally, experiments are conducted to verify its effectiveness. The results show that, compared with the algorithm using un-optimized weights, the proposed algorithm raises the recall of content extraction to 95.8% without reducing precision.
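The general idea of scoring blocks by weighted features and tuning the weights and threshold with particle swarm optimization can be sketched as follows. This is a minimal illustration and not the authors' implementation. The abstract does not give the actual features, fitness function, or PSO settings, so the three feature columns, the toy labelled samples, the use of F1 as the fitness, and all hyperparameters below are assumptions.

```python
import random

def block_score(features, weights):
    """Weighted sum of a block's feature values."""
    return sum(w * f for w, f in zip(weights, features))

def classify(features, weights, threshold):
    """A block is treated as content when its score reaches the threshold."""
    return block_score(features, weights) >= threshold

def f1(params, samples):
    """Fitness (assumed): F1 of the classifier; params = weights + [threshold]."""
    *weights, threshold = params
    tp = fp = fn = 0
    for features, is_content in samples:
        predicted = classify(features, weights, threshold)
        if predicted and is_content:
            tp += 1
        elif predicted and not is_content:
            fp += 1
        elif not predicted and is_content:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pso(fitness, dim, n_particles=20, iterations=50,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO that maximizes `fitness` over [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # personal best positions
    best_fit = [fitness(p) for p in pos]      # personal best fitness values
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best so far
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Keep every parameter inside [0, 1].
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit

# Hypothetical toy data: (text density, 1 - link density, punctuation ratio),
# each scaled to [0, 1], with a label saying whether the block is content.
samples = [
    ([0.9, 0.8, 0.7], True),
    ([0.8, 0.9, 0.5], True),
    ([0.7, 0.6, 0.6], True),
    ([0.2, 0.1, 0.1], False),
    ([0.3, 0.2, 0.0], False),
    ([0.4, 0.1, 0.2], False),
]

# Three weights plus one threshold give a four-dimensional search space.
best_params, best_fit = pso(lambda p: f1(p, samples), dim=4)
```

Here `best_params[:3]` holds the tuned weights and `best_params[3]` the tuned threshold. A different fitness, such as recall subject to a minimum precision, could be substituted for `f1` without changing the optimizer.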