An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability

Qingsong Lv, Shulin Cao, Yifan Wang, Qian Yin, Xin Zheng
DOI: https://doi.org/10.2991/emim-17.2017.19
2017-01-01
Abstract: Because web pages have diverse content and complex structure, it is of great significance to handle them with a uniform algorithm. In this paper, we propose an algorithm, called the P value algorithm, to extract the main text of a web page. By calculating the P value of each tag in an HTML page, we can locate the main text. Moreover, the P value of a web page can also represent the probability of “This web page has main text”. Experiments show that the accuracy of main-text extraction is 95.42% and the accuracy of judging whether a page has main text is 93.98%, without any prior knowledge.