Web Data Extraction from Scientific Publishers' Website Using Hidden Markov Model.

Jing Huang, Ziyu Liu, Beibei Wang, Mingyue Duan, Bo Yang
DOI: https://doi.org/10.1007/978-3-319-99365-2_42
2018-01-01
Abstract: The volume of information on web pages is growing rapidly, and a large number of papers are published in more than three thousand journals, especially in technology fields. It is almost impossible for a user to look up this information item by item: finding details such as a journal's introduction, impact factor, or ISSN across thousands of journals requires clicking through many links. To solve this problem, an automatic method is needed that filters such information out of the deep web. The method in this paper helps users quickly obtain the information they need, classified and extracted. The paper makes the following contributions: first, a machine learning method, the Hidden Markov Model (HMM), is used to extract journal information from publishers' websites, which generalizes better than the heuristic method; second, during the data processing step, a content extraction technique is used to improve the performance of the HMM; finally, the extracted information is stored in a structured form and displayed. In the experiments, three algorithms are compared in terms of accuracy, recall, and F-measure; the results show that HMM with content extraction (C-HMM) performs best.
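The abstract does not specify the model's states, features, or parameters. As a rough illustration of how an HMM can label page tokens with journal fields, here is a minimal Viterbi decoding sketch. The state names, the coarse token features (`word`, `issn_pattern`, `decimal`), and all probabilities are hypothetical values chosen for the example, not taken from the paper.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a list of observation symbols."""
    # Log probabilities avoid numerical underflow on long sequences.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: three field labels and three coarse token features.
states = ["other", "issn", "impact_factor"]
start_p = {"other": 0.8, "issn": 0.1, "impact_factor": 0.1}
trans_p = {
    "other":         {"other": 0.6, "issn": 0.2, "impact_factor": 0.2},
    "issn":          {"other": 0.8, "issn": 0.1, "impact_factor": 0.1},
    "impact_factor": {"other": 0.8, "issn": 0.1, "impact_factor": 0.1},
}
emit_p = {
    "other":         {"word": 0.90, "issn_pattern": 0.05, "decimal": 0.05},
    "issn":          {"word": 0.05, "issn_pattern": 0.90, "decimal": 0.05},
    "impact_factor": {"word": 0.05, "issn_pattern": 0.05, "decimal": 0.90},
}

# Features for a token stream such as: "ISSN:", "1234-5678", "IF:", "2.5"
tokens = ["word", "issn_pattern", "word", "decimal"]
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
print(labels)  # ['other', 'issn', 'other', 'impact_factor']
```

In a real pipeline the probabilities would be estimated from labeled pages, and the content extraction step mentioned in the abstract would presumably remove navigation and other page noise before the tokens reach the decoder.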