HIDDEN WEBPAGE INFORMATION EXTRACTION ALGORITHM USING DOM STATE TRANSFER

Yong Fang, Yinsheng Li
DOI: https://doi.org/10.3969/j.issn.1000-386x.2015.09.005
2015-01-01
Abstract: Webpages contain a great deal of dynamic JavaScript, which leaves much of their content invisible to traditional web crawlers. We therefore propose a hidden-webpage information extraction algorithm based on DOM state transfer. The algorithm incrementally constructs a DOM state transfer machine, taking DOM nodes and their click events as the machine's input events. Transfer paths that change the object (target) nodes are searched recursively, and the content of the object nodes is then captured automatically by replaying the click path. To obtain the candidate nodes, the algorithm overrides the prototype of the event-listener registration method and so collects all clickable nodes in the DOM tree. It uses the RTDM algorithm and a self-defined filter to compress the DOM state space and thereby shrink the search space, and it performs heuristic search with the heuristic value h defined as the distance in the DOM tree between a candidate node and the object nodes. Experiments show that the algorithm achieves 89.48% accuracy in hidden webpage content extraction and that it could be applied in areas such as automated webpage testing and web crawling.
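The heuristic search described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that DOM states are abstract hashable identifiers, that each node is addressed by its child-index path from the root, and that h is the number of tree edges between a candidate node and the object node. The RTDM-based state compression and the filter are left out, and every name here (tree_distance, find_click_path, TRANSITIONS) is hypothetical.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

# Assumption: a DOM node is addressed by its child indices from the root.
NodePath = Tuple[int, ...]


def tree_distance(a: NodePath, b: NodePath) -> int:
    """Number of tree edges between two nodes of the same tree."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def find_click_path(
    start: Hashable,
    target: NodePath,
    candidates: Callable[[Hashable], Iterable[NodePath]],
    click: Callable[[Hashable, NodePath], Hashable],
    is_goal: Callable[[Hashable], bool],
    max_states: int = 1000,
) -> Optional[List[NodePath]]:
    """Greedy best-first search over DOM states.

    Candidates closer to the target node in the DOM tree are expanded first.
    Returns the sequence of clicked nodes (the click path to replay), or None.
    """
    counter = 0  # tie-breaker so the heap never has to compare states
    frontier = [(0, counter, start, [])]
    seen = {start}
    while frontier and len(seen) <= max_states:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for node in candidates(state):
            nxt = click(state, node)
            if nxt in seen:
                continue  # state already discovered through another path
            seen.add(nxt)
            counter += 1
            h = tree_distance(node, target)
            heapq.heappush(frontier, (h, counter, nxt, path + [node]))
    return None


# Toy state machine (made up for illustration): state -> {clicked node: next state}
TRANSITIONS = {
    "home": {(0, 0): "menu", (1,): "ads"},
    "menu": {(0, 0, 1): "details"},
}

path = find_click_path(
    start="home",
    target=(0, 0, 1, 0),
    candidates=lambda s: TRANSITIONS.get(s, {}).keys(),
    click=lambda s, n: TRANSITIONS[s][n],
    is_goal=lambda s: s == "details",
)
print(path)  # [(0, 0), (0, 0, 1)]
```

In this toy run the click on node (0, 0) is explored before the click on node (1,) because it lies closer to the target in the tree (distance 2 versus 5). The returned list is the click path that would then be replayed to capture the target content.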