Tag Tree Template for Web Information and Schema Extraction.
Xiangwen Ji, Jianping Zeng, Shiyong Zhang, Chengrong Wu
DOI: https://doi.org/10.1016/j.eswa.2010.05.027
IF: 8.5
2010-01-01
Expert Systems with Applications
Abstract: Extracting information from the Web is both interesting and challenging, and it can support Web search, information retrieval, and Web mining. Many sites produce their pages dynamically, filling an HTML template with structured records drawn from a back-end database. To efficiently extract meaningful information, including records and their data schema, from such pages, a new method based on a Tag tree template is proposed. Web pages from different Web sites are parsed into Tag trees, and templates for each site are then generated from the trees using a cost-based tree similarity measure. The exclusive content of each page is then extracted by using the templates to parse the page. Finally, the records in the pages and the schema of those records are extracted from the exclusive content by finding repeating patterns and applying heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information.
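
The pipeline outlined in the abstract (parse pages into tag trees, compare trees with a cost-based measure, keep the content a page does not share with its siblings) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the cost function or how templates are built, so the top-down edit cost, the `similarity` normalization, and the reading of "exclusive content" as text absent from a second page of the same site are all assumptions made for illustration. All names (`Node`, `parse`, `edit_cost`, `exclusive_text`) are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no close

class Node:
    def __init__(self, tag, text=""):
        self.tag, self.text, self.children = tag, text, []

class TagTreeBuilder(HTMLParser):
    """Builds a tag tree; text runs become '#text' leaf nodes."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))

def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def size(node):
    return 1 + sum(size(child) for child in node.children)

def edit_cost(a, b):
    """Assumed cost model: roots with different tags cost both whole subtrees;
    otherwise the child lists are aligned by an edit-distance recurrence."""
    if a.tag != b.tag:
        return size(a) + size(b)
    xs, ys = a.children, b.children
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                    # drop subtree of a
                dp[i][j - 1] + size(ys[j - 1]),                    # drop subtree of b
                dp[i - 1][j - 1] + edit_cost(xs[i - 1], ys[j - 1]),  # match the pair
            )
    return dp[-1][-1]

def similarity(a, b):
    """1.0 for structurally identical trees, 0.0 for nothing in common."""
    return 1 - edit_cost(a, b) / (size(a) + size(b))

def text_paths(node, path=""):
    """Yields (tag path of the parent, text) for every text leaf."""
    if node.tag == "#text":
        yield path, node.text
    for child in node.children:
        yield from text_paths(child, f"{path}/{node.tag}")

def exclusive_text(page, reference):
    """Text of `page` that does not occur at the same tag path in `reference`."""
    shared = set(text_paths(reference))
    return [text for path, text in text_paths(page) if (path, text) not in shared]

page_a = "<html><body><div>Site Menu</div><ul><li>Book A</li><li>Book B</li></ul></body></html>"
page_b = "<html><body><div>Site Menu</div><ul><li>Book C</li></ul></body></html>"
tree_a, tree_b = parse(page_a), parse(page_b)
print(similarity(tree_a, tree_b))
print(exclusive_text(tree_a, tree_b))
```

In this toy example the two pages differ only by one extra `<li>` record, so the structural similarity stays high, and the shared "Site Menu" text is filtered out as template content. The repeated `li` siblings left over are the kind of repeating pattern from which records and their schema would then be inferred; that step is not sketched here.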