Study on Tables Information Extraction Based on Web

秦振海,谭守标,徐超
DOI: https://doi.org/10.3969/j.issn.1673-629x.2010.02.057
2010-01-01
Abstract: The web has become a main source of information, and reports indicate that tables are used frequently in web documents. Because tables are inherently concise and information-rich, automatic table understanding has many applications, including knowledge management, information retrieval, and web mining, so the study of web-based table information extraction has important practical significance. A large amount of the information available on the web is formatted in HTML tables, which are not content-oriented and are not suitable for understanding and querying by machines. In this paper, HTML documents are first transformed into XML documents and combined with an ontology to discover heuristics. Two key technologies are then analysed: web table detection and web table structure recognition. On this basis, the HTML tables are normalized according to their attributes, which makes the approach suitable for extracting information from complicated tables.
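The abstract does not spell out how the tables are normalized. As a minimal sketch only, assuming that "normalizing according to the attributes of HTML tables" includes expanding the rowspan and colspan attributes so that every row has the same number of columns, the step could look like the following. The names TableGridParser and normalize_table are hypothetical and are not taken from the paper; the sketch also assumes a single, non-nested table with numeric span values.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect every <td>/<th> cell, row by row, with its span attributes.

    Hypothetical helper: assumes one non-nested table in the input.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                # A missing or empty span attribute defaults to 1.
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def normalize_table(html):
    """Return a rectangular list of rows with spanned cells duplicated."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every slot that the cell covers.
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


# Illustrative input (made-up data, not from the paper).
sample = """
<table>
<tr><th rowspan="2">City</th><th colspan="2">Temp</th></tr>
<tr><th>Min</th><th>Max</th></tr>
<tr><td>Hefei</td><td>3</td><td>9</td></tr>
</table>
"""
for row in normalize_table(sample):
    print(row)
```

Duplicating the text of a spanned cell into each slot it covers keeps every data cell aligned with its header cells, which is one common way to make a complicated table easier to query; the paper's own normalization rules may differ.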