Research on Web Information Extraction Technology Based on XML Description

Shai Fei, Wang Jia, Pan Chao
2007-01-01
Abstract: The Internet has become one of the most important channels through which people access information and services. The most distinctive feature of data on the Web is that it is semi-structured. Information on the Web is organized mainly as HTML, and HTML markup describes only how data is presented, not its meaning or structure. As a result, a computer cannot identify the data automatically. XML, by contrast, is a semantics-oriented language, and its emergence provides the conditions for solving this problem; in other words, data described in XML is easier for a computer to identify. This paper therefore analyzes the process of extracting such data by means of XML-based description.
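
The contrast the abstract draws between presentation markup and semantic markup can be illustrated with a minimal sketch. This is not the authors' method; the sample record, the tag names (`book`, `title`, `price`) and the helper `extract_fields` are hypothetical, chosen only to show why semantically tagged data is easier for a program to identify. It uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical record marked up for presentation only (HTML-style tags).
# The tags say "table cell" and "bold", not what the values mean.
html_like = "<tr><td><b>XML Basics</b></td><td>29.90</td></tr>"

# The same hypothetical record with semantic XML tags.
xml_doc = "<book><title>XML Basics</title><price>29.90</price></book>"


def extract_fields(xml_text):
    """Return a dict mapping each child tag name to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


# Semantic markup: field names come directly from the tags.
record = extract_fields(xml_doc)

# Presentation markup: only positional cell values can be recovered,
# so the program still has to guess which cell is the title or price.
cells = ["".join(td.itertext()) for td in ET.fromstring(html_like)]

print(record)
print(cells)
```

In the semantic case the extracted values arrive already labeled, while in the presentation case the meaning of each cell must be inferred from its position or from rules written for that specific page layout.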