DOM-Based Automatic Extraction of Topical Information from Web Pages

Wang Qi (王琦), Tang Shiwei (唐世渭), Yang Dongqing (杨冬青), Wang Tengjiao (王腾蛟)
2004-01-01
Journal of Computer Research and Development
Abstract: The Web is a vast resource of information, but the way that information is presented limits its availability: the main content of a web page is frequently hidden among unimportant features such as unnecessary images and extraneous links, which makes it difficult for users to acquire the topical information. Information extraction can help users locate the information of interest. A new DOM-based extraction methodology is proposed, which transforms DOM trees into STU DOM trees and then processes them with a set of algorithms. A STU DOM tree can be viewed as a DOM tree with some semantic contextual attributes. The key algorithm filters and prunes the STU DOM tree, so that the useful and relevant content can be extracted automatically and accurately from HTML documents. The approach is a universal method, independent of document structures and domains. Unlike most approaches, it maintains the structure as well as the content. Hence the approach is significant and reliable. It can be widely applied to web browsing on handheld devices, such as PDAs and mobile phones, and to retrieval systems.
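
The abstract does not give the STU DOM attributes or the pruning rules, so the following is only a minimal sketch of the general idea (annotate a DOM-like tree, then filter and prune it), not the authors' algorithm. Everything in it is an assumption made for illustration: the Node class, the text_len and link_len attributes, the link-ratio heuristic, and the 0.5 threshold are all hypothetical. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """A DOM-like node plus two illustrative (hypothetical) attributes."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects or text strings
        self.text_len = 0   # characters of text in this subtree
        self.link_len = 0   # characters of that text inside <a> elements


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.current.children.append(text)


def annotate(node, inside_link=False):
    """Fill text_len and link_len bottom-up, ignoring script and style."""
    inside_link = inside_link or node.tag == "a"
    for child in node.children:
        if isinstance(child, str):
            node.text_len += len(child)
            if inside_link:
                node.link_len += len(child)
        elif child.tag not in SKIP_TAGS:
            annotate(child, inside_link)
            node.text_len += child.text_len
            node.link_len += child.link_len


def prune(node, max_link_ratio=0.5):
    """Drop text-free subtrees and blocks whose text is mostly link text."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in SKIP_TAGS or child.text_len == 0:
            continue  # scripts, styles, images, empty containers
        if child.tag != "a" and child.link_len / child.text_len > max_link_ratio:
            continue  # navigation-like block
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def to_text(node):
    """Concatenate the text that survives pruning, in document order."""
    parts = [c if isinstance(c, str) else to_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_topical_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    prune(builder.root)
    return to_text(builder.root)


SAMPLE_PAGE = """<html><body>
<div id="nav"><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></div>
<div id="main"><h1>DOM pruning</h1><p>Topical text with one <a href="/x">inline link</a> inside it.</p><img src="ad.png"></div>
<script>var x = 1;</script>
</body></html>"""

if __name__ == "__main__":
    print(extract_topical_text(SAMPLE_PAGE))
```

In this sketch the navigation list, the image, and the script are pruned, while the heading and the paragraph (including its inline link) are kept. Because pruning removes nodes from the tree rather than flattening it, the surviving nodes keep their original nesting, which loosely mirrors the abstract's point about maintaining structure as well as content.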