Site abstraction for rare category classification in large-scale web directory.

Tie-Yan Liu, Hao Wan, Tao Qin, Zheng Chen, Yong Ren, Wei-Ying Ma
DOI: https://doi.org/10.1145/1062745.1062892
2005-01-01
Abstract: Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification technologies do not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the website that hosts an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
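
The abstract does not give the propagation rule, so the following is only a minimal sketch of what propagating features through parent-child relationships in a sitemap tree could look like. The `Page` class, the `propagate` function, and the `child_weight` discount are hypothetical names and choices made for illustration; they are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """A node in a hypothetical sitemap tree (illustrative, not the paper's data model)."""
    url: str
    terms: Counter                      # term counts observed on this page
    children: List["Page"] = field(default_factory=list)


def propagate(page: Page, child_weight: float = 0.5) -> Counter:
    """Combine a page's own term counts with a discounted share of its descendants'.

    ASSUMPTION: a fixed multiplicative discount per tree level. The actual
    weighting used by Site Abstraction is not stated in the abstract.
    """
    combined = Counter()
    for term, count in page.terms.items():
        combined[term] += count
    for child in page.children:
        # Recurse first, so a grandchild is discounted twice (child_weight ** 2).
        for term, count in propagate(child, child_weight).items():
            combined[term] += child_weight * count
    return combined


# Toy sitemap: a home page with one child and one grandchild.
rates = Page("/rooms/rates", Counter({"price": 4}))
rooms = Page("/rooms", Counter({"hotel": 2, "suite": 4}), [rates])
home = Page("/", Counter({"hotel": 1}), [rooms])

# A synthesized feature vector for the site root.
site_features = propagate(home)
```

Under these assumptions, the propagated vector for the root page could serve as an extra training example for the category of the original training document, which is the kind of synthesis the abstract describes.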