Deep Classification in Large-Scale Text Hierarchies

Gui-Rong Xue,Dikan Xing,Qiang Yang,Yong Yu
DOI: https://doi.org/10.1145/1390334.1390440
2008-01-01
Abstract:Most classification algorithms are best at categorizing the Web documents into a few categories, such as the top two levels in the Open Directory Project. Such a classification method does not give very detailed topic-related class information for the user because the first two levels are often too coarse. However, classification on a large-scale hierarchy is known to be intractable for many target categories with cross-link relationships among them. In this paper, we propose a novel deep-classification approach to categorize Web documents into categories in a large-scale taxonomy. The approach consists of two stages: a search stage and a classification stage. In the first stage, a category-search algorithm is used to acquire the category candidates for a given document. Based on the category candidates, we prune the large-scale hierarchy to focus our classification effort on a small subset of the original hierarchy. As a result, the classification model is trained on the small subset before being applied to assign the category for a new document. Since the category candidates are sufficiently close to each other in the hierarchy, a statistical-language-model based classifier using n-gram features is exploited. Furthermore, the structure of the taxonomy can be utilized in this stage to improve the performance of classification. We demonstrate the performance of our proposed algorithms on the Open Directory Project with over 130,000 categories. Experimental results show that our proposed approach can reach 51.8% on the measure of Mi-F1 at the 5th level, which is 77.7% improvement over top-down based SVM classification algorithms.
What problem does this paper attempt to address?