Automatic Corpora Construction for Text Classification.

Dandan Wang, Qingcai Chen, Xiaolong Wang, Bingyang Yu
2013-01-01
Abstract: As machines become increasingly intelligent, it is reasonable to expect text classifiers to be constructed automatically from nothing more than the target categories. As a trade-off, existing research usually supplements the category terms with additional information to improve classifier performance. In contrast, this paper constructs standard corpora from the web given only the text categories. Because millions of websites have been built manually, they are a promising source of text categorization (TC) knowledge. We therefore go directly to the web and use the hierarchies implied in navigation bars to extract and verify TC resources. By addressing the issues of navigation bar recognition and text filtering, we construct corpora for the given text categories and train classifiers on them. We conduct experiments on a large collection of webpages gathered from the top 500 English websites on Alexa, with the Open Directory Project (ODP) used as the testing corpus. Experimental results show that, compared with a classifier based on a manually labeled corpus, the classifier trained on the automatically constructed corpora reaches comparable performance for the categories that are well covered by the training corpus.