Extracting Domain-Specific Terms From Unlabeled Web Documents By Bootstrapping And Term Classifiers

Tao Liu, Xiao-Long Wang, Bing-Quan Liu, Yuan-Chao Liu, Ming-Hui Li
DOI: https://doi.org/10.1109/ICSMC.2007.4413834
2007-01-01
Abstract: Domain-specific term extraction contributes to all domain-oriented natural language processing tasks. Given a small set of domain-specific seed terms, new terms can be extracted from unlabeled corpora by bootstrapping a term classifier that discovers associations between the seed terms and new terms. The traditional representation for domain-specific term extraction places a term in a feature space of documents, which captures associations between terms that share common documents. This representation cannot capture inner-document information about terms, and it requires extracted terms to occur in multiple documents. This paper proposes a new term representation in a global contextual space for domain-specific term extraction. It captures associations between terms that share common global contexts, and these contexts describe a term both within a given document and across the corpus. Experiments on a Chinese web corpus show that the proposed method with the global contextual representation outperforms the traditional method with the document-space representation, and that the improvement is much higher for low-frequency terms.
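The contrast between the two representations can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifier, the context definition, or any parameters, so the fixed word window, the centroid cosine scorer standing in for the term classifier, the threshold, and all function names below are assumptions chosen for illustration. The toy English corpus is also hypothetical (the paper uses a Chinese web corpus).

```python
import math
from collections import Counter

def document_vectors(docs, terms):
    """Document space: a term is described by the documents it occurs in."""
    vecs = {t: Counter() for t in terms}
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            if tok in vecs:
                vecs[tok][doc_id] += 1
    return vecs

def context_vectors(docs, terms, window=2):
    """Global contextual space (assumed form): a term is described by the
    words within `window` positions of it, pooled over the whole corpus."""
    vecs = {t: Counter() for t in terms}
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok in vecs:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vecs[tok][tokens[j]] += 1
    return vecs

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bootstrap(vecs, seeds, rounds=2, per_round=1, threshold=0.3):
    """Grow the seed set: each round, score candidates against the centroid
    of accepted terms (a stand-in for the term classifier) and accept the
    best ones that clear the threshold."""
    accepted = set(seeds)
    for _ in range(rounds):
        centroid = Counter()
        for t in accepted:
            centroid.update(vecs[t])
        scored = sorted(
            ((cosine(vecs[t], centroid), t) for t in vecs if t not in accepted),
            reverse=True,
        )
        new = [t for score, t in scored[:per_round] if score >= threshold]
        if not new:
            break
        accepted.update(new)
    return accepted

# Hypothetical toy corpus: each candidate term occurs in exactly one document.
docs = [
    ["the", "patient", "received", "aspirin", "for", "pain"],
    ["the", "patient", "received", "ibuprofen", "for", "pain"],
    ["the", "team", "won", "the", "match", "today"],
]
terms = ["aspirin", "ibuprofen", "match"]
seeds = ["aspirin"]

by_document = bootstrap(document_vectors(docs, terms), seeds)
by_context = bootstrap(context_vectors(docs, terms), seeds)
```

In this toy setup "ibuprofen" never shares a document with the seed, so the document-space vectors give it zero similarity and the seed set does not grow, while the pooled context vectors match the seed's contexts and the term is accepted. This mirrors the abstract's point about terms that occur in few documents, under the assumptions stated above.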