Multi-View Text Categorization Based on Random Forests

田宝明,戴新宇,陈家骏
DOI: https://doi.org/10.3969/j.issn.1003-0077.2009.04.008
2009-01-01
Abstract: The term-based Vector Space Model (VSM) is a traditional approach to representing documents, but it neglects the relations between terms. To capture these relations, document representations based on latent topics, such as LDA (Latent Dirichlet Allocation), have attracted much attention recently. However, a purely latent-topic-based representation may lose information carried by the terms themselves. In this paper, we use a modified random forests method to combine the term-based and LDA latent-topic-based document representations. Random forests are constructed separately for the two kinds of text representation, and the final classification result is decided by a voting scheme. Experimental results on several standard datasets show that, compared with methods using only one set of text features, our method effectively combines the two kinds of text representation and improves text categorization performance.
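The voting step can be illustrated with a minimal sketch. The abstract does not specify the exact vote scheme or how the random forests are modified, so the code below assumes plain majority voting over the pooled per-tree predictions of the two forests. The function name `pooled_vote`, the tie-breaking rule, and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def pooled_vote(term_tree_votes: Sequence[Hashable],
                topic_tree_votes: Sequence[Hashable]) -> Hashable:
    """Return the label with the most votes across both forests.

    Each argument holds one predicted label per tree for a single document:
    one list from the forest trained on term features, one from the forest
    trained on LDA topic features. Assumption: every tree carries equal
    weight, and ties go to the label seen first in the pooled list.
    """
    pooled = list(term_tree_votes) + list(topic_tree_votes)
    if not pooled:
        raise ValueError("at least one tree vote is required")
    counts = Counter(pooled)
    best = max(counts.values())
    # Scan in pooled order so tie-breaking is explicit and deterministic.
    for label in pooled:
        if counts[label] == best:
            return label


# Hypothetical votes for one document, five trees per view.
term_votes = ["sports", "sports", "politics", "sports", "politics"]
topic_votes = ["politics", "politics", "politics", "sports", "politics"]
print(pooled_vote(term_votes, topic_votes))
```

In this made-up case the term-view forest alone would choose "sports" (3 votes to 2), while the pooled vote chooses "politics" (6 votes to 4), which shows how the second view can change the final decision.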