Refine the Corpora Based on Document Manifold.

Chengwei Yao, Yilin Wang, Gencai Chen
DOI: https://doi.org/10.1007/978-3-642-53914-5_27
2013-01-01
Abstract: Tracking and using the overwhelming volume of news generated on the internet is challenging. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. However, in many real applications, the topics inferred by these models are not very useful, because a proportion of the documents always belongs to no topic. In this paper, we propose a new technique to refine the document corpora before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, find the documents that are only loosely related to other documents, and exclude them from the corpora. Experiments show that combining topic models with our algorithm significantly improves the quality of the topics. © Springer-Verlag 2013.
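The refinement idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the graph is built or how "loose relation" is measured, so everything specific here is an assumption made for illustration: a dense cosine-similarity graph, a two-dimensional embedding, a looseness score defined as the mean distance to the k nearest neighbours in the embedded space, and the hypothetical names `cosine_affinity`, `laplacian_eigenmap`, `looseness_scores`, `refine_corpus`, and `drop_fraction`. It should not be read as the authors' actual algorithm.

```python
import numpy as np


def cosine_affinity(X):
    """Dense cosine-similarity graph between documents (rows of X).

    Assumption: a dense graph; a k-nearest-neighbour graph is a common alternative.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    W = np.clip(Xn @ Xn.T, 0.0, None)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def laplacian_eigenmap(W, n_components=2):
    """Embed nodes with the smallest non-trivial solutions of L y = lambda D y.

    Assumes a connected graph, so only the first eigenvector is trivial.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Convert to generalized eigenvectors and skip the constant one.
    return d_inv_sqrt[:, None] * vecs[:, 1:1 + n_components]


def looseness_scores(Y, k=2):
    """Assumed score: mean distance to the k nearest neighbours in the embedding."""
    diff = Y[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # ignore distance to self
    return np.sort(dist, axis=1)[:, :k].mean(axis=1)


def refine_corpus(X, drop_fraction=0.1, n_components=2, k=2):
    """Return indices of documents to keep, plus the looseness scores."""
    Y = laplacian_eigenmap(cosine_affinity(X), n_components)
    scores = looseness_scores(Y, k)
    n_drop = int(np.ceil(drop_fraction * len(scores)))
    dropped = np.argsort(scores)[::-1][:n_drop]  # loosest documents first
    keep = np.setdiff1d(np.arange(len(scores)), dropped)
    return keep, scores


if __name__ == "__main__":
    # Toy term-count matrix: two topics (rows 0-3 and 4-7) and one stray document.
    X = np.array([
        [3, 3, 0, 0, 0, 1], [6, 6, 0, 0, 0, 2], [3, 3, 0, 0, 0, 1], [9, 9, 0, 0, 0, 3],
        [0, 0, 3, 3, 0, 1], [0, 0, 6, 6, 0, 2], [0, 0, 3, 3, 0, 1], [0, 0, 9, 9, 0, 3],
        [0, 0, 0, 0, 5, 1],
    ])
    keep, scores = refine_corpus(X, drop_fraction=0.1)
    print("kept documents:", keep.tolist())
```

The rows returned in `keep` would then be passed to a topic model such as LDA or pLSI. In practice, the graph construction, the embedding dimension, and the cut-off fraction would all need tuning for a real news corpus.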