Dirichlet Mixture Allocation for Multiclass Document Collections Modeling

Wei Bian, Dacheng Tao
DOI: https://doi.org/10.1109/ICDM.2009.102
2009-01-01
Abstract: Latent Dirichlet allocation (LDA), a topic model, is an effective tool for statistical analysis of large collections of documents. In LDA, each document is modeled as a mixture of topics, and the topic proportions are generated from a unimodal Dirichlet prior. When a collection of documents is drawn from multiple classes, this unimodal prior is insufficient for fitting the data. To solve this problem, we exploit a multimodal Dirichlet mixture prior and propose Dirichlet mixture allocation (DMA). We report experiments on the popular TDT2 corpus demonstrating that DMA models a collection of documents more precisely than LDA when the documents are obtained from multiple classes.
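The abstract does not spell out the generative process, so the following is only a minimal sketch under a stated assumption: that DMA keeps LDA's usual steps (draw topic proportions, then a topic per word, then the word) and changes only the prior, replacing the single Dirichlet with a finite mixture of Dirichlets. All names, mixture weights, concentration parameters, and the toy topic-word table are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_mixture(weights, alphas, rng):
    """Pick a mixture component by its weight, then draw topic proportions from it."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return k, sample_dirichlet(alphas[k], rng)


def generate_document(weights, alphas, topic_word, n_words, rng):
    """Assumed process: theta ~ Dirichlet mixture, z ~ Cat(theta), w ~ Cat(topic_word[z])."""
    component, theta = sample_dirichlet_mixture(weights, alphas, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z], k=1)[0]
        words.append(w)
    return component, theta, words


# Hypothetical toy setup: two document classes, three topics, four vocabulary words.
rng = random.Random(0)
weights = [0.5, 0.5]                          # mixture weights, one per class
alphas = [[8.0, 1.0, 1.0], [1.0, 1.0, 8.0]]   # each component favors a different topic
topic_word = [
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.1, 0.1, 0.7],
]
component, theta, words = generate_document(weights, alphas, topic_word, 20, rng)
```

With a single component (one weight equal to 1), the sketch reduces to the standard LDA prior. With several components, the topic proportions can concentrate around more than one mode, which is the property the abstract argues a multiclass collection needs.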