Heterogeneous Latent Topic Discovery for Semantic Text Mining
Yawen Li, Di Jiang, Rongzhong Lian, Xueyang Wu, Conghui Tan, Yi Xu, Zhiyang Su
DOI: https://doi.org/10.1109/tkde.2021.3077025
IF: 9.235
2021-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Word embedding and topic modeling are the two major methodologies used in industry to mine latent semantics from text data. From a pragmatic perspective, each of these two lines of semantic models faces increasing challenges from real-life applications. Topic modeling views documents as bags of words and is unable to capture the sequential relationship between words. Word embedding, on the other hand, models the co-occurrence of neighboring words but lacks a global view of the document. Each can therefore discover only homogeneous semantics from a single aspect. However, modern text mining tasks typically require a panoramic view of the latent semantics. Hence, discovering heterogeneous semantics (e.g., heterogeneous types of latent topics) is critical for the performance of these tasks, and a model that meets this demand is needed. Furthermore, with the arrival of the big data era and the increasing awareness of data privacy, it is necessary to study how to mine heterogeneous semantics efficiently without compromising data privacy. In this work, we develop a novel method called Heterogeneous Latent Topic Discovery (HLTD), which seamlessly integrates topic modeling with word embedding to discover heterogeneous latent topics. By coupling a parameter-server architecture with new private sampling algorithms, HLTD can be trained efficiently while protecting the privacy of the underlying data. We evaluate HLTD with a wide range of qualitative and quantitative metrics used in industry. Extensive experiments demonstrate the superiority of HLTD over state-of-the-art methods.
computer science, information systems, artificial intelligence, engineering, electrical & electronic