Analysis and Research on Cross Language Topic Discovery in Chinese and English

Xingshu CHEN, Liang LUO, Haizhou WANG, Wenxian WANG, Yue GAO
DOI: https://doi.org/10.15961/j.jsuese.201601032
2017-01-01
Abstract: With the rapid development of the Internet under globalization, mining cross-language text from network data has become one of the most active research areas in public opinion analysis. Detecting hot topics in Chinese and English texts effectively and in a timely manner plays a crucial role in tracking how public opinion develops. Internet news, an important part of online public opinion, has become a significant source of information for netizens. First, Internet news in Chinese and English was collected. Second, the ICE-LDA model, based on the LDA model, was proposed to detect co-occurrence topics in the mixed dataset. The JS distance and cosine similarity of the topic-text distribution were then used to calculate the distance between two topics in the ICE-LDA model. Third, a contrastive parallel corpus and a non-colligative corpus were each constructed from the mixed Chinese and English news data. During model building, the TF-IDF algorithm was used to remove noise words from the text. Finally, two kinds of topic vectors were used to detect the co-occurrence topics. The experimental results showed that the proposed improved topic model can detect topics not only in the comparison corpus dataset but also in the non-comparison corpus dataset.
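The two topic-distance measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the example vectors below are hypothetical, and the abstract does not say how ICE-LDA combines the two scores. The sketch computes the Jensen-Shannon divergence (its square root is the quantity usually called the JS distance) and the cosine similarity between two topic vectors, each assumed to be a probability distribution over the same set of documents.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln(2) with natural logs."""
    # m is the midpoint distribution; m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical topic vectors: weights of one Chinese topic and one English
# topic over the same four documents (illustrative values only).
topic_zh = [0.5, 0.3, 0.2, 0.0]
topic_en = [0.4, 0.4, 0.1, 0.1]

print("JS divergence:", js_divergence(topic_zh, topic_en))
print("JS distance:", math.sqrt(js_divergence(topic_zh, topic_en)))
print("Cosine similarity:", cosine_similarity(topic_zh, topic_en))
```

A small JS value together with a cosine similarity close to 1 suggests that the two topics are spread over the documents in a similar way, which is the kind of signal a co-occurrence topic detector could use.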