Abstract: In natural language processing (NLP), text similarity calculation is widely used in information retrieval, machine translation, text mining, and related tasks. Similarity between texts need not be limited to words with similar meanings. Domain similarity, which evaluates similarity with respect to a reference domain, is a promising approach for dealing with large documents. With domain similarity calculation, the degree of similarity can be controlled at different semantic levels, so that texts can be extracted at different domain granularities. For example, web pages about the Lakers, the NBA, basketball, and sports can each be retrieved under different domain similarity settings. LSI (Latent Semantic Indexing) is a feasible approach to calculating text domain similarity: by controlling the number of topics, domain similarity can be determined at different granularities. However, performance is strongly affected by the number of topics, which must be specified for the LSI algorithm. In this paper, an adaptive method is applied to word similarity calculation. TF-IDF is used to obtain the word frequencies in the text, and the number of topics in the mixed text set is obtained automatically through dimensionality reduction and clustering. The number of clusters is then used as the number of topics mapped to the LSI subspace, in which the similarity between text domains is calculated. Experimental results show that the proposed method is superior to other algorithms in the accuracy of text similarity calculation.
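The pipeline named in the abstract (TF-IDF weighting, projection into an LSI subspace with a given number of topics, then similarity in that subspace) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names `tfidf_rows`, `lsi_coordinates`, and `cosine`, the toy corpus, and the smoothed IDF variant are all assumptions made for illustration. The number of topics `k` is passed by hand here, whereas the abstract obtains it from clustering, and power iteration with deflation stands in for a proper SVD routine.

```python
import math
import random
from collections import Counter


def tfidf_rows(docs):
    """Return one TF-IDF row per document (smoothed IDF, one common variant)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[w] / len(toks) * idf[w] for w in vocab])
    return rows


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lsi_coordinates(rows, k, iters=300, seed=0):
    """Project documents onto the top-k latent dimensions.

    Works on the document Gram matrix G = A A^T, whose eigenvectors are the
    left singular vectors of the document-term matrix A. Power iteration
    with deflation is used only to keep this sketch free of dependencies.
    """
    n = len(rows)
    gram = [[dot(a, b) for b in rows] for a in rows]
    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [dot(g, v) for g in gram]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # remaining matrix is numerically zero
                break
            v = [x / norm for x in w]
        lam = dot(v, [dot(g, v) for g in gram])  # eigenvalue = singular value squared
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate: remove the component just found before the next pass.
        gram = [[gram[i][j] - lam * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return coords


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


# Toy corpus (invented): two basketball texts, two baking texts.
docs = [
    "lakers win nba basketball game",
    "nba basketball playoffs lakers celtics",
    "bake bread with flour and yeast",
    "flour yeast and butter for bread",
]
rows = tfidf_rows(docs)
coords = lsi_coordinates(rows, k=2)  # k chosen by hand in this sketch

print("raw TF-IDF, doc 0 vs doc 1:", round(cosine(rows[0], rows[1]), 3))
print("LSI (k=2),  doc 0 vs doc 1:", round(cosine(coords[0], coords[1]), 3))
print("LSI (k=2),  doc 0 vs doc 2:", round(cosine(coords[0], coords[2]), 3))
```

With a small `k`, texts from the same domain are pulled together even when they share only some of their words, which is the granularity effect the abstract describes. A larger `k` would keep finer distinctions, which is why the choice of `k` matters.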

Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence

An adaptive method for text domain similarity calculation

Document Clustering Based on Probabilistic Topic Model

Topic Discovery Based on LDA_col Model and Topic Significance Re-ranking

Text Similarity Computing Based on Attribute Center of Gravity Model and LSA

Short text classification based on LDA topic model

Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm

Text Clustering Using Enhanced PLSA with Word Correlation

A LDA Model Based Topic Detection Method

An Improved Latent Dirichlet Allocation Method For Service Topic Detection

Text Similarity Measurement of Semantic Cognition Based on Word Vector Distance Decentralization with Clustering Analysis

Short Text Similarity Based on Probabilistic Topics

Latent Dirichlet Allocation - An approach for topic discovery

A Novel Linguistic Phenomenon Description for Text Similarity Computing

Topic-weak-correlated Latent Dirichlet Allocation

Performance evaluation of Latent Dirichlet Allocation in text mining

UT-LDA Based Similarity Computing in Microblog

The Similarity Measure Based on LDA for Automatic Summarization

Clust-LDA: Joint Model for Text Mining and Author Group Inference

News Topic Discovery Through Community Detection

Web Text Classification based on LDA Model