Abstract: In natural language processing (NLP), text similarity calculation is widely used in information retrieval, machine translation, text mining, and related tasks. Similarity between texts need not be limited to words with similar meanings. Domain similarity, which evaluates similarity with reference to the domain a text belongs to, is a promising approach for handling large documents. With domain similarity calculation, the degree of similarity can be controlled at different semantic levels, so that texts can be extracted at different domain granularities. For example, web pages about the Lakers, the NBA, basketball, and sports can each be retrieved under different domain similarity settings. LSI (Latent Semantic Indexing) is a feasible approach to calculating text domain similarity: by controlling the number of topics, domain similarity can be determined at different granularities. However, LSI requires the number of topics to be specified in advance, and its performance is strongly affected by that choice. In this paper, an adaptive method is applied to word similarity calculation. TF-IDF is used to weight word frequencies in the text, and the number of topics in the mixed text set is obtained automatically through dimensionality reduction and clustering. The resulting number of clusters is then used as the number of topics for the subspace mapping in LSI, in which the similarity between text domains is calculated. Experimental results show that the proposed method is superior to other algorithms in the accuracy of text similarity calculation.
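The pipeline described in the abstract (TF-IDF weighting, an automatically estimated topic count, then similarity in the LSI subspace) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not name the dimensionality reduction or clustering algorithms, so a 2-dimensional SVD projection and a thresholded connected-components count stand in for them here. The toy corpus, the function names, and the `reduced_dim` and `threshold` parameters are all hypothetical.

```python
import numpy as np


def tfidf_matrix(docs):
    """Return an L2-normalised TF-IDF matrix (documents x vocabulary)."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, words in enumerate(tokens):
        for w in words:
            tf[i, index[w]] += 1.0
    df = (tf > 0).sum(axis=0)              # document frequency per word
    idf = np.log(len(docs) / df) + 1.0     # smoothed so shared words keep weight
    x = tf * idf
    norms = np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    return x / norms


def cosine(a):
    """Pairwise cosine similarity between the rows of a."""
    norms = np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    unit = a / norms
    return unit @ unit.T


def estimate_topic_count(x, reduced_dim=2, threshold=0.5):
    """Assumed stand-in for 'dimensionality reduction and clustering':
    project with SVD, then count connected components of the graph that
    links documents whose cosine similarity is at least `threshold`."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    sim = cosine(u[:, :reduced_dim] * s[:reduced_dim])
    n = sim.shape[0]
    label = [-1] * n
    clusters = 0
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = clusters
        stack = [start]
        while stack:                        # depth-first flood fill
            i = stack.pop()
            for j in range(n):
                if label[j] == -1 and sim[i, j] >= threshold:
                    label[j] = clusters
                    stack.append(j)
        clusters += 1
    return clusters


def lsi_similarity(x, n_topics):
    """Cosine similarity between documents in a rank-`n_topics` LSI subspace."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return cosine(u[:, :n_topics] * s[:n_topics])


# Hypothetical mixed text set: three sports texts and three finance texts.
docs = [
    "lakers nba basketball game",
    "nba basketball playoffs game",
    "basketball lakers playoffs",
    "stock market shares price",
    "market price trading shares",
    "stock trading market investors",
]

x = tfidf_matrix(docs)
n_topics = estimate_topic_count(x)          # number of clusters -> number of topics
similarity = lsi_similarity(x, n_topics)
print("estimated topics:", n_topics)
print(np.round(similarity, 2))
```

The point of the sketch is only the flow of information: the cluster count found in the reduced space is passed to LSI as its topic count, so no topic number has to be chosen by hand. In practice the clustering step and its parameters would need to be chosen and validated for the corpus at hand.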

An adaptive method for text domain similarity calculation
