Abstract: In natural language processing (NLP), text similarity calculation is widely used in information retrieval, machine translation, text mining, and related tasks. Similarity between texts need not be limited to words with similar meanings. Domain similarity, which evaluates similarity with reference to the domain a text belongs to, is a promising approach for handling large documents. With domain similarity calculation, the degree of similarity can be controlled at different semantic levels, so that texts can be extracted at different domain granularities. For example, web pages about the Lakers, the NBA, basketball, or sports can each be retrieved by choosing a different domain similarity setting. Latent Semantic Indexing (LSI) is a feasible approach to calculating text domain similarity: by controlling the number of topics, domain similarity can be determined at different granularities. However, LSI requires the number of topics to be specified in advance, and its performance depends strongly on that number. In this paper, an adaptive method is applied to word similarity calculation. TF-IDF is used to weight word frequencies in the text, and the number of topics in the mixed text set is obtained automatically through dimensionality reduction and clustering. The number of clusters is then used as the number of topics for the subspace mapping in LSI, and the similarity between text domains is calculated in that subspace. Experimental results show that the proposed method is superior to other algorithms in the accuracy of text similarity calculation.
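The pipeline outlined in the abstract (TF-IDF weighting, a clustering step that fixes the number of topics, then LSI similarity in the resulting subspace) can be illustrated with a minimal sketch. This is not the paper's implementation. The abstract does not say which clustering algorithm is used, so the sketch assumes a simple stand-in: it counts connected components of a thresholded cosine-similarity graph. It also omits the dimensionality reduction that precedes clustering in the abstract. The function names, the `threshold` value, the smoothed IDF formula, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return a (documents x terms) TF-IDF matrix with L2-normalised rows."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    X = np.zeros((n, len(vocab)))
    for i, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            # Smoothed IDF (an assumption; the paper's exact variant is not given).
            idf = math.log((1 + n) / (1 + df[term])) + 1
            X[i, index[term]] = (count / len(doc)) * idf
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def estimate_topic_count(X, threshold=0.1):
    """Stand-in clustering: count connected components of the graph that
    links two documents when their cosine similarity exceeds `threshold`."""
    sim = X @ X.T
    n = X.shape[0]
    labels = [-1] * n
    clusters = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] > threshold:
                    labels[j] = clusters
                    stack.append(j)
        clusters += 1
    return clusters


def lsi_similarity(X, n_topics):
    """Project documents onto the top `n_topics` singular directions (LSI)
    and return the matrix of pairwise cosine similarities in that subspace."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :n_topics] * s[:n_topics]
    norms = np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    Z = Z / norms
    return Z @ Z.T


# Hypothetical toy corpus with two domains (basketball, baking).
docs = [
    "lakers win nba basketball game",
    "nba basketball playoffs game tonight",
    "basketball players nba season game",
    "bake bread flour oven recipe",
    "recipe flour sugar oven cake",
    "oven recipe flour butter pastry",
]

X = tfidf_matrix(docs)
n_topics = estimate_topic_count(X)   # chosen from the data, not by hand
S = lsi_similarity(X, n_topics)
print("estimated topics:", n_topics)
print("same-domain similarity:", round(float(S[0, 1]), 3))
print("cross-domain similarity:", round(float(S[0, 3]), 3))
```

The point of the sketch is that `n_topics` is derived from the corpus rather than supplied by hand, which is the adaptive element the abstract describes. A finer or coarser domain granularity would correspond here to changing `threshold`, again under the stated assumptions.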

An improvement to TF-IDF: Term Distribution based Term Weight Algorithm

An adaptive method for text domain similarity calculation

Using modified term frequency to improve term weighting for text classification

An improved supervised term weighting scheme for text representation and classification

Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization

A Novel Term Weighting Scheme for Automated Text Categorization

Several alternative term weighting methods for text representation and classification

Modified DFS-based term weighting scheme for text classification

A study of supervised term weighting scheme for sentiment analysis

A generic multi-level framework for building term-weighting schemes in text classification

A Study of the Application of Weight Distributing Method Combining Sentiment Dictionary and TF-IDF for Text Sentiment Analysis

Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method

TF-IDFC-RF: A Novel Supervised Term Weighting Scheme

Beyond Tf-Idf And Cosine Distance In Documents Dissimilarity Measure

An Improved TF-IDF Approach for Text Classification

Balancing between over-weighting and under-weighting in supervised term weighting

Research on dynamic self-adaptive term weighting for multi-class text classification algorithm

Reducing Over-Weighting in Supervised Term Weighting for Sentiment Analysis

Online Hot Topic Discovery and Hotness Evaluation

Supervised Term Weighting Metrics for Sentiment Analysis in Short Text

A Comparative Study on Feature Weight in Text Categorization