Abstract: In natural language processing (NLP), text similarity calculation is widely used in information retrieval, machine translation, text mining, and related tasks. Similarity between texts need not refer only to words with similar meanings. Domain similarity, which evaluates similarity on the basis of a domain reference, is becoming a promising approach for dealing with large documents. With domain similarity calculation, the degree of similarity can be controlled at different semantic levels, so that texts can be extracted at different domain granularities. For example, web pages about the Lakers, the NBA, basketball, and sports can each be retrieved by using different settings of domain similarity. LSI (Latent Semantic Indexing) is a feasible approach to calculating text domain similarity: by controlling the number of topics, the domain similarity can be determined at different granularities. However, the LSI algorithm requires the number of topics to be specified in advance, and its performance is greatly affected by that choice. In this paper, an adaptive method is applied to similarity calculation. TF-IDF is used to weight word frequencies in the text, and the number of topics in the mixed text set is obtained automatically through dimensionality reduction and clustering. The number of clusters is then used as the number of topics for the LSI subspace mapping, in which the similarity between text domains is calculated. Experimental results show that the proposed method is more accurate than the compared algorithms in text similarity calculation.
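
As an illustration of the LSI step described in the abstract, the following is a minimal sketch, not the paper's implementation. It builds a TF-IDF matrix, projects the documents onto the top k singular directions, and compares them by cosine similarity. The sample documents, the function names, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example. The adaptive selection of k by dimensionality reduction and clustering is not reproduced here; k is simply passed in by hand to show how the number of topics changes the similarity values.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one TF-IDF row per document."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    # document frequency: in how many documents each word occurs
    df = Counter(w for toks in tokenized for w in set(toks))
    X = np.zeros((n, len(vocab)))
    for row, toks in enumerate(tokenized):
        for w, count in Counter(toks).items():
            # smoothed IDF (an assumed variant) keeps every weight positive
            idf = math.log((1 + n) / (1 + df[w])) + 1.0
            X[row, index[w]] = (count / len(toks)) * idf
    return vocab, X


def lsi_project(X, k):
    """Map documents into a k-dimensional latent space via truncated SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


def cosine_similarity_matrix(Z):
    """Pairwise cosine similarity between the rows of Z."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Zn = Z / norms
    return Zn @ Zn.T


# Hypothetical toy corpus, chosen only to give overlapping vocabulary.
docs = [
    "lakers nba basketball game",
    "nba basketball playoffs sports",
    "football sports game news",
    "stock market news prices",
]
_, X = tfidf_matrix(docs)
for k in (1, 2, 3):
    sims = cosine_similarity_matrix(lsi_project(X, k))
    print(k, np.round(sims[0], 2))  # similarity of document 0 to all documents
```

A small k collapses documents into a few coarse topics, while a larger k separates them more finely. This is why the choice of k matters for the granularity of domain similarity.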

An adaptive method for text domain similarity calculation
