A Novel Document Distance Based On Concept Vector Space

Lin Li, Hui Li
DOI: https://doi.org/10.1109/icct.2017.8359982
2017-01-01
Abstract: This paper proposes a novel metric for measuring the distance between documents. Building on recent results in word embeddings, which represent semantic information about words as real-valued vectors, we model a document as a concept vector space, where the concepts are a series of keywords extracted from the text by dependency parsing and linguistic knowledge. A new document distance is defined on the concept vector space to measure the relatedness or similarity between two documents, and it can be used in many natural language processing (NLP) tasks such as document classification and news clustering. The proposed metric has no hyperparameters to tune and is easy to compute. We further demonstrate it on several real-world document classification datasets using the k-nearest neighbor (kNN) algorithm. The experimental results show that the new document distance leads to a marked improvement in document classification quality.
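To make the pipeline in the abstract concrete, here is a minimal sketch of an embedding-based document distance plugged into a kNN classifier. The abstract does not give the actual definition of the proposed distance, so the sketch swaps in a stand-in: a symmetric average of nearest-concept cosine distances. The toy two-dimensional embeddings, the keyword lists (standing in for concepts extracted by dependency parsing), and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Toy 2-d "embeddings" (assumption); a real system would load pretrained word vectors.
TOY_EMBEDDINGS = {
    "goal": (1.0, 0.1),
    "match": (0.9, 0.2),
    "striker": (0.95, 0.05),
    "vote": (0.1, 1.0),
    "election": (0.2, 0.9),
    "senate": (0.05, 0.95),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def concept_vectors(keywords, embeddings=TOY_EMBEDDINGS):
    """Map a document's keywords (its 'concepts') to their embedding vectors."""
    return [embeddings[w] for w in keywords if w in embeddings]


def directed_distance(src, dst):
    """Average distance from each source concept to its nearest target concept."""
    return sum(min(cosine_distance(u, v) for v in dst) for u in src) / len(src)


def document_distance(doc_a, doc_b):
    """Stand-in distance (NOT the paper's definition): symmetric nearest-concept average."""
    a, b = concept_vectors(doc_a), concept_vectors(doc_b)
    if not a or not b:
        raise ValueError("each document needs at least one known concept")
    return 0.5 * (directed_distance(a, b) + directed_distance(b, a))


def knn_classify(query, labelled_docs, k=3):
    """Label a query document by majority vote among its k nearest neighbours."""
    ranked = sorted(labelled_docs, key=lambda item: document_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (keyword list, label) pairs.
TRAINING = [
    (["goal", "match"], "sports"),
    (["striker", "goal"], "sports"),
    (["vote", "election"], "politics"),
    (["senate", "vote"], "politics"),
]

print(knn_classify(["striker", "match"], TRAINING, k=3))
```

Like the metric described in the abstract, this stand-in distance has no tunable parameters of its own; the only setting in the sketch is k, which belongs to the kNN classifier rather than to the distance.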