Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure

Sunil Aryal, Kai Ming Ting, Gholamreza Haffari, Takashi Washio
DOI: https://doi.org/10.1007/978-3-319-28940-3_33
2015-01-01
Abstract: In the vector space model, different types of term weighting schemes are used to adjust bag-of-words document vectors in order to improve the performance of the most widely used cosine distance. Even though the cosine distance with some term weighting schemes results in a more reliable (dis)similarity measure in some data sets, it may not perform well in others because of the underlying assumptions of the term weighting schemes. In this paper, we argue that the explicit adjustment of bag-of-words document vectors using term weighting is not required if a data-dependent dissimilarity measure called mp-dissimilarity is used. Our empirical result in a document retrieval task reveals that mp with the simplest binary bag-of-words representation is either better than or competitive with the cosine distance with the best-performing state-of-the-art term weighting scheme in four widely used benchmark document collections.
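To make the contrast in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It builds tf-idf and binary bag-of-words vectors and computes the cosine distance. The tf-idf variant (raw count times log(N/df)) is one common choice and is an assumption. The abstract does not define mp-dissimilarity; the function `mp_dissimilarity` below assumes a generic probability-mass form, ((1/d) * sum_i (|R_i(x, y)| / N)^p)^(1/p), where |R_i(x, y)| counts the data points whose i-th value lies between x_i and y_i. The exact formulation the paper uses for documents may differ, and all function names here are hypothetical.

```python
import math
from collections import Counter


def _term_counts(docs):
    """Tokenize on whitespace and return per-document counts plus a sorted vocabulary."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted({term for c in counts for term in c})
    return counts, vocab


def tf_idf_vectors(docs):
    """Assumed tf-idf variant: raw term count times log(N / document frequency)."""
    counts, vocab = _term_counts(docs)
    n = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    return [[c[t] * math.log(n / df[t]) for t in vocab] for c in counts]


def binary_vectors(docs):
    """Binary bag-of-words: 1 if the term occurs in the document, else 0."""
    counts, vocab = _term_counts(docs)
    return [[1 if t in c else 0 for t in vocab] for c in counts]


def cosine_distance(x, y):
    """1 minus cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


def mp_dissimilarity(x, y, data, p=2.0):
    """Assumed data-dependent form (not taken from the abstract).

    For each dimension i, count the data points whose i-th value falls in
    [min(x_i, y_i), max(x_i, y_i)], divide by the data size, raise to p,
    average over dimensions, and take the p-th root.
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        mass = sum(1 for z in data if lo <= z[i] <= hi)
        total += (mass / n) ** p
    return (total / d) ** (1.0 / p)
```

Under this assumed form, two binary vectors that share a rare term are closer than two that share a common term, because fewer documents fall in the shared region. That is the sense in which such a measure can account for term rarity without an explicit weighting step.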