Abstract: Keyphrase extraction is a text processing task that automatically extracts the relevant and salient keyphrases expressing the important concepts of a document. With technological advancement, the amount of textual information on the Internet is increasing rapidly, because large volumes of text are processed online in domains such as offices, news portals, and research. Given the exponential increase of news articles on the Internet, manually finding similar news articles that match a user's interests by reading the entire content of each one has become time-consuming and tedious. Automatically finding similar news articles is therefore a significant task in text processing. Keyphrase extraction algorithms can extract the needed information from news articles, but selecting the most appropriate algorithm is itself a problem. This study therefore analyzes several supervised and unsupervised keyphrase extraction algorithms, namely KEA, KP-Miner, YAKE, MultipartiteRank, TopicRank, and TeKET, by using them to extract keyphrases from news articles. The extracted keyphrases are then used to compute lexical and semantic similarity in order to find similar news articles. Lexical similarity is calculated using the Cosine and Jaccard similarity measures. Semantic similarity is calculated using the Word2Vec word embedding technique in combination with the Cosine similarity measure. The experimental results show that the KP-Miner keyphrase extraction algorithm, combined with Cosine similarity computed over Word2Vec embeddings (Cosine-Word2Vec), outperforms the other combinations of keyphrase extraction algorithm and similarity measure for finding similar news articles. The similar articles identified using KP-Miner and Cosine-Word2Vec appear to be relevant to the given news article, and the combination shows satisfactory performance with a Normalized Discounted Cumulative Gain (NDCG) value of 0.97. This study proposes a method for finding similar news articles that can be used in conjunction with other methods already in use.
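The similarity measures and the evaluation metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the word-level tokenization of keyphrases, the averaging of word vectors into a single vector per article, and the plain dictionary standing in for a trained Word2Vec model are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokens(keyphrases):
    """Lowercase keyphrases and split them into word tokens (assumed tokenization)."""
    return [w for kp in keyphrases for w in kp.lower().split()]


def jaccard(a, b):
    """Jaccard similarity between the token sets of two keyphrase lists."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def _cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def lexical_cosine(a, b):
    """Cosine similarity over term-frequency vectors of the keyphrase tokens."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    vocab = sorted(set(ca) | set(cb))
    return _cosine([ca[w] for w in vocab], [cb[w] for w in vocab])


def embedding_cosine(a, b, vectors):
    """Cosine similarity between averaged word vectors.

    `vectors` is a hypothetical word -> list-of-floats mapping standing in
    for a trained Word2Vec model; averaging is an assumed aggregation.
    """
    def mean_vector(keyphrases):
        found = [vectors[w] for w in tokens(keyphrases) if w in vectors]
        if not found:
            return None
        return [sum(col) / len(found) for col in zip(*found)]

    va, vb = mean_vector(a), mean_vector(b)
    if va is None or vb is None:
        return 0.0
    return _cosine(va, vb)


def ndcg(relevances):
    """NDCG of a ranked list of graded relevance scores (log2 discount)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    article_a = ["stock market", "interest rate"]
    article_b = ["stock market", "oil price"]
    toy_vectors = {"stock": [1.0, 0.0], "market": [1.0, 0.0], "oil": [0.0, 1.0]}
    print(jaccard(article_a, article_b))
    print(lexical_cosine(article_a, article_b))
    print(embedding_cosine(["stock market"], ["oil"], toy_vectors))
    print(ndcg([3, 2, 1]))
```

In a setting like the one the abstract describes, each candidate article would be scored against the query article with one of these measures, the candidates ranked by score, and the ranking judged with NDCG against human relevance grades. A real Word2Vec model would replace the toy dictionary.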

Distributed Feature Sets for Document Specific Key-Phrase Extraction

Experiment Research on Feature Selection and Learning Method in Keyphrase Extraction

Automatic Keywords Extraction Based on Co-Occurrence and Semantic Relationships Between Words

FRAKE: Fusional Real-time Automatic Keyword Extraction

Keyphrases automatic extraction from the abstracts of English scientific papers based on Scopus retrieval

KeyphraseDS: Automatic Generation of Survey by Exploiting Keyphrase Information

Bert-Based Text Keyword Extraction

Complex Network based Supervised Keyword Extractor

Single Document Keyphrase Extraction Using Neighborhood Knowledge

YAKE! Keyword extraction from single documents using multiple local features

Integrating Semantic Relatedness and Words' Intrinsic Features for Keyword Extraction

LongKey: Keyphrase Extraction for Long Documents

Exploratory Analysis of Highly Heterogeneous Document Collections

Keyword Extraction in Scientific Documents

Hidden features identification for designing an efficient research article recommendation system

Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding

Domain-Specific Keyword Extraction Using Joint Modeling of Local and Global Contextual Semantics

Enhancing keyphrase extraction from academic articles with their reference information

An efficient domain-independent approach for supervised keyphrase extraction and ranking

Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques

Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings