Abstract: In this paper, we present a supervised framework for automatic keyword extraction from a single document. We model the text as a complex network and construct the feature set by extracting selected node properties from it. Unsupervised, graph-based keyword extraction methods have exploited several node properties to discriminate keywords from non-keywords. We exploit the complex interplay of these node properties to design a supervised keyword extraction method. The training set is created from the feature set by labeling each candidate keyword according to whether it is listed as a gold-standard keyword. Since a document contains far fewer keywords than non-keywords, the curated training set is naturally imbalanced. We train a binary classifier to predict keywords after balancing the training set. The model is trained using two public datasets from the scientific domain and tested on three unseen scientific corpora and one news corpus. A comparative study against several recent keyword and keyphrase extraction methods establishes that the proposed method performs better in most cases. This substantiates our claim that graph-theoretic properties of words are effective discriminators between keywords and non-keywords. We support our argument by showing that the improved performance of the proposed method is statistically significant for all datasets. We also evaluate the effectiveness of the pre-trained model on Hindi and Assamese documents. We observe that the model performs equally well on cross-language text even though it was trained only on English documents. This shows that the proposed method is independent of the domain, collection, and language of the training corpora.
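The pipeline described in the abstract (text as a network, node properties as features, gold-standard labels, balancing) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, since the abstract does not name them: the adjacent-word co-occurrence window, the two node properties (degree and local clustering coefficient), random oversampling as the balancing step, and all function names. The classifier itself is left out.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Undirected word graph; an edge links words fewer than `window` positions apart.

    The window size is an illustrative assumption, not taken from the abstract.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        graph[word]  # touch the key so isolated words still become nodes
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def clustering(graph, node):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if nbrs[b] in graph[nbrs[a]]
    )
    return 2.0 * links / (k * (k - 1))


def node_features(graph):
    """Map each word to (degree, clustering); both properties are assumed choices."""
    return {w: (len(graph[w]), clustering(graph, w)) for w in graph}


def label_candidates(features, gold_keywords):
    """Attach label 1 if the candidate is a gold-standard keyword, else 0."""
    return [(w, f, int(w in gold_keywords)) for w, f in sorted(features.items())]


def oversample(rows, seed=0):
    """Balance classes by randomly duplicating minority rows (assumed technique)."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[2] == 1]
    neg = [r for r in rows if r[2] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


if __name__ == "__main__":
    tokens = "graph keyword extraction graph network keyword network node".split()
    graph = build_cooccurrence_graph(tokens)
    rows = label_candidates(node_features(graph), {"graph", "keyword"})
    for row in oversample(rows):
        print(row)
```

The balanced rows could then be passed to any binary classifier; which classifier and which node properties the authors actually use is not stated in the abstract.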
