Abstract: Domain-specific keyword extraction is a vital task in text mining. A set of domain-specific keywords (i.e., a lexicon) is highly effective in many research tasks, such as spam e-mail classification, abusive language detection, sentiment analysis, and emotion mining. Existing keyword extraction approaches list all keywords in a document corpus rather than only the domain-specific ones. Moreover, most of them perform well on formal document corpora but fail on the noisy, informal user-generated content of online social media. In this article, we present a hybrid approach that jointly models the local and global contextual semantics of words, utilizing the strength of distributional word representations and a contrasting-domain corpus for domain-specific keyword extraction. Starting with a seed set of a few domain-specific keywords, we model the text corpus as a weighted word-graph. In this graph, the initial weight of a node (word) represents its semantic association with the target domain, calculated as a linear combination of three semantic association metrics, and the weight of an edge connecting a pair of nodes represents the co-occurrence count of the respective words. Thereafter, a modified PageRank method is applied to the word-graph to identify the most relevant words for expanding the initial set of domain-specific keywords. We evaluate our method on both formal and informal text corpora (comprising six datasets) and show that it performs significantly better than state-of-the-art methods. Furthermore, we generalize our approach to the language-agnostic case and show that it outperforms existing language-agnostic approaches.
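
The graph-based ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the three semantic association metrics or say how PageRank is modified, so the sketch assumes a standard personalized PageRank in which the initial node weights act as the teleport distribution. The function names (`initial_weight`, `build_cooccurrence_graph`, `weighted_pagerank`), the coefficients, the window size, and the toy corpus are all hypothetical.

```python
from collections import defaultdict


def initial_weight(metric_scores, coeffs):
    """Linear combination of three association scores (placeholder metrics)."""
    return sum(c * s for c, s in zip(coeffs, metric_scores))


def build_cooccurrence_graph(docs, window=2):
    """Undirected graph; edge weight = co-occurrence count within `window` tokens."""
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    graph[w][v] += 1.0
                    graph[v][w] += 1.0
    return graph


def weighted_pagerank(graph, prior, damping=0.85, iters=50):
    """Personalized PageRank: `prior` (initial node weights) is the teleport vector.

    This is an assumed stand-in for the paper's "modified PageRank".
    """
    nodes = list(graph)
    total = sum(prior.get(n, 0.0) for n in nodes)
    if total > 0:
        p = {n: prior.get(n, 0.0) / total for n in nodes}
    else:  # no prior information: fall back to a uniform distribution
        p = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    score = dict(p)
    for _ in range(iters):
        new_score = {}
        for n in nodes:
            # The graph is symmetric, so the neighbours of n are its in-neighbours.
            incoming = sum(score[m] * graph[m][n] / out_weight[m] for m in graph[n])
            new_score[n] = (1 - damping) * p[n] + damping * incoming
        score = new_score
    return score


# Toy corpus: three spam-like messages and one unrelated message (made-up data).
docs = [
    ["free", "prize", "winner"],
    ["free", "prize", "click"],
    ["meeting", "agenda", "notes"],
    ["prize", "winner", "click"],
]
graph = build_cooccurrence_graph(docs)

# Seed keyword "prize" with hypothetical metric scores and coefficients.
prior = {"prize": initial_weight((1.0, 1.0, 1.0), (0.5, 0.3, 0.2))}
scores = weighted_pagerank(graph, prior)

# Words ranked by relevance to the seed; top non-seed words are expansion candidates.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sketch, words that co-occur with the seed accumulate score, while words in a disconnected part of the graph receive none, which is the intuition behind using the ranking to expand the seed set.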

Domain-Specific Keyword Extraction Using Joint Modeling of Local and Global Contextual Semantics
