Abstract: Sentence representation approaches have been widely used and have proven effective in many text modeling tasks and downstream applications. Many recent proposals learn sentence representations with deep neural frameworks. However, these methods are pre-trained in open domains and depend on the availability of large-scale data for model fitting. As a result, they may fail in special scenarios where data are sparse and interpretable embeddings are required, such as the legal, medical, or technical fields. In this paper, we present an unsupervised method that learns sentence representations for such closed domains via topic modeling. We reformulate the inference process of a sentence in terms of its contextual sentences and associated words, and propose an effective context-enhanced process called bi-Directional Context-enhanced Sentence Representation Learning (bi-DCSR). This method takes advantage of the semantic distributions of the nearby contextual sentences and the associated words to form a context-enhanced sentence representation. To support bi-DCSR, we develop a novel Bayesian topic model, the Hybrid Priors Topic Model (HPTM), which embeds sentences and words into the same interpretable latent topic space. Based on the topic space defined by the HPTM, bi-DCSR learns the embedding of a sentence from its contextual sentences in both directions and the words it contains, which allows us to efficiently learn high-quality sentence representations in such closed domains. In addition to an open-domain dataset from Wikipedia, our method is validated on three closed-domain datasets from legal cases, electronic medical records, and technical reports. Our experiments indicate that the HPTM significantly outperforms existing topic models on language modeling and topic coherence. Meanwhile, bi-DCSR not only outperforms the state-of-the-art unsupervised learning methods on closed-domain sentence classification tasks, but also yields competitive performance compared to these established approaches on the open domain. Additionally, visualizations of the semantics of sentences and words demonstrate the interpretability of our model.

Domain Independent Key Term Extraction from Spoken Content Based on Context and Term Location Information in the Utterances

A Novel Topic Model for Automatic Term Extraction

Cross-Domain Keyword Extraction with Keyness Patterns

Exploiting Topic-Based Adversarial Neural Network for Cross-Domain Keyphrase Extraction

A context-enhanced sentence representation learning method for close domains with topic modeling

An Exploration Of Semantic Relations In Neural Word Embeddings Using Extrinsic Knowledge

Capturing Global Informativeness in Open Domain Keyphrase Extraction

A Survey of Term Recognition and Extraction for Domain-specific Chinese Text Information Processing

Learning Knowledge-Enhanced Contextual Language Representations for Domain Natural Language Understanding

Neural Adaptation Layers for Cross-domain Named Entity Recognition

Parsing-based Automatic Chinese Term Extraction

Cross-domain Co-Extraction of Sentiment and Topic Lexicons

Can cross-domain term extraction benefit from cross-lingual transfer and nested term labeling?

An Instance Transfer based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition

Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction

An efficient domain-independent approach for supervised keyphrase extraction and ranking

Domain-Specific Keyword Extraction Using Joint Modeling of Local and Global Contextual Semantics

Automatic Extraction of Domain-Specific Terms

What's in a Domain? Learning Domain-Robust Text Representations using Adversarial Training

Improved Spoken Term Detection by Discriminative Training of Acoustic Models Based on User Relevance Feedback

Information Extraction in Illicit Web Domains