Abstract: Over the last few years, neural-network-derived word embeddings have become popular in the natural language processing literature. Studies have mostly focused on the quality and application of word embeddings trained on publicly available corpora such as Wikipedia or other news and social media sources. However, these studies are limited to generic text and thus miss technical and scientific nuances such as domain-specific vocabulary, abbreviations, and scientific formulas, which are common in academic contexts. This research focuses on the performance of word embeddings applied to a large-scale academic corpus. More specifically, we compare the quality and efficiency of trained word embeddings to TFIDF representations in modeling the content of scientific articles. We use a word2vec skip-gram model trained on the titles and abstracts of about 70 million scientific articles. Furthermore, we have developed a benchmark to evaluate content models in a scientific context. The benchmark is based on a categorization task that matches articles to journals for about 1.3 million articles published in 2017. Our results show that content models based on word embeddings perform better for titles (short text), while TFIDF works better for abstracts (longer text). However, the slight improvement of TFIDF for longer text comes at the expense of 3.7 times more memory as well as up to 184 times higher computation times, which may make it inefficient for online applications. In addition, we have created a 2-dimensional visualization of the journals modeled via embeddings to qualitatively inspect the embedding model. This graph offers useful insights and can be used to find competing journals or to identify gaps where new journals could be proposed.
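To make the comparison concrete, the following is a minimal sketch of the two content models on a toy article-to-journal matching task. It is not the authors' pipeline: the abstract does not specify how word vectors are combined into a document vector, so averaging is an assumption here, as is the use of cosine similarity. The two-dimensional `TOY_EMBEDDINGS` are hand-made stand-ins for a trained word2vec skip-gram model, and all function names, journal names, and texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one sparse {term: tf * log(N / df)} dict per input text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def mean_embedding(text, embeddings, dim):
    """Average the vectors of known words (assumption: simple averaging).

    The result is stored as {dimension index: value} so that dense and
    sparse vectors can share one cosine function. Unknown words are skipped.
    """
    known = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not known:
        return {}
    return {i: sum(vec[i] for vec in known) / len(known) for i in range(dim)}


def cosine(a, b):
    """Cosine similarity between two {key: weight} vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query_vec, journal_vecs):
    """Return the journal name whose vector is most similar to the query."""
    return max(journal_vecs, key=lambda name: cosine(query_vec, journal_vecs[name]))


# Hypothetical toy data; a real setup would load trained word2vec vectors.
TOY_EMBEDDINGS = {
    "protein": [1.0, 0.0], "gene": [0.9, 0.1],
    "galaxy": [0.0, 1.0], "star": [0.1, 0.9],
}
JOURNALS = {
    "biology": "protein gene expression",
    "astronomy": "galaxy star formation",
}
ARTICLE = "gene expression in cells"

# TFIDF model: weights are computed over the journal texts plus the article.
names = list(JOURNALS)
vectors = tfidf_vectors([JOURNALS[n] for n in names] + [ARTICLE])
tfidf_choice = best_match(vectors[-1], dict(zip(names, vectors)))

# Embedding model: each text becomes the mean of its known word vectors.
journal_embs = {n: mean_embedding(JOURNALS[n], TOY_EMBEDDINGS, 2) for n in names}
embedding_choice = best_match(mean_embedding(ARTICLE, TOY_EMBEDDINGS, 2), journal_embs)

print(tfidf_choice, embedding_choice)
```

The sketch also hints at the efficiency trade-off the abstract reports: a TFIDF vector has one entry per distinct term and grows with the vocabulary, whereas an averaged embedding has a fixed, small dimensionality regardless of text length.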

Evaluation method of word embedding by roots and affixes

Evaluating Word Embedding Models: Methods and Experimental Results

An Exploration Of Semantic Relations In Neural Word Embeddings Using Extrinsic Knowledge

Visual Exploration and Comparison of Word Embeddings

A Regression Approach to Valence-Arousal Ratings of Words from Word Embedding

Radical and Stroke-Enhanced Chinese Word Embeddings Based on Neural Networks

Evaluation of Word Embedding Via Domain Keywords

The Expressive Power of Word Embeddings

Inferring Affective Meanings of Words from Word Embedding

An Adaptive Wordpiece Language Model For Learning Chinese Word Embeddings

A novel model for semantic similarity measurement based on wordnet and word embedding

Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings

An Efficient Method Based on Region-adjacent Embedding for Text Classification of Chinese Electronic Medical Records

Radical Enhanced Chinese Word Embedding

A Fistful of Vectors: A Tool for Intrinsic Evaluation of Word Embeddings

Enriching Word Embeddings with Domain Knowledge for Readability Assessment

Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF

Extending Embedding Representation by Incorporating Latent Relations

How to Generate a Good Word Embedding?

Improving Word Embeddings by Emphasizing Co-hyponyms

Improving Word Embeddings for Antonym Detection Using Thesauri and SentiWordNet