Reverse Transfer Learning: Can Word Embeddings Trained for Different NLP Tasks Improve Neural Language Models?

Lyan Verwimp, Jerome R. Bellegarda
DOI: https://doi.org/10.48550/arXiv.1909.04130
2019-09-10
Abstract: Natural language processing (NLP) tasks tend to suffer from a paucity of suitably annotated training data, hence the recent success of transfer learning across a wide variety of them. The typical recipe involves: (i) training a deep, possibly bidirectional, neural network with an objective related to language modeling, for which training data is plentiful; and (ii) using the trained network to derive contextual representations that are far richer than standard linear word embeddings such as word2vec, and thus result in important gains. In this work, we wonder whether the opposite perspective is also true: can contextual representations trained for different NLP tasks improve language modeling itself? Since language models (LMs) are predominantly locally optimized, other NLP tasks may help them make better predictions based on the entire semantic fabric of a document. We test the performance of several types of pre-trained embeddings in neural LMs, and we investigate whether it is possible to make the LM more aware of global semantic information through embeddings pre-trained with a domain classification model. Initial experiments suggest that as long as the proper objective criterion is used during training, pre-trained embeddings are likely to be beneficial for neural language modeling.
Computation and Language
What problem does this paper attempt to address?
This paper asks whether contextual representations trained for other natural language processing (NLP) tasks can be used to improve the language model (LM) itself. Specifically, the authors explore transferring knowledge from other NLP tasks into the LM through pre-trained embeddings, with the aim of making it more aware of global semantic information and thereby improving its performance. Because language models are optimized mainly for local prediction, other NLP tasks may help the model make better predictions based on the overall semantic structure of the document.

The paper notes that although language models learn rich local context information when trained on large amounts of data, they may overlook global semantic information. For example, a content word such as "hurricane" may be impossible to predict accurately from the local context alone, because its appearance depends on the overall topic of the text. The authors therefore propose "reverse transfer learning": transferring knowledge from other NLP tasks to the language model to compensate for this weakness.

To test this hypothesis, the authors experiment with several types of pre-trained embeddings, including embeddings trained on local context (such as word2vec) and embeddings intended to capture global semantic information (such as embeddings trained with a domain classification model). The results show that pre-trained embeddings improve the language model more effectively when the pre-training task is closely related to the target task. In particular, embeddings trained with a bidirectional language model also significantly reduce perplexity on smaller data sets, indicating that appropriate pre-training can improve language model performance even when data is limited.
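
The two ingredients mentioned above, initializing a language model's embedding table from pre-trained vectors and scoring the model by perplexity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_embedding_matrix` and `perplexity`, the random fallback range for words without a pre-trained vector, and the toy vectors are all assumptions made for illustration.

```python
import math
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one embedding row per vocabulary word.

    Words found in `pretrained` reuse their pre-trained vector; all other
    words get small random values (an assumed fallback, not from the paper).
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            # No pre-trained vector available: fall back to random init.
            vector = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        elif len(vector) != dim:
            raise ValueError(f"vector for {word!r} has wrong dimension")
        matrix.append(list(vector))
    return matrix


def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Toy usage with made-up two-dimensional vectors.
vocab = ["the", "hurricane"]
pretrained = {"the": [0.1, 0.2]}  # hypothetical pre-trained embedding
embedding_matrix = build_embedding_matrix(vocab, pretrained, dim=2)

# Hypothetical per-token probabilities assigned by a language model.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

In a real neural LM the resulting matrix would be loaded into the model's embedding layer before training; lower perplexity on held-out text then indicates that the model assigns higher probability to the words that actually occur.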