Latent Universal Task-Specific BERT

Alon Rozental, Zohar Kelrich, Daniel Fleischer
DOI: https://doi.org/10.48550/arXiv.1905.06638
2019-05-16
Abstract: This paper describes a language representation model which combines the Bidirectional Encoder Representations from Transformers (BERT) learning mechanism described in Devlin et al. (2018) with a generalization of the Universal Transformer model described in Dehghani et al. (2018). We further improve this model by adding a latent variable that represents the persona and topics of interest of the writer for each training example. We also describe a simple method to improve the usefulness of our language representation for solving problems in a specific domain at the expense of its ability to generalize to other fields. Finally, we release a pre-trained language representation model for social texts that was trained on 100 million tweets.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper aims to improve language representations for task-specific natural language processing by combining the BERT model with the Universal Transformer model, and by introducing latent variables that capture the persona and topics of interest of each text's writer. Its main contributions and goals are:

1. **Extending the BERT model**: The paper adds two mechanisms to BERT:
   - The number of iterations is computed dynamically for each token, similar to the method of the Universal Transformer.
   - Latent variables representing different "types" of authors are introduced to increase the accuracy of missing-word prediction; these variables enter the model through the bias term of the last layer.
2. **Task-specific pre-training**: The paper proposes a simple method to specialize pre-training by assigning a category weight to each token in the vocabulary. For example, emojis receive larger weights while URLs and Twitter mentions receive smaller ones, which reduces noise and focuses training on the features that matter for the target task.
3. **Model performance improvement**: According to the reported experiments, the proposed model predicts missing words more accurately while using significantly fewer parameters, which improves efficiency.
4. **Application to social text data**: The paper focuses on short-text interactions on Twitter and releases a pre-trained language representation model, trained on 100 million tweets, for processing social texts.

Together, these changes aim to improve performance on specific tasks and domains while keeping the model computationally efficient. As the abstract notes, the domain specialization comes at the expense of the representation's ability to generalize to other fields.
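The category-weighting idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the weight values, the `token_category` rule, and the choice to normalize by the sum of weights are all assumptions made for illustration. It only shows how a per-category weight could scale each masked token's contribution to a masked-language-model loss.

```python
import math

# Hypothetical category weights; the values are illustrative, not from the paper.
CATEGORY_WEIGHTS = {"emoji": 2.0, "word": 1.0, "url": 0.1, "mention": 0.1}


def token_category(token: str) -> str:
    """Crude, assumed rule for assigning a tweet token to a category."""
    if token.startswith("http://") or token.startswith("https://"):
        return "url"
    if token.startswith("@"):
        return "mention"
    # Rough emoji check: any code point at or above U+1F300 (misses some emojis).
    if any(ord(ch) >= 0x1F300 for ch in token):
        return "emoji"
    return "word"


def weighted_masked_lm_loss(target_tokens, predicted_probs):
    """Weighted average negative log-likelihood over masked positions.

    target_tokens: the true tokens at the masked positions.
    predicted_probs: the probability the model assigned to each true token.
    """
    total, weight_sum = 0.0, 0.0
    for token, prob in zip(target_tokens, predicted_probs):
        weight = CATEGORY_WEIGHTS[token_category(token)]
        total += weight * -math.log(prob)
        weight_sum += weight
    # Normalizing by the weight sum is an assumed design choice.
    return total / weight_sum
```

With these assumed weights, a poorly predicted emoji raises the loss far more than a poorly predicted URL or mention, so gradient updates would concentrate on the tokens considered informative for the target task.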