Abstract: Pre-trained language models (PLMs) like BERT have made great progress in NLP. News articles usually contain rich textual information, and PLMs have the potential to enhance news text modeling for various intelligent news applications such as news recommendation and retrieval. However, most existing PLMs are very large, with hundreds of millions of parameters. Many online news applications need to serve millions of users with low latency tolerance, which poses a major challenge to incorporating PLMs in these scenarios. Knowledge distillation techniques can compress a large PLM into a much smaller one while maintaining good performance. However, existing language models are pre-trained and distilled on general corpora like Wikipedia, which differ from the news domain and may be suboptimal for news intelligence. In this paper, we propose NewsBERT, which can distill PLMs for efficient and effective news intelligence. In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models, where the student model can learn from the learning experience of the teacher model. In addition, we propose a momentum distillation method that incorporates the gradients of the teacher model into the update of the student model to better transfer useful knowledge learned by the teacher model. Extensive experiments on two real-world datasets with three tasks show that NewsBERT can effectively improve model performance in various intelligent news applications with much smaller models.
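
For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the standard soft-label distillation loss, in which a student is trained on a weighted sum of the usual hard-label cross-entropy and a temperature-softened KL divergence to the teacher's predictions. It is not the NewsBERT joint learning or momentum distillation procedure, whose details the abstract does not give. The function names and the `temperature` and `alpha` parameters are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-label distillation loss (illustrative, not NewsBERT's).

    alpha weights the hard-label term; (1 - alpha) weights the soft term.
    """
    # Hard-label term: cross-entropy of the student against the gold label.
    p_student = softmax(student_logits)
    hard = -math.log(p_student[label])

    # Soft-label term: KL(teacher || student) at the given temperature.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(q_teacher, q_student) if t > 0)

    # The T^2 factor keeps the soft term's gradient scale comparable
    # to the hard term's as the temperature changes.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

When the student's logits match the teacher's, the soft term vanishes and only the hard-label term remains, so the teacher contributes a training signal only where the two models disagree.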

DPAL-BERT: A Faster and Lighter Question Answering Model

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Language Model Knowledge Distillation for Efficient Question Answering in Spanish

Patient Knowledge Distillation for BERT Model Compression

Structured Pruning of a BERT-based Question Answering Model

Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

QEKD: Query-Efficient and Data-Free Knowledge Distillation from Black-box Models

ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder Via Self On-the-fly Distillation for Dense Passage Retrieval

LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding

[Knowledge about genotype-phenotype of the diseases should be coming into pediatrician's horizon].

Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation

Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT

AdaDS: Adaptive Data Selection for Accelerating Pre-Trained Language Model Knowledge Distillation

Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models