Automatic Webpage Briefing

Yimeng Dai, Rui Zhang, Jianzhong Qi
DOI: https://doi.org/10.1109/ICDE51399.2021.00152
2021-01-01
Abstract: We introduce the task of webpage briefing (WB) to provide a summary of a webpage in a hierarchical manner, from the broad topic of the webpage to finer-level key attributes. A straightforward approach for this task is to train a machine learning model for generating topics and extracting key attributes. However, such a model may not perform well on webpages from domains not seen in the training data. An ideal model should be able to adapt to unseen domains while preserving knowledge learned from the seen domains. Knowledge distillation (KD) offers a potential solution, in which a teacher pre-trained on specific domains can pass its knowledge to a student, while unseen domains can also be added to increase the robustness of the models. However, existing work usually assumes that the models have no access to seen domains during distillation, so knowledge of the seen domains may be lost. In our setting, we have access to the generated topics, which contain representative knowledge of seen domains and can help preserve that knowledge during distillation. Moreover, vanilla KD does not pass on knowledge about the location patterns of the informative contents in webpages, which are essential for identifying the topics to be generated or the key attributes to be extracted. To preserve more knowledge of seen domains and to better utilize the location patterns, we propose a Dual Distillation model which consists of identification distillation (ID) and understanding distillation (UD); ID distills knowledge on the identification of informative contents under the guidance of the learned topics of seen domains, while UD distills knowledge on topic generation or key attribute extraction. Since topics and key attributes are distilled separately in two students in Dual Distillation, the inherent correlations between them are not utilized. To better exploit such correlations, we propose a Triple Distillation model which consists of a shared ID and two UDs, one for topic generation and the other for key attribute extraction. We further propose a joint model for WB with signal enhancement and exchange among a key attribute extractor, a topic generator, and an informative section predictor. Experiments on real-world webpages show that our models achieve high performance for WB, and validate the superiority of Dual Distillation and Triple Distillation in their target settings. Experiments also show that the proposed joint model outperforms single-task baselines and other joint models.
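
For readers unfamiliar with the vanilla KD baseline that the abstract contrasts against, the following is a minimal sketch of the commonly used temperature-softened distillation loss, in which a student is trained to match a teacher's output distribution. It is a generic illustration only and is not the paper's Dual or Triple Distillation; the function names `softmax` and `distillation_loss`, the temperature value, and the example logits are all assumptions introduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (hypothetical helper)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic vanilla-KD objective (an assumption for illustration), not the
    identification or understanding distillation described in the abstract.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three candidate topics.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. Nothing in this objective encodes where informative content sits on a page, which is the gap the abstract attributes to vanilla KD.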