EnsLM: Ensemble Language Model for Data Diversity by Semantic Clustering

Zhibin Duan, Hao Zhang, Chaojie Wang, Zhengjue Wang, Bo Chen, Mingyuan Zhou
DOI: https://doi.org/10.18653/v1/2021.acl-long.230
2021-01-01
Abstract: Natural language processing often faces the problem of data diversity, such as different domains, themes, and styles. As a result, a single language model (LM) is insufficient to learn all knowledge from diverse samples. To solve this problem, we first propose an autoencoding topic model with a mixture prior (mATM) to cluster the data, where the clusters, defined in semantic space, describe the data diversity. Having obtained the cluster assignment for each sample, we develop the ensemble LM (EnsLM) using the technique of weight modulation. Specifically, EnsLM contains a backbone that is adjusted by a few modulated weights to fit different sample clusters. The backbone thus learns the knowledge shared among all clusters, while the modulated weights extract cluster-specific features. EnsLM can be trained jointly with mATM and supports a flexible LM backbone. We evaluate the effectiveness of both mATM and EnsLM on different language understanding and generative tasks.
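The weight-modulation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the modulation, so the rank-one scaling used here (one input-side and one output-side scale vector per cluster) is an assumption, and the names `modulated_linear`, `alpha`, and `beta` are hypothetical. The sketch shows one shared weight matrix serving every cluster, with only a few extra parameters per cluster.

```python
import numpy as np


def modulated_linear(x, cluster_ids, W, alpha, beta):
    """Apply a shared linear layer whose weights are rescaled per cluster.

    x:           (batch, d_in) inputs
    cluster_ids: (batch,) integer cluster assignment of each sample
    W:           (d_in, d_out) shared backbone weights
    alpha:       (K, d_in) per-cluster input scales (assumed form)
    beta:        (K, d_out) per-cluster output scales (assumed form)
    """
    a = alpha[cluster_ids]  # (batch, d_in): scales for each sample's cluster
    b = beta[cluster_ids]   # (batch, d_out)
    # Same result as x_i @ (W * outer(a_i, b_i)) for each sample i,
    # computed without building one weight matrix per cluster.
    return ((x * a) @ W) * b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_clusters, d_in, d_out = 3, 4, 2
    W = rng.normal(size=(d_in, d_out))       # shared across all clusters
    alpha = np.ones((num_clusters, d_in))    # identity modulation to start
    beta = np.ones((num_clusters, d_out))
    x = rng.normal(size=(5, d_in))
    cluster_ids = np.array([0, 2, 1, 1, 0])  # e.g. assignments from a clustering model
    print(modulated_linear(x, cluster_ids, W, alpha, beta).shape)
```

With this parameterization each cluster adds only d_in + d_out parameters on top of the d_in * d_out shared weights, which matches the abstract's description of "a few modulated weights" in spirit, though not necessarily in detail.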