Zero-shot domain paraphrase with unaligned pre-trained language models

Zheng Chen, Hu Yuan, Jiankun Ren
DOI: https://doi.org/10.1007/s40747-022-00820-8
IF: 6.7
2022-08-01
Complex & Intelligent Systems
Abstract: Automatic paraphrase generation is an essential task in natural language processing. However, because paraphrase corpora are scarce in many languages, such as Chinese, generating high-quality paraphrases in these languages remains challenging. Domain paraphrasing is harder still, since in-domain paraphrase sentence pairs are even more difficult to obtain. In this paper, we propose a novel approach for domain-specific paraphrase generation in a zero-shot fashion. Our approach is based on a sequence-to-sequence architecture: the encoder is a pre-trained multilingual autoencoder model, and the decoder is a pre-trained monolingual autoregressive model. Because the two models are pre-trained separately, they have different representations for the same token; we therefore call them unaligned pre-trained language models. We train the sequence-to-sequence model on an English-to-Chinese machine translation corpus. Surprisingly, when given a Chinese sentence as input, the trained model generates fluent and diverse Chinese paraphrases. Since the unaligned pre-trained language models have inconsistent understandings of the Chinese language, we believe that the paraphrasing is actually performed as a Chinese-to-Chinese translation. In addition, we collect a small-scale English-to-Chinese machine translation corpus in the computer science domain. After fine-tuning on this domain-specific corpus, our model shows an excellent capability for domain paraphrasing. Experimental results show that our approach significantly outperforms previous baselines in terms of Relevance, Fluency, and Diversity.
computer science, artificial intelligence