MediaGPT: A Large Language Model For Chinese Media
Zhonghao Wang, Zijia Lu, Bo Jin, Haiying Deng
2023-07-26
Abstract: Large language models (LLMs) have shown remarkable capabilities in generating high-quality text and making predictions based on large amounts of data, including in the media domain. However, in practical applications, the differences between the media's use cases and the general-purpose applications of LLMs have become increasingly apparent, especially for Chinese. This paper examines the unique characteristics of media-domain-specific LLMs compared to general LLMs, designs a diverse set of task instruction types to cater to the specific requirements of the domain, and constructs unique datasets tailored to the media domain. Based on these, we propose MediaGPT, a domain-specific LLM for the Chinese media domain, trained on domain-specific data and expert SFT data. Through human expert evaluation and strong-model evaluation on a validation set, this paper demonstrates that MediaGPT outperforms mainstream models on various Chinese media domain tasks and verifies the importance of domain data and domain-defined prompt types for building an effective domain-specific LLM.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main goal of this paper is to propose a large language model (LLM) specifically designed for the Chinese media domain, called MediaGPT. Specifically, the paper aims to address the following issues:
1. **Limitations of general large language models in the Chinese media domain**: Existing general large language models perform poorly when handling Chinese media data, mainly due to their inability to meet the unique needs of Chinese media, such as specific writing styles, narrative structures, and political stances.
2. **Building a specialized model for the Chinese media domain**: To overcome the above limitations, the paper proposes a new model, MediaGPT, which improves performance on Chinese media tasks by using domain-specific pre-training data and carefully designed supervised fine-tuning (SFT) datasets.
3. **Validating the importance of domain data and customized prompts**: Through empirical research, the paper demonstrates the importance of domain data and well-defined prompt types in building effective domain-specific large language models.
To achieve these goals, the paper conducts research in the following areas:
- **Dataset construction**: A large amount of unlabeled pre-training data was collected from authoritative Chinese and English media institutions, and a series of supervised fine-tuning datasets were designed according to the needs of Chinese media practitioners.
- **Model design and training**: The open-source LLaMA-7B model was used as the base model; it was further pre-trained on the domain data and then fine-tuned on the SFT datasets to meet the needs of the Chinese media domain.
- **Evaluation methods**: Human expert evaluation was combined with evaluation by a strong model on a validation set to assess the quality and relevance of the generated text.
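
The evaluation step can be made concrete with a small aggregation sketch. The summary does not describe the scoring protocol, so everything here is an assumption for illustration: judgments are taken to be pairwise comparisons between a candidate model and a baseline, ties count as half a win, and the `Judgment` schema, the `win_rates` helper, and the task labels are hypothetical names rather than anything from the paper.

```python
from collections import Counter
from dataclasses import dataclass

VALID_WINNERS = {"candidate", "baseline", "tie"}


@dataclass(frozen=True)
class Judgment:
    """One pairwise verdict on a validation prompt (hypothetical schema)."""
    task_type: str  # illustrative label such as "headline"; not taken from the paper
    judge: str      # "human" for an expert rater, "model" for a strong-model judge
    winner: str     # "candidate", "baseline", or "tie"


def win_rates(judgments):
    """Return {(task_type, judge): candidate win rate}, counting a tie as half a win."""
    counts = {}
    for j in judgments:
        if j.winner not in VALID_WINNERS:
            raise ValueError(f"unexpected winner label: {j.winner!r}")
        counts.setdefault((j.task_type, j.judge), Counter())[j.winner] += 1

    rates = {}
    for key, c in counts.items():
        total = c["candidate"] + c["baseline"] + c["tie"]
        rates[key] = (c["candidate"] + 0.5 * c["tie"]) / total
    return rates


if __name__ == "__main__":
    # Made-up verdicts, only to show the call shape.
    sample = [
        Judgment("headline", "human", "candidate"),
        Judgment("headline", "human", "tie"),
        Judgment("headline", "model", "baseline"),
    ]
    for (task, judge), rate in sorted(win_rates(sample).items()):
        print(f"{task:10s} {judge:6s} {rate:.2f}")
```

Keeping human and strong-model verdicts in separate buckets makes it easy to check whether the two kinds of judges agree on each task type before pooling them.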
In summary, the paper aims to demonstrate that MediaGPT outperforms mainstream models on Chinese media domain tasks and to show the importance of domain data and customized prompt types in building effective domain-specific models.