Abstract: Large language models (LLMs) have shown remarkable capabilities in generating high-quality text and making predictions based on large amounts of data, including in the media domain. However, in practical applications, the differences between media use cases and the general-purpose applications of LLMs have become increasingly apparent, especially for Chinese. This paper examines the unique characteristics of media-domain-specific LLMs compared to general LLMs, designs a diverse set of task instruction types to cater to the specific requirements of the domain, and constructs datasets tailored to the media domain. Based on these, we propose MediaGPT, a domain-specific LLM for the Chinese media domain, trained on domain-specific data and expert SFT data. Through human expert evaluation and strong-model evaluation on a validation set, this paper demonstrates that MediaGPT outperforms mainstream models on various Chinese media domain tasks and verifies the importance of domain data and domain-defined prompt types for building an effective domain-specific LLM.

What problem does this paper attempt to address?

The main goal of this paper is to propose MediaGPT, a large language model (LLM) designed specifically for the Chinese media domain. The paper addresses the following issues:

1. **Limitations of general large language models in the Chinese media domain**: Existing general-purpose LLMs perform poorly on Chinese media data, mainly because they cannot meet the specific needs of Chinese media, such as particular writing styles, narrative structures, and political stances.
2. **Building a specialized model for the Chinese media domain**: To overcome these limitations, the paper introduces MediaGPT, which improves performance on Chinese media tasks by using domain-specific pre-training data and carefully designed supervised fine-tuning (SFT) datasets.
3. **Validating the importance of domain data and customized prompts**: Through empirical evaluation, the paper demonstrates the importance of domain data and well-defined prompt types in building effective domain-specific LLMs.

To achieve these goals, the paper works in the following areas:

- **Dataset construction**: A large amount of unlabeled pre-training data was collected from authoritative Chinese and English media institutions, and a series of supervised fine-tuning datasets were designed according to the needs of Chinese media practitioners.
- **Model design and training**: The open-source LLaMA-7B model was used for pre-training and fine-tuning to meet the needs of the Chinese media domain.
- **Evaluation methods**: A combination of human expert evaluation and strong-model evaluation was used to quantitatively assess the quality and relevance of the generated text.

In summary, the paper aims to demonstrate the superior performance of MediaGPT on Chinese media domain tasks and emphasizes the importance of domain data and customized prompt types in building effective domain-specific models.
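To make the evaluation step more concrete, here is a minimal Python sketch of how judgments from human experts and a strong model might be tallied into per-judge win rates. It is an illustration only, not the paper's procedure: the pairwise-comparison setup, the record fields (`task`, `judge`, `winner`), the task names, and the sample judgments are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical judgment records: each one compares MediaGPT with a baseline
# model on a single validation prompt. Field names, task names, and the
# judgments themselves are illustrative assumptions, not data from the paper.
judgments = [
    {"task": "headline_generation", "judge": "human", "winner": "MediaGPT"},
    {"task": "headline_generation", "judge": "human", "winner": "baseline"},
    {"task": "news_summary", "judge": "human", "winner": "MediaGPT"},
    {"task": "news_summary", "judge": "human", "winner": "tie"},
    {"task": "headline_generation", "judge": "strong_model", "winner": "MediaGPT"},
    {"task": "news_summary", "judge": "strong_model", "winner": "MediaGPT"},
]


def win_rates(records, model="MediaGPT"):
    """Return {judge: fraction of non-tie comparisons won by `model`}."""
    wins = Counter()
    totals = Counter()
    for record in records:
        if record["winner"] == "tie":
            continue  # ties are excluded from the denominator
        totals[record["judge"]] += 1
        if record["winner"] == model:
            wins[record["judge"]] += 1
    return {judge: wins[judge] / totals[judge] for judge in totals}


rates = win_rates(judgments)
for judge, rate in rates.items():
    print(f"{judge}: {rate:.2f}")
```

Reporting the two judge types separately keeps any disagreement between human experts and the strong model visible instead of averaging it away.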

MediaGPT : A Large Language Model For Chinese Media

AcademicGPT: Empowering Academic Research

DB-GPT: Large Language Model Meets Database

UrbanGPT: Spatio-Temporal Large Language Models

NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

DoctorGPT: A Large Language Model with Chinese Medical Question-Answering Capabilities

Radiology-GPT: A Large Language Model for Radiology

Large Language Models as Data Preprocessors

An Evaluation of Large Language Models in Bioinformatics Research

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation

EventGPT: Event Stream Understanding with Multimodal Large Language Models

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

CourseGPT-zh: an Educational Large Language Model Based on Knowledge Distillation Incorporating Prompt Optimization

VLM-Eval: A General Evaluation on Video Large Language Models

HPC-GPT: Integrating Large Language Model for High-Performance Computing

GPT-4V(ision) as A Social Media Analysis Engine

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages