Abstract: Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we develop a new IP-oriented multilingual large language model (called MoZi), a BLOOMZ-based model that has undergone supervised fine-tuning on multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE and ChatGLM by a noticeable margin, while it scores lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark leaves much room for improvement, and even the most powerful model, ChatGPT, does not reach the passing level. Our source code, data, and models are available at \url{

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the performance evaluation issues of large language models (LLMs) in the field of intellectual property (IP). Specifically, the paper focuses on the following aspects:

1. **Lack of specialized benchmarks**: Although existing large language models perform well in natural language processing tasks, their performance in specific domains (such as intellectual property) remains unclear. The paper points out that there is currently a lack of benchmarks specifically designed to evaluate these models' understanding and application effectiveness in the field of intellectual property.
2. **Insufficient multilingual support**: Intellectual property involves multiple languages, but existing benchmarks and models typically support only a few languages. Therefore, a benchmark that supports multiple languages is needed to comprehensively evaluate the models' performance.
3. **Limitations of model performance**: Existing large language models have significant shortcomings when handling tasks related to intellectual property, especially in understanding complex intellectual property concepts and regulations. The paper aims to evaluate and improve these models' performance by constructing new benchmarks and models.

### Solutions

To address the above issues, the paper proposes the following solutions:

1. **Constructing the MoZIP benchmark**: The paper introduces a new multilingual intellectual property benchmark (MoZIP), which includes three challenging tasks:
   - **IPQuiz**: Multiple-choice questions on intellectual property to evaluate the model's understanding of intellectual property concepts and regulations.
   - **IPQA**: Intellectual property question answering to assess the model's ability to understand intellectual property-related questions.
   - **PatentMatch**: Patent matching to evaluate the model's understanding of inventions described in patent documents and its ability to distinguish between different patents.
2. **Developing the MoZi model**: The paper proposes a new large-scale multilingual language model oriented towards intellectual property (MoZi), based on the BLOOMZ model, and trained through supervised fine-tuning on multilingual intellectual property-related text data. The MoZi model undergoes three stages of fine-tuning:
   - **Patent pre-training**: Using 24 million official patent documents to familiarize the model with the writing style and technical details of patents.
   - **General instruction fine-tuning**: Using 3 million general instruction data from multiple public datasets to enhance the model's general capabilities.
   - **Intellectual property instruction fine-tuning**: Using 58,874 self-constructed multilingual Q&A data, Chinese intellectual property-related legal clauses, and multi-turn dialogue data generated by ChatGPT to further improve the model's understanding in the field of intellectual property.
3. **Experimental evaluation**: The paper evaluates the MoZi model and four other well-known large-scale language models (BLOOMZ, BELLE, ChatGLM, and ChatGPT) on the MoZIP benchmark. Experimental results show that MoZi significantly outperforms BLOOMZ, BELLE, and ChatGLM on multiple tasks, but still falls short of ChatGPT on some tasks. Overall, there is still considerable room for improvement in the performance of current large-scale language models in the field of intellectual property.

### Conclusion

By constructing the MoZIP benchmark and developing the MoZi model, the paper provides important tools and methods for evaluating and improving the performance of large-scale language models in the field of intellectual property. Although the MoZi model performs well on multiple tasks, further research and improvements are needed to achieve higher performance levels.
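
To make the evaluation setup more concrete, here is a minimal Python sketch of how a multiple-choice task such as IPQuiz or PatentMatch could be scored by accuracy. It is not the MoZIP evaluation code: the function names `extract_choice` and `accuracy`, the four-option A-D format, and the regex heuristic for pulling an option letter out of a free-form model response are all illustrative assumptions.

```python
import re
from typing import List, Optional


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone uppercase option letter in a response.

    Hypothetical heuristic: a real harness may use stricter prompting
    or parsing. Returns None when no option letter is found.
    """
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted choice equals the gold label."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    # Unparseable responses (None) never equal a gold letter, so they count as wrong.
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Toy model outputs and gold labels, invented for illustration only.
    model_outputs = ["A", "The answer is C", "D) because the claims overlap"]
    gold_labels = ["A", "B", "D"]
    print(f"accuracy = {accuracy(model_outputs, gold_labels):.3f}")
```

Because responses that contain no option letter are scored as incorrect, the parsing heuristic directly affects the reported accuracy. That is worth keeping in mind when comparing chat-style models that answer in full sentences.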

MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
