Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, Alham Fikri Aji

2023-09-13

Abstract: While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
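The zero-shot setup described in the abstract, where results depend on both the prompt template and the language pairing, can be pictured as a grid of prompts. The following is a minimal sketch under stated assumptions: the template wording, the topic list, the pairing of each language with English, and the separate handling of Singlish are all hypothetical illustrations, not the prompts used in the paper.

```python
from itertools import product

# Languages named in the abstract. Pairing each one with English is an
# assumption made for this sketch.
SEA_LANGUAGES = ["Indonesian", "Malay", "Chinese", "Tagalog", "Vietnamese", "Tamil"]

# Hypothetical prompt templates (not the paper's wording).
PAIR_TEMPLATES = [
    "Write a sentence about {topic} that mixes English and {language}.",
    "Imagine you are a bilingual English-{language} speaker. Talk about {topic}.",
]

# Singlish is an English-based creole, so this sketch asks for it directly
# instead of asking for a mix of two languages (also an assumption).
SINGLISH_TEMPLATES = ["Write a sentence about {topic} in Singlish."]


def build_prompts(topics):
    """Return one prompt record per (language, template, topic) combination."""
    prompts = []
    for language, template, topic in product(SEA_LANGUAGES, PAIR_TEMPLATES, topics):
        prompts.append({
            "language": language,
            "prompt": template.format(topic=topic, language=language),
        })
    for template, topic in product(SINGLISH_TEMPLATES, topics):
        prompts.append({
            "language": "Singlish",
            "prompt": template.format(topic=topic),
        })
    return prompts


if __name__ == "__main__":
    for record in build_prompts(["food", "weather"])[:3]:
        print(record["language"], "|", record["prompt"])
```

Each record would then be sent to a model with no in-context examples, and the outputs grouped by language and template to see where quality varies.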

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper examines whether multilingual large language models (LLMs) can generate code-mixed text. Specifically, the researchers prompt these models in a zero-shot manner to produce code-mixed data for seven South East Asian languages and compare how different models perform. The study finds that ChatGPT's ability is inconsistent: it generates fluent, natural Singlish, but for the English-Tamil pair its output is mostly ungrammatical or semantically meaningless, and its performance varies with the prompt template and language pairing. Publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL fail to produce texts with phrases or clauses from different languages. The models may also introduce languages not specified in the prompt. The authors therefore advise against using current LLMs to generate synthetic code-mixed data without extensive human checks.
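One of the failure modes noted above, output that contains a language not requested in the prompt, can sometimes be caught early by checking which writing systems appear in the text. The sketch below is an illustrative heuristic and not a method from the paper: the function names are hypothetical, and it relies only on Unicode character names from Python's standard `unicodedata` module. It cannot tell apart languages that share the Latin script (for example Indonesian, Malay, Tagalog, and English), so it is at most a coarse pre-filter ahead of human review.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script labels for the letters in `text`.

    The label is the first word of each letter's Unicode name,
    e.g. 'LATIN', 'TAMIL', or 'CJK'. Non-letters are ignored.
    """
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split(" ")[0])
    return found


def unexpected_scripts(text, allowed):
    """Return scripts present in `text` that are not in the `allowed` set."""
    return scripts_in(text) - set(allowed)


if __name__ == "__main__":
    # An English-Tamil prompt should only yield Latin and Tamil letters.
    print(unexpected_scripts("I want to eat சாப்பாடு now", {"LATIN", "TAMIL"}))
    # An English-Malay prompt should not yield Chinese characters.
    print(unexpected_scripts("Saya suka 你好", {"LATIN"}))
```

An empty result does not mean the text is acceptable; it only means no unexpected script was detected.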
