Abstract: Cultural bias is pervasive in many large language models (LLMs), largely because of a shortage of data representative of different cultures. Typically, cultural datasets and benchmarks are constructed either by extracting subsets of existing datasets or by aggregating content from platforms such as Wikipedia and social media. However, these approaches depend heavily on real-world data and human annotation, making them costly and difficult to scale. Inspired by cognitive theories of social communication, this paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection. CulturePark simulates cross-cultural human communication with LLM-based agents playing roles in different cultures. It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs. Using CulturePark, we generated 41,000 cultural samples to fine-tune eight culture-specific LLMs. We evaluated these models on three downstream tasks: content moderation, cultural alignment, and cultural education. Results show that for content moderation, our GPT-3.5-based models either match or outperform GPT-4 on the evaluation datasets. On cultural alignment, our models surpass GPT-4 on Hofstede's VSM 13 framework. Furthermore, for cultural education of human participants, our models achieve better outcomes than GPT-4 in both learning efficacy and user experience. CulturePark is an important step toward addressing cultural bias and advancing the democratization of AI, highlighting the critical role of culturally inclusive data in model training. Code is released at https://github.com/Scarelette/CulturePark.

CulturePark: Boosting Cross-cultural Understanding in Large Language Models
