Abstract: Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues. Controversial topics, including vaccination, abortion, racism, and sexual orientation, often elicit opinions and attitudes that are not solely based on evidence but rather reflect moral worldviews. Recent advances in Natural Language Processing (NLP) show that moral values can be gauged in human-generated textual content. Building on the Moral Foundations Theory (MFT), this paper introduces MoralBERT, a range of language representation models fine-tuned to capture moral sentiment in social discourse. We describe a framework for both aggregated and domain-adversarial training on multiple heterogeneous MFT human-annotated datasets sourced from Twitter (now X), Reddit, and Facebook that broaden textual content diversity in terms of social media audience interests, content presentation and style, and spreading patterns. We show that the proposed framework achieves an average F1 score that is between 11% and 32% higher than lexicon-based approaches, Word2Vec embeddings, and zero-shot classification with large language models such as GPT-4 for in-domain inference. Domain-adversarial training yields better out-of-domain predictions than aggregated training while achieving comparable performance to zero-shot learning. Our approach contributes to annotation-free and effective morality learning, and provides useful insights towards a more comprehensive understanding of moral narratives in controversial social debates using NLP.

Metadata Might Make Language Models Better

Language Models Learn Metadata: Political Stance Detection Case Study

Making Metadata More FAIR Using Large Language Models

MoralBERT: A Fine-Tuned Language Model for Capturing Moral Values in Social Discussions

Towards Effective Time-Aware Language Representation: Exploring Enhanced Temporal Understanding in Language Models

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue

NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents

Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Enriching Language Models with Graph-Based Context Information to Better Understand Textual Data

Time-Aware Language Models as Temporal Knowledge Bases

Temporal Effects on Pre-trained Models for Language Processing Tasks

Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models

Time Matters: Examine Temporal Effects on Biomedical Language Models

Efficient Continue Training of Temporal Language Model with Structural Information

Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality

hmBERT: Historical Multilingual Language Models for Named Entity Recognition

Investigating Masking-based Data Generation in Language Models

Implicit meta-learning may lead language models to trust more reliable sources