Abstract: As Large Language Models are deployed within Artificial Intelligence systems that are increasingly integrated with human society, it becomes more important than ever to study their internal structures. Higher-level abilities of LLMs such as GPT-3.5 emerge in large part due to the informative language representations they induce from raw text data during pre-training on trillions of words. These embeddings exist in vector spaces of several thousand dimensions, and their processing involves mapping between multiple vector spaces, with a total number of parameters on the order of trillions. Furthermore, these language representations are induced by gradient optimization, resulting in a black-box system that is hard to interpret. In this paper, we take a look at the topological structure of neuronal activity in the "brain" of ChatGPT's foundation language model, and analyze it with respect to a metric representing the notion of fairness. We develop a novel approach to visualize GPT's moral dimensions. We first compute a fairness metric, inspired by social psychology literature, to identify factors that typically influence fairness assessments in humans, such as legitimacy, need, and responsibility. Subsequently, we summarize the manifold's shape using a lower-dimensional simplicial complex, whose topology is derived from this metric. We color it with a heat map associated with this fairness metric, producing human-readable visualizations of the high-dimensional sentence manifold. Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds corresponding to fair and unfair moral judgments. This indicates that GPT-based language models develop a moral dimension within their representation spaces and induce an understanding of fairness during their training process.

What problem does this paper attempt to address?

The core question this paper addresses is: **Have large language models (LLMs) discovered moral dimensions in their internal language representations, especially regarding the understanding of fairness?** Specifically, by studying the sentence-embedding vector space of GPT-3.5, the paper explores whether these models encode and distinguish between fair and unfair moral judgments when processing natural language. The authors use topological data analysis to visualize and analyze the structures in these high-dimensional vector spaces, and they introduce a fairness metric based on social psychology literature.

### Main problem decomposition:

1. **The importance of AI safety and alignment research**
   - With the release of large language models such as ChatGPT and GPT-4, research on AI safety and alignment has become increasingly important. These models are gradually being integrated into human society, so it is crucial to study their internal structures and potential behaviors.
2. **Spatial characteristics of language representations**
   - Large language models such as GPT-3.5 induce information-rich language representations from large amounts of text data during pre-training. These representations exist in vector spaces with thousands of dimensions, their processing involves mappings between multiple vector spaces, and the total number of parameters is on the order of trillions.
3. **Interpretability challenges of black-box systems**
   - Because these language representations are obtained through gradient optimization, they form a black-box system whose behavior is difficult to understand and explain.
4. **Introduction of a fairness metric**
   - The authors compute a fairness metric inspired by social psychology literature. It takes into account factors that typically influence human fairness assessments, such as legitimacy, need, and responsibility.
5. **Application of topological data analysis**
   - Using methods from computational algebraic topology, the authors summarize the high-dimensional sentence-embedding manifold as a lower-dimensional simplicial complex and color it with a heat map of the fairness metric. This visualization reveals the separation between fair and unfair moral judgments in the sentence-embedding space.

### Conclusion:

- The results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds, corresponding to fair and unfair moral judgments respectively. The authors take this as evidence that GPT-based language models develop a moral dimension in their representation space and induce an understanding of fairness during training.
- The method offers a tool for examining these models' capabilities intrinsically, through their internal representations, rather than relying solely on evaluation of external behavior, which supports AI safety and alignment research.
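The topological step described above (summarizing an embedding cloud as a low-dimensional complex colored by a fairness score) can be illustrated with a Mapper-style construction. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not name the exact algorithm, and the functions `connected_components` and `mapper_graph`, the synthetic 8-dimensional "embeddings", the made-up fairness scores, and every parameter value are hypothetical stand-ins. Only the 1-skeleton (nodes and edges) of the complex is built.

```python
import numpy as np

def connected_components(points, eps):
    """Single-linkage clustering: points closer than eps end up in one cluster."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first flood fill over the eps-neighborhood graph
            i = stack.pop()
            for j in np.nonzero(dists[i] < eps)[0]:
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def mapper_graph(embeddings, filter_values, n_intervals=5, overlap=0.3, eps=1.0):
    """Build a Mapper-style graph using filter_values as the lens function."""
    lo, hi = filter_values.min(), filter_values.max()
    length = (hi - lo) / n_intervals
    nodes = []  # each node is the set of point indices in one local cluster
    for k in range(n_intervals):
        # Overlapping intervals covering the range of the lens function.
        a = lo + k * length - overlap * length
        b = lo + (k + 1) * length + overlap * length
        idx = np.nonzero((filter_values >= a) & (filter_values <= b))[0]
        if len(idx) == 0:
            continue
        labels = connected_components(embeddings[idx], eps)
        for c in np.unique(labels):
            nodes.append(set(idx[labels == c].tolist()))
    # Two nodes are joined by an edge when their clusters share a point.
    edges = [(i, j) for i in range(len(nodes))
             for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]]
    # Heat-map value per node: mean lens value of its member points.
    colors = [float(np.mean(filter_values[sorted(node)])) for node in nodes]
    return nodes, edges, colors

# Synthetic stand-in data (assumption): two well-separated blobs playing the
# role of "unfair" and "fair" sentence embeddings, with made-up scores.
rng = np.random.default_rng(0)
unfair = rng.normal(0.0, 0.3, size=(40, 8))
fair = rng.normal(10.0, 0.3, size=(40, 8))
embeddings = np.vstack([unfair, fair])
scores = np.concatenate([rng.uniform(-1.0, -0.5, 40),
                         rng.uniform(0.5, 1.0, 40)])

nodes, edges, colors = mapper_graph(embeddings, scores, eps=3.0)
mixed = [(i, j) for i, j in edges if colors[i] * colors[j] < 0]
print(len(nodes), "nodes,", len(edges), "edges,",
      len(mixed), "edges joining fair and unfair nodes")
```

On this toy data the graph falls into two disconnected pieces, one with negative node colors and one with positive, which mirrors in miniature the kind of fair/unfair separation the paper reports. With real sentence embeddings the choice of lens, cover, and clustering threshold would all need tuning, and separation is not guaranteed.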

Do Large GPT Models Discover Moral Dimensions in Language Representations? A Topological Study Of Sentence Embeddings
