Abstract: When we experience a visual stimulus as beautiful, how much of that experience derives from perceptual computations we cannot describe versus conceptual knowledge we can readily translate into natural language? Disentangling perception from language in visually-evoked affective and aesthetic experiences through behavioral paradigms or neuroimaging is often empirically intractable. Here, we circumnavigate this challenge by using linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal (language-aligned) deep neural network (DNN) models to predict human beauty ratings of naturalistic images. We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision. Unimodal language models (e.g. GPT2) conditioned on visual embeddings to generate captions (via CLIPCap) yield no further gains. Caption embeddings alone yield less accurate predictions than image and caption embeddings combined (concatenated). Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.

### What problem does this paper attempt to solve?

This paper explores the relationship between perception and language in visual aesthetic experience (our sense of beauty). Specifically, the authors address the following questions:

1. **Relationship between perception and language**: When we find a visual stimulus beautiful, how much of that experience is determined by perceptual computations we cannot describe in words, and how much by conceptual knowledge we can express in natural language?
2. **Disentangling perception and language**: Separating the roles of perception and language in visually induced emotions and aesthetic experiences is often empirically difficult with behavioral paradigms or neuroimaging. The authors therefore propose a different approach: using linear decoding to evaluate how well different types of deep neural network models (unimodal vision, unimodal language, and multimodal language-aligned models) predict human aesthetic ratings of natural images.
3. **Model performance**:
   - Can unimodal vision models (such as SimCLR) explain most of the variance in aesthetic ratings?
   - Do language-aligned vision models (such as SLIP) bring additional gains?
   - Do captions generated by combining unimodal language models (such as GPT2) with visual embeddings further improve prediction accuracy?
   - Which is more accurate: caption embeddings alone, or image and caption embeddings combined?
4. **Ineffability**: By comparing these models, the researchers hope to better understand the "ineffable" part of aesthetic experience, that is, whether perceptual computations that cannot be fully described in words provide a sufficient basis for it.

### Research background

Aesthetic experience (the sense of beauty) is a universal phenomenon, but there is no consensus on its mechanisms, functions, or structure. For centuries, people have debated why we experience beauty and where it comes from. A central theme is ineffability, that is, the extent to which we can describe experiences of beauty in natural language. Because emotional self-reports are subjective, researchers try to operationalize ineffability more precisely by locating it in, or attributing it to, a specific process.

### Research methods

The authors used the 900 images of the OASIS dataset, which come with arousal and valence ratings on a 7-point scale and were later supplemented with aesthetic ratings. The score for each image is the average across 100 to 110 raters. To predict these group-averaged aesthetic scores, they applied cross-validated, regularized linear regression to features extracted from pre-trained deep neural network models that were never trained for aesthetic prediction.

### Research results

- **Unimodal vision models**: Purely unimodal vision models (such as contrastive learning models) can explain up to 75% of the explainable variance.
- **Multimodal vision models**: The CLIP model (a language-aligned model) shows a significant gain, explaining 80.5% to 87% of the explainable variance.
- **Language models through captions**: Converting visual embeddings into language embeddings and generating captions showed that language processing provides no additional decoding power, although machine-generated captions can still explain a certain proportion of the variance in aesthetic scores.

### Discussion

The research shows that although perceptual processes (such as feedforward, hierarchical, pre-symbolic visual feature extraction) are currently the best predictors of aesthetic scores, language alignment may play a statistically significant role in shaping these representations. In addition, the visual semantics that support successful prediction can be partially translated into machine-generated natural language descriptions, suggesting that aesthetic ineffability may be a matter of degree rather than binary (effable or ineffable).
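
The decoding procedure summarized above (cross-validated, regularized linear regression from frozen model features to group-averaged ratings) can be illustrated with a minimal sketch. It is not the authors' code: the features and ratings are synthetic, and the ridge penalty `alpha`, the 5-fold split, the 8-dimensional "caption" features, and the `NOISE_CEILING` value used to express scores as a share of explainable variance are all assumptions made for illustration only.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def cross_validated_r2(X, y, alpha=1.0, n_folds=5, seed=0):
    """R^2 of held-out predictions: each image is predicted by a model fit on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_pred = np.empty(len(y))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        w, b = ridge_fit(X[train], y[train], alpha)
        y_pred[test_idx] = X[test_idx] @ w + b
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins (assumptions): 900 "images", 32-d image features,
# and ratings that are a noisy linear function of those features.
rng = np.random.default_rng(0)
n_images, n_dims = 900, 32
image_feats = rng.normal(size=(n_images, n_dims))
w_true = rng.normal(size=n_dims)
w_true /= np.linalg.norm(w_true)
ratings = image_feats @ w_true + 0.5 * rng.normal(size=n_images)

# Hypothetical "caption" features: a noisy low-dimensional projection of the image features.
caption_feats = image_feats @ rng.normal(size=(n_dims, 8)) + rng.normal(size=(n_images, 8))

feature_sets = {
    "image": image_feats,
    "caption": caption_feats,
    "image+caption": np.hstack([image_feats, caption_feats]),  # concatenation
}

NOISE_CEILING = 0.9  # hypothetical rating reliability, not a value from the paper

scores = {name: cross_validated_r2(X, ratings) for name, X in feature_sets.items()}
for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}, share of explainable variance = {r2 / NOISE_CEILING:.3f}")
```

Dividing the held-out R^2 by a reliability estimate is one common way to report "explainable variance"; how the authors actually estimated that ceiling is not stated in this summary.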

Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics
