Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics

Colin Conwell, Christopher Hamblin, Chelsea Boccagno, David Mayo, Jesse Cummings, Leyla Isik, Andrei Barbu
2024-10-31
Abstract: When we experience a visual stimulus as beautiful, how much of that experience derives from perceptual computations we cannot describe versus conceptual knowledge we can readily translate into natural language? Disentangling perception from language in visually-evoked affective and aesthetic experiences through behavioral paradigms or neuroimaging is often empirically intractable. Here, we circumnavigate this challenge by using linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal (language-aligned) deep neural network (DNN) models to predict human beauty ratings of naturalistic images. We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision. Unimodal language models (e.g. GPT2) conditioned on visual embeddings to generate captions (via CLIPCap) yield no further gains. Caption embeddings alone yield less accurate predictions than image and caption embeddings combined (concatenated). Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
Computer Vision and Pattern Recognition, Computation and Language
### What problem does this paper attempt to solve?

This paper explores the relationship between perception and language in visual aesthetic experience (our sense of beauty). Specifically, the authors address the following questions:

1. **Relationship between perception and language**: When we find a visual stimulus beautiful, how much of that experience is determined by perceptual computations we cannot put into words, and how much by conceptual knowledge we can express in natural language?
2. **Disentangling perception and language**: Separating the roles of perception and language in visually evoked affective and aesthetic experiences is often empirically difficult with behavioral paradigms or neuroimaging. The authors therefore apply linear decoding to the learned representations of different types of deep neural network models (unimodal vision, unimodal language, and multimodal language-aligned models) and evaluate how well each predicts human aesthetic ratings of natural images.
3. **Model performance**:
   - Can unimodal vision models (such as SimCLR) explain most of the variance in aesthetic ratings?
   - Do language-aligned vision models (such as SLIP) bring additional gains?
   - Do captions generated by conditioning unimodal language models (such as GPT2) on visual embeddings further improve prediction accuracy?
   - Which is more accurate: caption embeddings alone, or image and caption embeddings combined?
4. **Ineffability**: By comparing these models, the researchers hope to better understand the "ineffable" part of aesthetic experience, that is, whether perceptual computations that cannot be fully described in words provide a sufficient basis for it.

### Research background

Aesthetic experience (the sense of beauty) is a universal phenomenon, but there is no consensus on its mechanisms, functions, and structures.
Since antiquity, people have debated why we experience beauty and where it comes from. A central theme is ineffability, that is, the extent to which we can describe experiences of beauty in natural language. Because affective self-reports are subjective, researchers try to better operationalize ineffability by locating it in, or attributing it to, a specific process.

### Research methods

The authors used 900 images from the OASIS dataset, which come with arousal and valence ratings on a 7-point scale and were later supplemented with aesthetic ratings. The score for each image is the average across 100 to 110 raters. To predict these group-averaged aesthetic scores, the authors fit cross-validated regularized linear regressions on features extracted from pre-trained deep neural network models that were never trained for aesthetic prediction.

### Research results

- **Unimodal vision models**: Purely unimodal vision models (such as contrastive learning models) can explain up to 75% of the explainable variance.
- **Multimodal vision models**: The CLIP model (a language-aligned model) shows a significant gain, explaining 80.5% to 87% of the explainable variance.
- **Language models through captions**: When visual embeddings are converted into language embeddings by generating captions, the language processing provides no additional decoding power, but machine-generated captions can still explain a certain proportion of the variance in aesthetic scores.

### Discussion

The research shows that although perceptual processes (such as feedforward, hierarchical, pre-symbolic visual feature extraction) are currently the best predictors of aesthetic scores, language (alignment) may play a statistically significant role in shaping these representations.
In addition, the visual semantics that support successful prediction can be partially translated into machine-generated natural language descriptions, suggesting that aesthetic ineffability may be a matter of degree rather than binary (effable or ineffable).
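
The decoding approach described under Research methods (cross-validated regularized linear regression from frozen model features to group-averaged ratings) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the features and ratings are synthetic random placeholders, the ridge penalty `alpha` is fixed rather than tuned, the score is a plain Pearson correlation rather than explainable variance, and the helper names (`ridge_fit`, `cross_validated_predictions`, `decoding_score`) are hypothetical. It uses only NumPy.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def cross_validated_predictions(X, y, alpha=1.0, n_folds=5, seed=0):
    """Return a held-out prediction for every item via k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        w, b = ridge_fit(X[train_mask], y[train_mask], alpha)
        preds[test_idx] = X[test_idx] @ w + b
    return preds


def decoding_score(X, y, **kwargs):
    """Pearson correlation between held-out predictions and the actual ratings."""
    preds = cross_validated_predictions(X, y, **kwargs)
    return float(np.corrcoef(preds, y)[0, 1])


# Synthetic stand-ins (NOT real data): 900 "images" and two feature sets.
rng = np.random.default_rng(42)
n_images = 900
image_feats = rng.normal(size=(n_images, 64))    # placeholder for vision-model embeddings
caption_feats = rng.normal(size=(n_images, 32))  # placeholder for caption embeddings
true_w = rng.normal(size=64)
# Fake group-averaged ratings that depend only on the image features, plus noise.
ratings = image_feats @ true_w + rng.normal(scale=2.0, size=n_images)

score_image = decoding_score(image_feats, ratings, alpha=10.0)
score_caption = decoding_score(caption_feats, ratings, alpha=10.0)
# "Combined" means simple concatenation of the two feature sets.
combined_feats = np.hstack([image_feats, caption_feats])
score_combined = decoding_score(combined_feats, ratings, alpha=10.0)

print(f"image: {score_image:.3f}  caption: {score_caption:.3f}  combined: {score_combined:.3f}")
```

In this toy setup the ratings are constructed from the image features alone, so the caption features score near zero and concatenation adds little. With real embeddings, how these three scores compare is the empirical question the paper investigates.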