With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models

Tyler Loakman, Yucheng Li, Chenghua Lin
2024-10-18
Abstract: Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to "hear" via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with that of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and more task information may be required for VLMs versus their human counterparts for in silico experimentation. We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.
Computation and Language
What problem does this paper attempt to address?
This paper explores whether large language models (LLMs) and vision-language models (VLMs) can implicitly understand sound-based phenomena, especially sound symbolism, through abstract reasoning over text and images alone. Specifically, the researchers test how closely these models reproduce human psycholinguistic behaviour in three experiments:

1. **Shape Symbolism**: The classic Kiki-Bouba effect, testing whether the model associates sharp or rounded shapes with specific pseudo-words (such as "Kiki" and "Bouba") in the way humans do.
2. **Magnitude Symbolism**: Analogous to the Kiki-Bouba effect, but testing the model's grasp of magnitude, that is, whether the model judges the size of an object according to different vowels (such as "Mil" and "Mal").
3. **Iconicity Rating**: The model rates the iconicity of a series of English words, that is, how closely the form of a word resembles the concept it describes.

### Research Background and Motivation

- **Sound symbolism** refers to a non-arbitrary association between sounds and concepts. For example, onomatopoeic words such as "bang", "shriek" and "bellow" imitate the concepts they describe through the way they are pronounced.
- The researchers want to verify whether LLMs and VLMs, exposed only to text and images, can indirectly acquire "phonetic" knowledge from orthographic and visual information and show a human-like understanding of sound symbolism.

### Experimental Design

To test these hypotheses, the researchers conducted the following experiments:

1. **Shape Symbolism Experiment**: Use DALL-E 3 to generate a series of images of sharp or rounded entities, and ask the model to select the pseudo-word that best describes each entity. Agreement with human labels varies across models, and some models, such as GPT-4, perform well under certain conditions.
2. **Magnitude Symbolism Experiment**: Likewise use DALL-E 3 to generate images of entities representing "small" or "large", and ask the model to select the pseudo-word that best describes each entity. GPT-4 shows a high rate of agreement with human labels under multiple conditions.
3. **Iconicity Rating Experiment**: Use multiple modern LLMs to rate the iconicity of a set of English words and compare the ratings with human ratings. Rating ability is related to the number of parameters: larger models such as GPT-4 correlate more strongly with the human ratings.

### Main Contributions

- **Replicating Classic Experiments**: Replicate the classic Kiki-Bouba and Mil-Mal experiments with a range of open-source and closed-source VLMs, and explore the models' understanding of the association between phonetic and orthographic forms and entity features.
- **In-depth Analysis**: Analyze the performance of multiple LLMs on the iconicity rating task and compare it with a large-scale dataset of human ratings.
- **Discussing Potential Sources**: Explore potential reasons why LLMs and VLMs show sound-symbolism ability, and directions for future improvement.

Through these experiments, the researchers hope to reveal the potential mechanisms of multimodal perception in language models and to inform the development of more effective natural language processing algorithms, especially ones that take more abstract levels of human perception into account in tasks such as sentiment analysis, emotion recognition and content generation.