Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation

Matthew Caren, Kartik Chandra, Joshua B. Tenenbaum, Jonathan Ragan-Kelley, Karima Ma
DOI: https://doi.org/10.1145/3680528.3687679
2024-09-20
Abstract: We present a method for automatically producing human-like vocal imitations of sounds: the equivalent of "sketching," but for auditory rather than visual representation. Starting with a simulated model of the human vocal tract, we first try generating vocal imitations by tuning the model's control parameters to make the synthesized vocalization match the target sound in terms of perceptually-salient auditory features. Then, to better match human intuitions, we apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners. Finally, we show through several experiments and user studies that when we add this type of communicative reasoning to our method, it aligns with human intuitions better than matching auditory features alone does. This observation has broad implications for the study of depiction in computer graphics.
Graphics, Computation and Language, Human-Computer Interaction, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "Sketching With Your Voice: 'Non-Phonorealistic' Rendering of Sounds via Vocal Imitation" addresses how to convey sounds through vocal imitation. Specifically, the authors ask how the human voice can produce the auditory equivalent of a "sketch": an imitation that evokes the intended sound in a listener's mind.

### Background and Motivation

In the visual domain, researchers have studied how sketches (i.e., non-photorealistic renderings) abstract visual experience to evoke specific percepts in viewers. Comparable "non-phonorealistic" research on sound is relatively scarce. This paper aims to fill that gap by exploring how the human voice can imitate a variety of sounds and evoke the corresponding percepts in listeners.

### Main Contributions

1. **Method Design**:
   - The authors propose a method for automatically generating human-like vocal imitations, based on a simulated model of the human vocal tract.
   - They first tune the model's control parameters so that the synthesized vocalization matches the target sound in terms of perceptually salient auditory features.
   - Then, to better match human intuitions, they apply a cognitive theory of communication that accounts for how speakers reason strategically about their listeners.
2. **Experimental Verification**:
   - Through several experiments and user studies, the authors evaluate the vocal imitations their method generates.
   - The results show that the model with communicative reasoning aligns with human intuitions better than the model that only matches auditory features.
3. **Application Prospects**:
   - This work not only helps generate vocal imitations but also offers a new perspective for understanding them.
   - The authors hope that this research can provide a basis for non-phonorealistic sound design and new intuitive interfaces, helping artists and designers search for and create sounds more efficiently.

### Formula Explanation

- **Probability Distribution of Feature Matching**:

  \[ p_S(u|r) \propto \exp\left(\beta \cdot \frac{\phi(r) \cdot \phi(u)}{|\phi(r)| |\phi(u)|}\right) \]

  where \( p_S(u|r) \) is the probability that the speaker selects the vocalization \( u \) given the target sound \( r \); \( \phi \) extracts auditory features, so the fraction is the cosine similarity between the features of \( r \) and \( u \); and \( \beta \) is a tuning parameter.

- **Probability Distribution with Communication Reasoning**:

  \[ p_{S2}(u|r) \propto \exp(\beta \cdot p_L(r|u)) \]

  where \( p_{S2}(u|r) \) is the probability that the speaker selects the vocalization \( u \) once communicative reasoning is taken into account; and \( p_L(r|u) \) is the probability that the listener infers the target sound \( r \) after hearing \( u \).

- **Probability Distribution Considering Cost and Utility**:

  \[ p_{S2}(u|r) \propto \exp(\beta \cdot (V_2(u,r) - c(u))) \]

  where \( c(u) \) is the cost of producing the vocalization \( u \); and \( V_2(u,r) \) is the speaker's expected utility of conveying the target sound \( r \) by producing \( u \).

### Conclusion

With these methods, the authors build a system that automatically generates human-like vocal imitations. In their experiments and user studies, the version with communicative reasoning aligns with human intuitions better than feature matching alone. The research offers new ideas for non-phonorealistic sound design and a new direction for understanding and generating vocal imitations.
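
The three speaker distributions from the Formula Explanation section can be tried out on toy data. The following is a minimal sketch and not the authors' implementation. It assumes hand-made feature vectors in place of \( \phi \), a Bayesian listener \( p_L(r|u) \propto p_S(u|r)\,p(r) \) with a uniform prior over target sounds, and \( V_2(u,r) = p_L(r|u) \), so that the third formula reduces to the second when the cost is zero. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_similarity(ref_feats, utt_feats):
    """Entry [r, u] is the cosine similarity between phi(r) and phi(u)."""
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    u = utt_feats / np.linalg.norm(utt_feats, axis=1, keepdims=True)
    return r @ u.T

def feature_matching_speaker(ref_feats, utt_feats, beta):
    """p_S(u|r) proportional to exp(beta * cos(phi(r), phi(u))); rows sum to 1."""
    return softmax(beta * cosine_similarity(ref_feats, utt_feats), axis=1)

def listener(p_s, prior=None):
    """Assumed Bayesian listener: p_L(r|u) proportional to p_S(u|r) * p(r)."""
    n_refs = p_s.shape[0]
    prior = np.full(n_refs, 1.0 / n_refs) if prior is None else np.asarray(prior)
    joint = p_s * prior[:, None]
    return joint / joint.sum(axis=0, keepdims=True)  # columns sum to 1

def communicative_speaker(p_l, beta, cost=None):
    """p_S2(u|r) proportional to exp(beta * (p_L(r|u) - c(u))); rows sum to 1."""
    cost = np.zeros(p_l.shape[1]) if cost is None else np.asarray(cost)
    return softmax(beta * (p_l - cost[None, :]), axis=1)

# Toy data (hypothetical): two target sounds that share the middle feature,
# and three candidate vocalizations that each express one feature.
ref_feats = np.array([[1.0, 1.0, 0.0],
                      [0.0, 1.0, 1.0]])
utt_feats = np.eye(3)

beta = 5.0
p_s = feature_matching_speaker(ref_feats, utt_feats, beta)
p_l = listener(p_s)
p_s2 = communicative_speaker(p_l, beta)

print("feature matching p_S(u|r):\n", p_s.round(3))
print("listener p_L(r|u):\n", p_l.round(3))
print("communicative p_S2(u|r):\n", p_s2.round(3))
```

In this toy setup the feature-matching speaker is indifferent between the first two vocalizations for the first target sound, because both have the same cosine similarity to it. The communicative speaker prefers the first vocalization, since the second is equally compatible with the other target sound and therefore leaves the listener uncertain.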