Abstract: We present "The Concatenator," a real-time system for audio-guided concatenative synthesis. Similarly to Driedger et al.'s "musaicing" (or "audio mosaicing") technique, we concatenate a set number of windows within a corpus of audio to re-create the harmonic and percussive aspects of a target audio stream. Unlike Driedger's NMF-based technique, however, we instead use an explicitly Bayesian point of view, where corpus window indices are hidden states and the target audio stream is an observation. We use a particle filter to infer the best hidden corpus states in real time. Our transition model includes a tunable parameter to control the time-continuity of corpus grains, and our observation model allows users to prioritize how quickly windows change to match the target. Because the computational complexity of the system is independent of the corpus size, our system scales to corpora that are hours long, which is an important feature in the age of vast audio data collections. Within The Concatenator module itself, composers can vary grain length, fit to target, and pitch shift in real time while reacting to the sounds they hear, enabling them to rapidly iterate ideas. To conclude our work, we evaluate our system with extensive quantitative tests of the effects of parameters, as well as a qualitative evaluation with artistic insights. Based on the quality of the results, we believe the real-time capability unlocks new avenues for musical expression and control, suitable for live performance and modular synthesis integration, which furthermore represents an essential breakthrough in concatenative synthesis technology.

### What problems does this paper attempt to solve?

The paper "The Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing" addresses several key challenges in real-time audio-guided concatenative synthesis:

1. **Real-time performance**: The paper proposes a system, "The Concatenator", that performs audio-guided concatenative synthesis in real time. Unlike traditional methods based on non-negative matrix factorization (NMF), it does not require pre-training and can adapt to any audio corpus at runtime.
2. **Large-scale audio data processing**: Modern music producers work with large amounts of audio data, which may come from cloud services, multi-sample libraries, or their own multi-track recordings. The proposed method can handle audio corpora that are several hours long without significantly increasing the computational cost.
3. **Enhancing musical expression and control**: By using Bayesian inference with a particle filter, the Concatenator lets users adjust grain length, fit to the target, and pitch shift in real time. This gives music creators new means of expression and control, and is especially suited to live performance and modular synthesis integration.
4. **Maintaining timbre characteristics**: When pursuing a closer spectral fit, traditional methods may lose the timbre characteristics of the original audio fragments. The proposed method achieves a reasonable spectral fit while preserving timbre by adjusting parameters such as the time-continuity parameter \( p_d \) and the temperature parameter \( \tau \).
5. **Optimizing computational efficiency**: The computational complexity of the proposed algorithm is independent of the corpus size, which makes large corpora practical to process. For example, for a 60-minute corpus, using 1000 particles and 5 windows per particle, it runs nearly 30 times faster than Driedger's method.

### Formula summary

- **KL-divergence loss function**:
  \[ D(V \| WH) = \sum_{m,t} \left( V_{mt} \log\left(\frac{V_{mt}}{(WH)_{mt}}\right) - V_{mt} + (WH)_{mt} \right) \]
  where \( V \) is the target spectrogram, \( W \) is the matrix of corpus spectral templates, and \( H \) is the learned activation matrix.
- **Update rule**:
  \[ H^{\ell}_{kt} = H^{\ell - 1}_{kt} \left( \frac{\sum_m W_{mk} V_{mt} / (WH^{\ell - 1})_{mt}}{\sum_m W_{mk}} \right) \]
- **State transition probability**:
  \[ p_T(\vec{s}_t = \vec{b} \mid \vec{s}_{t - 1} = \vec{a}) = \prod_{k = 0}^{p - 1} \begin{cases} p_d & \text{if } b[k] = a[k] + 1 \\ \frac{1 - p_d}{N - 1} & \text{otherwise} \end{cases} \]
- **Observation probability**:
  \[ p_O[i] = \frac{e^{-\tau D_i}}{\sum_j e^{-\tau D_j}} \]
  where \( D_i \) is the KL loss between the spectral approximation \( \vec{\Lambda}_i \) of the \( i \)-th particle and the target \( \vec{v}_t \).

Together, these changes give the Concatenator real-time performance while also scaling to large corpora, preserving timbre, and exposing expressive controls to the user.
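The formulas in the summary can be tied together in a rough NumPy sketch of one particle-filter step. This is an illustration under stated assumptions, not the authors' implementation: the function names (`kl_divergence`, `fit_activations`, `propagate`, `observation_weights`, `particle_step`) are hypothetical, \( N \) is taken to be the number of corpus windows and \( p \) the number of windows per particle, window indices wrap around at the end of the corpus, each particle's approximation \( \vec{\Lambda}_i \) is obtained by a few KL multiplicative updates on its own \( p \) templates, and resampling is omitted.

```python
import numpy as np

def kl_divergence(v, lam, eps=1e-12):
    """Generalized KL divergence D(v || lam), summed over frequency bins."""
    v = np.maximum(v, eps)
    lam = np.maximum(lam, eps)
    return float(np.sum(v * np.log(v / lam) - v + lam))

def fit_activations(W_sub, v, n_iter=10, eps=1e-12):
    """Multiplicative KL updates for activations h, with templates W_sub held fixed."""
    h = np.ones(W_sub.shape[1])
    col_sums = W_sub.sum(axis=0) + eps
    for _ in range(n_iter):
        ratio = v / (W_sub @ h + eps)
        h = h * (W_sub.T @ ratio) / col_sums
    return h

def propagate(states, p_d, N, rng):
    """Sample s_t from p_T: advance one window with probability p_d,
    otherwise jump uniformly to one of the other N - 1 windows."""
    nxt = (states + 1) % N  # assumption: wrap around at the corpus end
    jump = rng.integers(0, N - 1, size=states.shape)  # uniform on 0..N-2
    jump = jump + (jump >= nxt)  # shift past nxt so it is never chosen
    advance = rng.random(states.shape) < p_d
    return np.where(advance, nxt, jump)

def observation_weights(losses, tau):
    """Softmax of -tau * D_i; subtracting the minimum only aids stability."""
    w = np.exp(-tau * (losses - losses.min()))
    return w / w.sum()

def particle_step(W, v, states, weights, p_d, tau, rng):
    """One filter step: propagate states, score each particle, reweight."""
    states = propagate(states, p_d, W.shape[1], rng)
    losses = np.empty(len(states))
    for i, s in enumerate(states):
        W_sub = W[:, s]  # the p corpus templates chosen by particle i
        h = fit_activations(W_sub, v)
        losses[i] = kl_divergence(v, W_sub @ h)
    weights = weights * observation_weights(losses, tau)
    return states, weights / weights.sum()

# Toy usage with synthetic data (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
W = rng.random((16, 50)) + 1e-3            # 16 frequency bins, N = 50 windows
v = W[:, [3, 7]] @ np.array([1.0, 2.0])    # target frame mixing two windows
states = rng.integers(0, 50, size=(200, 2))  # 200 particles, p = 2
weights = np.full(200, 1 / 200)
states, weights = particle_step(W, v, states, weights, p_d=0.9, tau=5.0, rng=rng)
best = states[np.argmax(weights)]          # indices of the best-scoring particle
```

In this sketch the per-step cost depends on the number of particles and on \( p \), not on \( N \), which is consistent with the summary's claim that complexity is independent of corpus size. A larger `p_d` keeps particles advancing through consecutive corpus windows, while a larger `tau` concentrates weight on the particles with the lowest KL loss.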
