pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing

Johanna Devaney, Daniel McKemie, Alex Morgan
2024-12-07
Abstract: pyAMPACT (Python-based Automatic Music Performance Analysis and Comparison Toolkit) links symbolic and audio music representations to facilitate score-informed estimation of performance data in audio as well as general linking of symbolic and audio music representations with a variety of annotations. pyAMPACT can read a range of symbolic formats and can output note-linked audio descriptors/performance data into MEI-formatted files. The audio analysis uses score alignment to calculate time-frequency regions of importance for each note in the symbolic representation from which to estimate a range of parameters. These include tuning-, dynamics-, and timbre-related performance descriptors, with timing-related information available from the score alignment. Beyond performance data estimation, pyAMPACT also facilitates multi-modal investigations through its infrastructure for linking symbolic representations and annotations to audio.
Sound, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the alignment of symbolic music representations (scores) with audio, both for estimating music performance data and for multimodal processing. Specifically, the pyAMPACT toolkit aims to:

1. **Align symbolic music representations with audio representations**: Linking a score to a recording enables score-informed audio analysis, so that performance data can be extracted from the audio for each note in the score.
2. **Estimate performance data**: Estimate a range of performance-related parameters from the audio, including pitch, dynamics, and timbre. These parameters help researchers study the nuances of a music performance.
3. **Support multimodal processing**: Combine symbolic representations, audio, and other annotations to support more complex multimodal analysis tasks. For example, information from sources such as scores, audio, and motion-capture data of performers can be integrated for research.
4. **Expand functionality and compatibility**: Compared with previous tools (such as AMPACT), pyAMPACT supports more symbolic formats (such as Humdrum, MEI, MIDI, and MusicXML) and can read multiple annotation encoding formats (such as Dezrann and Humdrum analysis spines). It also provides more extensive visualization and data export functions.

### Solutions to specific problems

- **Alignment of symbols and audio**: The Dynamic Time Warping (DTW) algorithm aligns the symbolic and audio representations, and the alignment is used to estimate the time-frequency region of each note.
- **Estimation of performance parameters**:
  - **Pitch-related descriptors**: These include the mean fundamental frequency \( f_0 \), perceived pitch, vibrato rate, and vibrato depth.
    - The mean fundamental frequency \( f_0 \) is calculated as a geometric mean: \[ \text{mean } f_0=\left(\prod_{i = 1}^{N} f_{0,i}\right)^{\frac{1}{N}} \]
    - The perceived pitch is calculated as a weighted average: \[ \text{perceived pitch}=\sum_{i = 1}^{N} w_i\cdot f_{0,i} \] where the weights \( w_i \) depend on the rate of change of the frequency and include a \( \gamma \) value that controls the dynamic range of the weights.
    - The vibrato depth \( E \) and vibrato rate \( R \) are calculated as: \[ E = 2\cdot\max_k|X(k)| \] \[ R = f_n\cdot\arg\max_k|X(k)| \]
  - **Dynamics-related descriptors**: These include the mean power and shimmer, where the mean power is calculated as an arithmetic mean: \[ \text{mean power}=\frac{1}{N}\sum_{i = 1}^{N} P_i \]
  - **Timbre-related descriptors**: Various spectral features (such as bandwidth, centroid, contrast, flatness, and roll-off point) are estimated from the harmonic spectral representation, and their arithmetic mean is taken as the summary descriptor.
- **Multimodal processing**: By importing multiple annotation formats and providing flexible data export functions, pyAMPACT supplies the infrastructure for multimodal processing.

In general, pyAMPACT aims to provide a more powerful and flexible tool for music information retrieval and musicology research by improving the alignment of symbolic and audio representations and the estimation of performance parameters.
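
The pitch and dynamics formulas above can be illustrated with a small NumPy sketch. This is not pyAMPACT's implementation or API; all function names are hypothetical, and several details are assumptions made only so the example runs: \( X(k) \) is taken to be the DFT of the mean-removed \( f_0 \) trajectory divided by \( N \), \( f_n \) is taken to be the DFT bin spacing (frame rate divided by \( N \)), and the weights \( w_i \) are taken to be \( \exp(-\gamma\,|\dot f_{0,i}|) \) normalized to sum to 1.

```python
import numpy as np


def mean_f0(f0):
    """Geometric mean of a per-frame f0 trajectory in Hz."""
    f0 = np.asarray(f0, dtype=float)
    # exp(mean(log x)) equals the N-th root of the product, without overflow.
    return float(np.exp(np.mean(np.log(f0))))


def perceived_pitch(f0, frame_rate, gamma=0.05):
    """Weighted mean of f0 that favours frames where f0 changes slowly.

    ASSUMPTION: w_i = exp(-gamma * |df0/dt|), normalized to sum to 1.
    The default gamma is a hypothetical value chosen for illustration.
    """
    f0 = np.asarray(f0, dtype=float)
    rate_of_change = np.gradient(f0) * frame_rate  # Hz per second
    weights = np.exp(-gamma * np.abs(rate_of_change))
    weights /= weights.sum()
    return float(np.sum(weights * f0))


def vibrato(f0, frame_rate):
    """Return (depth E, rate R) from the strongest spectral peak of f0.

    ASSUMPTION: X(k) is the DFT of the mean-removed trajectory divided
    by N, and f_n is the bin spacing frame_rate / N.
    """
    f0 = np.asarray(f0, dtype=float)
    n = len(f0)
    magnitude = np.abs(np.fft.rfft(f0 - f0.mean())) / n
    k = int(np.argmax(magnitude[1:])) + 1  # skip the DC bin
    depth = 2.0 * magnitude[k]             # E = 2 * max_k |X(k)|
    rate = k * frame_rate / n              # R = f_n * argmax_k |X(k)|
    return float(depth), float(rate)


def mean_power(power):
    """Arithmetic mean of per-frame power values."""
    return float(np.mean(np.asarray(power, dtype=float)))


# Synthetic note: 440 Hz centre, 5 Hz vibrato amplitude at 6 Hz,
# sampled at 100 f0 frames per second for one second.
frame_rate = 100.0
t = np.arange(100) / frame_rate
f0_track = 440.0 + 5.0 * np.sin(2.0 * np.pi * 6.0 * t)
depth_hz, rate_hz = vibrato(f0_track, frame_rate)
```

For this synthetic trajectory the vibrato frequency falls exactly on a DFT bin, so the recovered depth and rate match the 5 Hz amplitude and 6 Hz modulation used to build it. Real \( f_0 \) tracks are rarely bin-aligned, so a practical estimator would likely need windowing or peak interpolation.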