Abstract: Topological Data Analysis (TDA) has been successfully used for various tasks in signal/image processing, from visualization to supervised/unsupervised classification. Often, topological characteristics are obtained from persistent homology theory. The standard TDA pipeline starts from the raw signal data or a representation of it, builds a multiscale topological structure on top of the data using a pre-specified filtration, and finally computes the topological signature to be further exploited. The most commonly used topological signature is a persistence diagram (or a transformation of it). Current research discusses the consequences of the many ways to exploit topological signatures and, much less often, the choice of the filtration, but to the best of our knowledge the choice of the representation of a signal has not yet been the subject of any study. This paper attempts to provide some answers to the latter problem. To this end, we collected real audio data and built a comparative study to assess the quality of the discriminant information of the topological signatures extracted from three different representation spaces. Each audio signal is represented as i) an embedding of the observed data in a higher-dimensional space using Takens' representation, ii) a spectrogram viewed as a surface in a 3D ambient space, and iii) the set of the spectrogram's zeros. From vowel audio recordings, we use topological signatures for three prediction problems: speaker gender, vowel type, and individual. We show that a topologically-augmented random forest improves the Out-of-Bag (OOB) error over one based solely on Mel-Frequency Cepstral Coefficients (MFCC) for the last two problems. Our results also suggest that the topological information extracted from different signal representations is complementary, and that the spectrogram's zeros offer the best improvement for gender prediction.
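
As an illustration of the first representation space, the delay embedding behind Takens' representation can be sketched as follows. This is a minimal sketch and not the authors' implementation: the function name `takens_embedding`, the parameters `dimension` and `delay`, and the sine wave standing in for a vowel recording are all assumptions made for the example.

```python
import numpy as np

def takens_embedding(signal, dimension=3, delay=1):
    """Map a 1-D signal to delay vectors
    (x[t], x[t + delay], ..., x[t + (dimension - 1) * delay]).

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    x = np.asarray(signal, dtype=float)
    n_points = x.size - (dimension - 1) * delay
    if n_points <= 0:
        raise ValueError("signal too short for this dimension and delay")
    # Column j holds the signal shifted by j * delay samples.
    columns = [x[j * delay : j * delay + n_points] for j in range(dimension)]
    return np.stack(columns, axis=1)

# Toy stand-in for a vowel recording: a sine wave over two periods.
t = np.linspace(0.0, 4.0 * np.pi, 200)
cloud = takens_embedding(np.sin(t), dimension=2, delay=25)
```

Each row of `cloud` is one point of the embedded point cloud. For a periodic signal such as this one the points trace a closed loop, which is the kind of structure a persistence diagram computed on the cloud is meant to detect.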

Learning An Invariant Speech Representation

Adaptive Temporal Encoding Leads to a Background-Insensitive Cortical Representation of Speech

An invariant convolution model and its Variational Bayesian Approximation approach via Students-t priors for acoustic imaging in colored noises

Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation

Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

Toward Domain-Invariant Speech Recognition via Large Scale Training

BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

Contemporary issues in the fight against blood doping in sport.

Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction

An Attribute-Aligned Strategy for Learning Speech Representation

Invariant Representations in Deep Learning for Optoacoustic Imaging

A Generic Self-Supervised Framework of Learning Invariant Discriminative Features

Formation Of An Auditory Map For Invariant Perception Of Vowel Sounds: Listening To A Variety Of Speakers To Make Unified Vowel Representation

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Adversarial Learning of Raw Speech Features for Domain Invariant Speech Recognition

Topological data analysis of human vowels: Persistent homologies across representation spaces

Audio-Visual Model Distillation Using Acoustic Images

Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables