Abstract: Soundscape ecologists aim to study the acoustic characteristics of an area that reflect natural processes [Schafer, 1977]. These sounds can be interpreted as biological (biophony), geophysical (geophony), and human-produced (anthrophony) [Pijanowski et al., 2011]. A common task is to use sounds to identify species based on the frequency content of a given signal. This signal can be converted into a spectrogram, which enables image-based analysis to automate the identification of species. Based on the promising results of deep learning methods, such as Convolutional Neural Networks (CNNs), in image classification, here we propose the use of a pre-trained VGG16 CNN architecture to identify two nocturnal avian species, namely Antrostomus rufus and Megascops choliba, commonly encountered in Brazilian forests. Monitoring the abundance of these species is important to ecologists to develop conservation programmes, detect environmental disturbances and assess the impact of human action. Specialists recorded sounds in 16-bit wave files at a sampling rate of 44 kHz and classified the presence of these species. With the classified wave files, we created additional classes to visualise the performance of the VGG16 CNN architecture for detecting both species. We ended up with six categories, each containing 60 seconds of audio, covering combinations of species vocalisations and background-only sounds. We produced spectrograms in three forms: using all three RGB channels, using a single channel (grey-scale), and applying histogram equalisation to the grey-scale images. We compared system performance on histogram-equalised and unmodified images. Histogram equalisation improves contrast, and therefore visibility to a human observer; investigating its effect on the performance of the CNN was a feature of this study.
Moreover, to show the practical application of our work, we created a synthetic 51-minute audio file that contains more noise than vocalisations of the two species (a scenario commonly encountered in field surveys). Our results showed that the trained VGG16 CNN reached, after 8000 epochs, a training accuracy of 100% for all three approaches. The test accuracy was 80.64%, 75.26%, and 67.74% for the RGB, grey-scale, and histogram-equalised approaches, respectively. The method's accuracy on the synthetic 51-minute audio file was 92.15%. This accuracy level reveals the potential of CNN architectures for automating species detection and identification by sound using passive monitoring. Our results suggest that representing the spectrogram as a colour image generalises better than grey-scale and histogram-equalised images. This study might inform the development of future avian monitoring programmes based on passive sound recording, which can significantly increase sample size without increasing cost.
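The grey-scale and histogram-equalised spectrogram variants described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the window length, hop size, sample rate, synthetic test tone, and all function names are assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=256):
    """Log-magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, frames)
    return 20.0 * np.log10(magnitude + 1e-10)          # small offset avoids log(0)

def to_grey(spec_db):
    """Linearly rescale a spectrogram to 8-bit grey levels (0..255)."""
    lo, hi = spec_db.min(), spec_db.max()
    return np.round(255.0 * (spec_db - lo) / (hi - lo)).astype(np.uint8)

def equalise_histogram(grey):
    """Histogram equalisation: remap grey levels through the normalised CDF."""
    hist = np.bincount(grey.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    span = cdf[-1] - cdf_min
    if span == 0:                      # constant image: nothing to equalise
        return grey.copy()
    lut = np.clip(np.round(255.0 * (cdf - cdf_min) / span), 0, 255).astype(np.uint8)
    return lut[grey]

# Hypothetical input: one second of a 2 kHz tone in noise (assumed 44.1 kHz rate).
sample_rate = 44100
rng = np.random.default_rng(0)
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 2000 * t) + 0.1 * rng.standard_normal(sample_rate)

grey = to_grey(log_spectrogram(audio))
equalised = equalise_histogram(grey)
```

Equalisation spreads the grey levels so that the cumulative distribution becomes roughly linear, which raises contrast for a viewer. The test accuracies reported above suggest this remapping does not necessarily help the classifier.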

An Initial study on Birdsong Re-synthesis Using Neural Vocoders

Bioacoustic classification of avian calls from raw sound waveforms with an open-source deep learning architecture

AVN: A Deep Learning Approach for the Analysis of Birdsong

In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

Virtual Vocalization Stimuli for Investigating Neural Representations of Species-Specific Vocalizations.

Animal speech and singing synthesis model based on So-VITS-SVC

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

A machine vision system for avian song classification with CNN’s

Bidirectional Generative Adversarial Representation Learning for Natural Stimulus Synthesis

Bird song comparison using deep learning trained from avian perceptual judgments

A Synthetic Corpus Generation Method for Neural Vocoder Training

A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules.

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Global birdsong embeddings enable superior transfer learning for bioacoustic classification

Hierarchical RNNs for Waveform-Level Speech Synthesis

Speech denoising by parametric resynthesis

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

Evaluation of the Speech Resynthesis Capabilities of the VoicePrivacy Challenge Baseline B1

Active Bird2Vec: Towards End-to-End Bird Sound Monitoring with Transformers

An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures