Mel-frequency cepstral coefficients outperform embeddings from pre-trained convolutional neural networks under noisy conditions for discrimination tasks of individual gibbons

Mohamed Walid Lakdari,Abdul Hamid Ahmad,Sarab Sethi,Gabriel A. Bohn,Dena J. Clink
DOI: https://doi.org/10.1016/j.ecoinf.2023.102457
IF: 5.1
2024-01-06
Ecological Informatics
Abstract: Passive acoustic monitoring – an approach that utilizes autonomous acoustic recording units – allows for non-invasive monitoring of individuals, assuming that it is possible to acoustically distinguish individuals. However, identifying effective analytical approaches for individual identification remains a challenge. Our study investigates how the use of different feature representations impacts our ability to distinguish between individual female Northern grey gibbons (Hylobates funereus). We broadcast pre-recorded calls from twelve gibbon females and re-recorded the calls at varying distances (directly under the tree to ~400 m away) using autonomous recording units. We evaluated the effectiveness of using different automated feature extraction approaches to classify gibbon calls: Mel-frequency cepstral coefficients (MFCCs), embeddings from three pre-trained neural networks (BirdNET, VGGish, and Wav2Vec2), and four commonly used acoustic indices. We used a supervised classification approach (random forest) to classify calls to the respective female and compared two unsupervised clustering approaches (affinity propagation clustering and hierarchical density-based spatial clustering) to evaluate which features were most effective for distinguishing female calls without using class labels. We used MFCCs as a baseline as previous work has shown they can be used to distinguish high-quality calls of individual gibbon females. Human annotators could only identify calls in spectrograms from recordings 10 dB), while the remaining features did not perform well. Contrary to our expectations, we found that MFCCs outperformed all other features for the unsupervised clustering tasks at closer distances and none of the features performed well at farther distances. The ability to acoustically discriminate animals under noisy conditions and from low signal-to-noise ratio calls has important implications for monitoring populations of endangered animals, such as gibbons.
Focusing only on high signal-to-noise ratio calls for individual discrimination may not be possible for rare sounds, and future work should focus on developing effective approaches of feature extraction that can perform well across noisy, real-world conditions with a limited number of training samples.
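To make the MFCC baseline named in the abstract concrete, the following is a minimal NumPy-only sketch of how a call could be turned into a fixed-length MFCC feature vector. The abstract does not give the extraction settings, so the frame length, hop size, number of mel filters, the 400-2000 Hz band, and the mean/standard-deviation summary are all illustrative assumptions, and the function names (`mel_filterbank`, `mfcc`, `call_features`) are hypothetical rather than the authors' code.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(signal, sample_rate, n_mfcc=12, n_filters=20, n_fft=1024, hop=512,
         f_min=400.0, f_max=2000.0):
    """Per-frame MFCCs, shape (n_frames, n_mfcc). All defaults are assumptions."""
    # Slice the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum -> mel-band energies -> log compression.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bank = mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max)
    log_mel = np.log(power @ bank.T + 1e-10)
    # Unnormalized DCT-II across the mel bands, keeping the first n_mfcc terms.
    n = np.arange(n_filters)
    k = np.arange(n_mfcc)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct_basis.T

def call_features(signal, sample_rate, **kwargs):
    """Summarize one call as the mean and standard deviation of each MFCC."""
    coeffs = mfcc(signal, sample_rate, **kwargs)
    return np.concatenate([coeffs.mean(axis=0), coeffs.std(axis=0)])

# Two synthetic tones stand in for calls from two individuals (not real data).
sample_rate = 16000
t = np.arange(2 * sample_rate) / sample_rate
call_a = np.sin(2 * np.pi * 600.0 * t)
call_b = np.sin(2 * np.pi * 1200.0 * t)
features_a = call_features(call_a, sample_rate)
features_b = call_features(call_b, sample_rate)
```

Each call yields one vector of length `2 * n_mfcc`. Stacking such vectors into a matrix gives the kind of input a random forest or a clustering algorithm expects, which is where a comparison against neural-network embeddings or acoustic indices would begin.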
ecology