Abstract: When recorders are used to survey acoustically conspicuous species, identifying calls of the target species in recordings is essential for estimating density and abundance. We investigate how well deep neural networks identify vocalisations consisting of phrases of varying lengths, each containing a variable number of syllables. We use recordings of Hainan gibbon (Nomascus hainanus) vocalisations to develop and test the methods. We propose two methods for exploiting the two-level structure of such data. The first combines convolutional neural network (CNN) models with a hidden Markov model (HMM), and the second uses a convolutional recurrent neural network (CRNN). Both models learn acoustic features of syllables via a CNN, and learn the temporal correlations of syllables into phrases via either an HMM or a recurrent network. We compare their performance to that of the commonly used CNNs LeNet and VGGNet, and of a support vector machine (SVM). We also propose a dynamic programming method to evaluate how well phrases are predicted. This is useful for evaluating performance when vocalisations are labelled by phrases, not syllables. Our methods perform substantially better than the commonly used methods when applied to the gibbon acoustic recordings. The CRNN has an F-score of 90% on phrase prediction, which is 18% higher than the best of the SVM, LeNet and VGGNet methods. HMM post-processing raised the F-score of these last three methods to as much as 87%. The number of phrases is overestimated by the CNNs and the SVM, leading to error rates between 49% and 54%. With HMM post-processing, these error rates can be reduced to as low as 0.4%. Similarly, the error rate of the CRNN's predictions is no more than 0.5%. CRNNs are better at identifying phrases of varying lengths composed of a varying number of syllables than simpler CNN or SVM models. We find a CRNN model to be best at this task, with a CNN combined with an HMM performing almost as well.
We recommend that these kinds of models be used for species whose vocalisations are structured into phrases of varying lengths.
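
To illustrate why HMM post-processing can reduce overestimation of the number of phrases, the following is a minimal sketch and not the authors' implementation. It assumes a classifier has already produced one call probability per time window, treats those probabilities as emission scores for a two-state HMM (background and call), and decodes the most likely state path with the Viterbi algorithm. The function names, the `stay` self-transition probability of 0.9, the uniform initial state distribution, and the example probabilities are all hypothetical choices made for illustration.

```python
import math


def viterbi_smooth(call_probs, stay=0.9, eps=1e-6):
    """Return the most likely 0/1 state path (0 = background, 1 = call).

    call_probs: per-window call probabilities from any classifier (assumed input).
    stay: assumed probability of remaining in the same state between windows.
    """
    if not call_probs:
        return []
    # Log transition scores: log_trans[previous_state][next_state].
    log_trans = [
        [math.log(stay), math.log(1 - stay)],
        [math.log(1 - stay), math.log(stay)],
    ]

    def log_emit(p):
        # Clip to avoid log(0); use p as the score for "call", 1 - p for "background".
        p = min(max(p, eps), 1 - eps)
        return [math.log(1 - p), math.log(p)]

    # Uniform initial state distribution (an assumption).
    score = [math.log(0.5) + e for e in log_emit(call_probs[0])]
    back = []
    for p in call_probs[1:]:
        emit = log_emit(p)
        new_score, pointers = [], []
        for s in (0, 1):
            best_prev = max((0, 1), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + emit[s])
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def count_phrases(path):
    """Count runs of consecutive 1s, each run being one predicted phrase."""
    return sum(1 for i, s in enumerate(path) if s == 1 and (i == 0 or path[i - 1] == 0))


# Hypothetical per-window call probabilities with a brief dip and an isolated spike.
probs = [0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.85, 0.1, 0.1, 0.7, 0.1, 0.1]
thresholded = [1 if p > 0.5 else 0 for p in probs]
smoothed = viterbi_smooth(probs)
print(count_phrases(thresholded), count_phrases(smoothed))
```

On this made-up input, thresholding each window independently at 0.5 splits the signal into three phrases, whereas the smoothed path bridges the brief dip and discards the isolated spike, leaving one phrase. How strongly the path is smoothed depends on `stay`, which in practice would be estimated from labelled data rather than fixed by hand.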

Automated Call Detection for Acoustic Surveys with Structured Calls of Varying Length
