Abstract: We consider the problem of detecting, isolating and classifying elephant calls in continuously recorded audio. Such automatic call characterisation can assist conservation efforts and inform environmental management strategies. In contrast to previous work in which call detection was performed at a segment level, we perform call detection at a frame level, which implicitly also allows call endpointing, the isolation of a call in a longer recording. For experimentation, we employ two annotated datasets, one containing Asian and the other African elephant vocalisations. We evaluate several shallow and deep classifier models, and show that the current best performance can be improved by using an audio spectrogram transformer (AST), a neural architecture which has not been used for this purpose before, and which we have configured in a novel sequence-to-sequence manner. We also show that using transfer learning by pre-training leads to further improvements both in terms of computational complexity and performance. Finally, we consider sub-call classification using an accepted taxonomy of call types, a task which has not previously been considered. We show that also in this case the transformer architectures provide the best performance. Our best classifiers achieve an average precision (AP) of 0.962 for framewise binary call classification, and an area under the receiver operating characteristic curve (AUC) of 0.957 and 0.979 for call classification with 5 classes and sub-call classification with 7 classes respectively. All of these represent either new benchmarks (sub-call classification) or improvements on previously best systems. We conclude that a fully-automated elephant call detection and sub-call classification system is within reach. Such a system would provide valuable information on the behaviour and state of elephant herds for the purposes of conservation and management.
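To make the link between frame-level detection and endpointing concrete, here is a minimal sketch of how per-frame call probabilities could be turned into call segments. It is not the authors' implementation: the function name `frames_to_segments`, the decision threshold of 0.5 and the frame hop are assumptions chosen for illustration.

```python
from typing import List, Tuple


def frames_to_segments(
    probs: List[float],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
    hop_s: float = 0.01,     # assumed frame hop in seconds, not from the paper
) -> List[Tuple[float, float]]:
    """Turn per-frame call probabilities into (start_s, end_s) call segments."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the call currently being tracked
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # call onset
        elif not active and start is not None:
            segments.append((start * hop_s, i * hop_s))  # call offset
            start = None
    if start is not None:  # the call runs to the end of the recording
        segments.append((start * hop_s, len(probs) * hop_s))
    return segments


# Hypothetical classifier output for nine frames, with a 0.5 s hop.
example_probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.6, 0.9]
example_segments = frames_to_segments(example_probs, threshold=0.5, hop_s=0.5)
# example_segments == [(1.0, 2.5), (3.5, 4.5)]
```

A practical system would probably also smooth the frame decisions or merge segments separated by very short gaps; the sketch omits this.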

What problem does this paper attempt to address?

This paper addresses the automatic detection, isolation and classification of elephant calls in continuously recorded audio. Specifically, the authors pursue the following goals using deep learning architectures:

1. **Automatic Detection and Classification of Elephant Calls**: Manual and semi-automatic methods are time-consuming, while an automated system can offer a more efficient and accurate solution. This helps to protect elephant populations and can also provide data to support environmental management strategies.
2. **Frame-level Detection and Endpoint Detection**: Unlike previous work, which detected calls at the segment level, this paper performs detection at the frame level. This not only detects the presence of a call but also determines where it starts and ends (endpoint detection), which makes it possible to isolate specific calls from longer recordings.
3. **Sub-call Classification**: In addition to basic call-type classification, the paper makes the first attempt to classify sub-calls, based on an accepted call-type taxonomy. Sub-call classification can provide more detailed insights into elephant behavior.
4. **Improving Performance and Efficiency**: By introducing new neural architectures (such as the Audio Spectrogram Transformer, AST) and using transfer learning techniques (including pre-training and self-supervised learning), the authors aim to improve model performance while reducing computational complexity.

### Main Contributions

- **Using a New Neural Architecture**: In particular, the Transformer Encoder architecture, which uses learnable embedding tokens to distinguish different call types, is applied to elephant vocalisations for the first time.
- **Cross-domain and Intra-domain Transfer Learning**: Transfer learning is applied to the elephant call detection and classification tasks for the first time.
- **Explicit Call Segmentation**: Explicit call segmentation is performed in elephant call research for the first time.
- **Sub-call Classification**: The first attempt to classify sub-calls is made, as a first step towards automated elephant behavior classification.

### Experimental Setup

To evaluate these methods, the authors use two annotated datasets, one containing Asian elephant vocalisations and the other containing African elephant vocalisations. They compare multiple shallow and deep classification models and show that the new architectures and methods perform best on several evaluation metrics.

### Summary

With these contributions, the authors show that it is feasible to build a fully automated system capable of detecting and classifying elephant calls. Such a system can provide valuable information for protecting and managing elephant populations and help researchers better understand the behavior and state of elephants.
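The framewise detection result in the abstract is reported as average precision (AP). As a rough illustration of what that metric measures, the sketch below computes AP from per-frame labels and scores using the standard ranking definition, AP = Σₙ (Rₙ − Rₙ₋₁) Pₙ. The paper's own evaluation code is not shown in the source, so the function name `average_precision` and the example data are assumptions, and tied scores are handled only in input order.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP over frames ranked by descending score (labels: 1 = call, 0 = no call)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("AP is undefined when there are no positive frames")
    # Rank frames from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank; recall rises by 1 / n_pos at each positive.
            precision_sum += true_positives / rank
    return precision_sum / n_pos


# Hypothetical labels and scores for four frames.
example_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
# Precision is 1/1 at the first call frame and 2/3 at the second, so AP = 5/6.
```

An AP of 1.0 means every call frame is ranked above every non-call frame, so the reported 0.962 indicates a ranking close to that ideal.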

Learning to rumble: Automated elephant call classification, detection and endpointing using deep architectures
