Abstract: This paper demonstrates automatic recognition of vocalizations of four common bird species (herring gull [Larus argentatus], blue jay [Cyanocitta cristata], Canada goose [Branta canadensis], and American crow [Corvus brachyrhynchos]) using an algorithm that extracts frequency track sets using track properties of importance and harmonic correlation. The main result is that a complex harmonic vocalization is rendered into a set of related tracks that is easily applied to statistical models of the actual bird voc…
“…None of the parameters in Table II have been systematically optimized, other than the neural network thresholds. The merging of widely separated harmonic components into a single "transient" event could be improved further (e.g., Heller and Pinezich, 2008). There are also indications that each site should have its own dedicated neural network, trained with data from that site, instead of applying a common network trained with data from all sites.…”
Section: Discussion
confidence: 97%
“…There is a growing literature on using image processing and other techniques to extract features from frequency-modulated signals (Sturtivant and Datta, 1995; Datta and Sturtivant, 2002; Lammers et al., 2003; Oswald et al., 2007; Roch et al., 2007; Asitha et al., 2008; Madhusudhana et al., 2008; Top, 2009), but this area is still an active research topic (Lampert and O'Keefe, 2010a,b), and methods for handling sidebands remain underdeveloped (Heller and Pinezich, 2008).…”
Section: Image Processing and Feature Extraction
An automated procedure has been developed for detecting and localizing frequency-modulated bowhead whale sounds in the presence of seismic airgun surveys. The procedure was applied to four years of data, collected from over 30 directional autonomous recording packages deployed over a 280 km span of continental shelf in the Alaskan Beaufort Sea. The procedure has six sequential stages that begin by extracting 25-element feature vectors from spectrograms of potential call candidates. Two cascaded neural networks then classify some feature vectors as bowhead calls, and the procedure then matches calls between recorders to triangulate locations. To train the networks, manual analysts flagged 219 471 bowhead call examples from 2008 and 2009. Manual analyses were also used to identify 1.17 million transient signals that were not whale calls. The network output thresholds were adjusted to reject 20% of whale calls in the training data. Validation runs using 2007 and 2010 data found that the procedure missed 30%-40% of manually detected calls. Furthermore, 20%-40% of the sounds flagged as calls are not present in the manual analyses; however, these extra detections incorporate legitimate whale calls overlooked by human analysts. Both manual and automated methods produce similar spatial and temporal call distributions.
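The cascaded classification step described in the abstract above can be sketched as follows. This is an illustrative reconstruction only: the network architectures, weights, and thresholds below are hypothetical stand-ins, not the trained networks from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_score(x, W1, b1, W2, b2):
    # One hidden layer with tanh units; sigmoid output in (0, 1).
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

def cascade_classify(feature_vec, stages, thresholds):
    # A candidate is accepted as a call only if every stage's score
    # clears that stage's threshold; otherwise it is rejected early.
    for (W1, b1, W2, b2), thr in zip(stages, thresholds):
        if mlp_score(feature_vec, W1, b1, W2, b2) < thr:
            return False
    return True

def random_stage(n_in=25, n_hidden=8):
    # Random weights stand in for a trained network (illustration only).
    return (rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden),
            rng.normal(size=n_hidden), rng.normal())

stages = [random_stage(), random_stage()]   # two cascaded networks
x = rng.normal(size=25)                     # a 25-element feature vector
print(cascade_classify(x, stages, thresholds=[0.5, 0.5]))
```

Raising the per-stage thresholds trades missed calls for fewer false detections, which is the adjustment the study describes (thresholds set to reject 20% of training-data calls).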
“…The use of features extracted from the entire frequency range, such as conventional Mel-frequency cepstral coefficients, which were used in a number of studies, e.g., [1], is problematic in the presence of other concurrent vocalisations or noise. The use of a set of statistical descriptors to characterise a detected segment, as employed in [1], [2], [5], may not capture well more complex types of vocalisation elements and may be susceptible to inaccuracies in segmentation. In the case of tonal bird vocalisations, the use of sinusoidal detection for segmentation also offers a natural way of representing the segment as a temporal sequence of the frequencies of the detected sinusoid, which we refer to as a frequency track.…”
Section: Introduction
confidence: 99%
“…Typically, the first stage of an automatic system is to parse the acoustic signal into isolated spectro-temporal segments. This is often performed using energy-based thresholding that requires an estimate of the noise level, e.g., [1], or by decomposition into sinusoidal components [1], [2], [3], [4]. A variety of approaches to feature representation of the spectro-temporal segments and their modelling have been explored.…”
Abstract: This paper presents an automatic system for detection of bird species in field recordings. A sinusoidal detection algorithm is employed to segment the acoustic scene into isolated spectro-temporal segments. Each segment is represented as a temporal sequence of frequencies of the detected sinusoid, referred to as a frequency track. Each bird species is represented by a set of hidden Markov models (HMMs), each HMM modelling an individual type of bird vocalisation element. These HMMs are obtained in an unsupervised manner. The detection is based on a likelihood ratio of the test utterance against the target bird species and a non-target background model. We explore the selection of a cohort for modelling the background model, z-norm and t-norm score normalisation techniques, and score compensation to deal with outlier data. Experiments are performed using over 40 hours of audio field recordings from 48 bird species, plus an additional 16 hours of field recordings as impostor trials. Evaluations are performed using detection error trade-off plots. An equal error rate of 5% is achieved when impostor trials are non-target bird species vocalisations, and 1.2% when using field recordings which do not contain bird vocalisations.
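The likelihood-ratio detection and z-norm score normalisation mentioned in the abstract above can be sketched as follows. All numbers are invented for illustration; in the actual system the log-likelihoods would come from the per-species HMMs and the background model.

```python
import numpy as np

def llr_score(logp_target, logp_background):
    # Detection statistic: log-likelihood ratio of the test segment
    # under the target-species model vs. the background model.
    return logp_target - logp_background

def z_norm(score, impostor_scores):
    # Z-norm: standardise a raw score using the mean and standard
    # deviation of scores that impostor (non-target) trials obtain
    # against the same target model.
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (score - mu) / sigma

# Hypothetical log-likelihoods and impostor score distribution.
impostors = np.array([-4.0, -3.5, -5.0, -4.5])
s = llr_score(logp_target=-10.0, logp_background=-12.0)  # LLR = 2.0
print(z_norm(s, impostors))
```

Thresholding the normalised score (rather than the raw LLR) makes a single operating point more comparable across target models, which is the usual motivation for z-norm/t-norm in detection systems.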
“…Spectral peak tracks (SPT) (also called frequency tracks) have been explored for studying birds [Heller and Pinezich, 2008, Jancovic and Kokuer, 2015] and whales [Roch et al., 2011]. In this chapter, the spectral peak track is used to represent the trace of a frog advertisement call, because frogs that are genetically related share more similar advertisement calls than distantly related ones [Gingras and Fitch, 2013].…”
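A minimal sketch of extracting a spectral peak track from a spectrogram, assuming a single dominant tonal component per frame. The toy spectrogram, frequency grid, and power threshold are invented for illustration and are not taken from any of the cited studies.

```python
import numpy as np

def extract_track(spectrogram, freqs, power_thresh):
    # For each time frame, keep the frequency of the strongest bin if it
    # exceeds the power threshold; the resulting sequence of frequencies
    # is the spectral peak track (NaN where no peak clears the threshold).
    track = np.full(spectrogram.shape[1], np.nan)
    for t in range(spectrogram.shape[1]):
        k = np.argmax(spectrogram[:, t])
        if spectrogram[k, t] >= power_thresh:
            track[t] = freqs[k]
    return track

# Toy spectrogram: 4 frequency bins x 3 frames, a rising tone in frames 0-1.
S = np.array([[9.0, 1.0, 0.1],
              [1.0, 8.0, 0.2],
              [0.5, 0.5, 0.1],
              [0.2, 0.3, 0.1]])
freqs = np.array([1000.0, 2000.0, 3000.0, 4000.0])
print(extract_track(S, freqs, power_thresh=1.0))
```

Real systems add track linking (joining peaks across frames with frequency-continuity constraints) and minimum-duration rules, but the per-frame peak picking above is the common starting point.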
Frogs play an important role in Earth's ecosystem, but declines in their populations have been observed at many locations around the world. Monitoring frog activity can assist conservation efforts and improve our understanding of their interactions with the environment and other organisms. Traditional observation methods require ecologists and volunteers to visit the field, which greatly limits the scale of acoustic data collection. Recent advances in acoustic sensors provide a novel method to survey vocalising animals such as frogs. Once sensors are installed in the field, acoustic data can be collected automatically at large spatial and temporal scales. Each acoustic sensor can generate several gigabytes of compressed audio data per day, so large volumes of raw acoustic data are collected. To gain insights about frogs and the environment, classifying frog species in acoustic data is necessary. However, manual species identification is infeasible given the amount of collected data, so enabling automated species classification has become very important. Previous studies on signal processing and machine learning for frog call classification often have two limitations: (1) the recordings used to train and test classifiers are trophy recordings with a high signal-to-noise ratio (SNR ≥ 15 dB); (2) each individual recording is assumed to contain only one frog species. However, field recordings typically have a low SNR (< 15 dB) and contain multiple simultaneously vocalising frog species.
This thesis aims to address these two limitations and makes the following contributions. (1) Develop a combined feature set from temporal, perceptual, and cepstral domains for improving the state-of-the-art performance of frog call classification using trophy recordings (Chapter 3). (2) Propose a novel cepstral feature via adaptive frequency-scaled wavelet packet decomposition (WPD) to improve the cepstral feature's anti-noise ability for frog call classification using both trophy and field recordings (Chapter 4). (3) Design a novel multiple-instance multiple-label (MIML) framework to classify multiple simultaneously vocalising frog species in field recordings (Chapter 5). (4) Design a novel multiple-label (ML) framework to increase the robustness of classification results when classifying multiple simultaneously vocalising frog species in field recordings (Chapter 6). Our proposed approaches achieve promising classification results compared with previous studies. With the developed classification techniques, ecosystems can be surveyed at large spatial and temporal scales, which can help ecologists better understand them.