Few-Shot Sound Event Detection

Wang, Yu; Bryan, Nicholas J.; Bello, Juan Pablo

doi:10.1109/icassp40776.2020.9054708

Cited by 53 publications

(32 citation statements)

References 15 publications

“…However, in a few approaches, acoustic data have been utilized to generate few-shot learning models. In the work of Wang [40], a metric-based few-shot learning method has been proposed for AER due to high cost of listening to a mixed sound to label each location of an event. Another few-shot learning approach based on the acoustic data has used an Attentional Graph Neural Network [41].…”

Section: Related Work (mentioning)

confidence: 99%

Gerçek Ortamlarda Artımlı Öğrenme ile Gerçek Zamanlı İşitsel Sahne Analizi [Real-Time Auditory Scene Analysis with Incremental Learning in Real Environments]

Bayram

İnce

2020

European Journal of Science and Technology

Continual learning for scene analysis is a continuous process to incrementally learn distinct events, actions, and even noise models from past experiences using different sensory modalities. In this paper, an Auditory Scene Analysis (ASA) approach based on a continual learning system is developed to incrementally learn the acoustic events in a dynamically-changing domestic environment. The events being salient sound sources are localized by a Sound Source Localization (SSL) method to robustly process the signals of the localized sound source in the domestic scene where multiple sources can co-exist. For real-time ASA, audio patterns are segmented from the acoustic signal stream of the localized source for extraction of the audio features, and construction of a feature set for each pattern. The continual learning is employed via a time-series algorithm, Hidden Markov Model (HMM), on these feature sets from acoustic signals stemming from the sources. The learning process is investigated by conducting a variety of experiments to evaluate the performance of Unknown Event Detection (UED), Acoustic Event Recognition (AER), and continual learning using a Hierarchical HMM algorithm. The Hierarchical HMM consists of two layers: 1) a lower layer in which AER is performed using an HMM for each event and the event-wise likelihood thresholds; and 2) an upper layer in which UED is achieved by one HMM with a suspicion threshold through the audio features with their proto symbols stemming from the lower layer HMMs. We verified the effectiveness of the proposed system capable of continual learning, AER and UED in terms of False-Positive Rates, True-Positive Rates, recognition accuracy and computational time to meet the demands in a learning task of multiple events in real-time. The effectiveness of the AER system has been verified with high accuracy, and a short retraining time in real-time ASA having nine different sounds.
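The lower-layer decision described in this abstract (one HMM per event, each with its own likelihood threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions: the per-event log-likelihoods are taken as already computed by some HMM scorer, the names `recognize_event`, `log_likelihoods` and `thresholds` are hypothetical, and returning `None` for "no event accepted" is a simplification that stands in for, but does not implement, the paper's upper-layer HMM with its suspicion threshold.

```python
from typing import Dict, Optional


def recognize_event(log_likelihoods: Dict[str, float],
                    thresholds: Dict[str, float]) -> Optional[str]:
    """Pick the best-scoring event whose score clears its own threshold.

    log_likelihoods: hypothetical per-event HMM log-likelihoods for one
        segmented audio pattern (assumed to be computed elsewhere).
    thresholds: hypothetical per-event acceptance thresholds.
    Returns the event label, or None if no event model accepts the pattern
    (a candidate "unknown event" in this simplified sketch).
    """
    # Keep only the events whose model accepts the pattern.
    accepted = {event: score
                for event, score in log_likelihoods.items()
                if score >= thresholds[event]}
    if not accepted:
        return None
    # Among the accepted events, return the most likely one.
    return max(accepted, key=accepted.get)


scores = {"door": -120.0, "kettle": -95.0}
limits = {"door": -110.0, "kettle": -100.0}
label = recognize_event(scores, limits)
```

With the example values, only the "kettle" model clears its threshold, so that label is returned; if every score fell below its threshold, the function would return `None` and the pattern would be handed to whatever unknown-event logic the system uses.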

“…As an alternative, few-shot learning [9][10][11][12][13][14] has been applied to audio classification [15][16][17] and sound event detection [18,19], where a classifier must learn to recognize a novel class from very few examples. Among different few-shot learning methods, metric-based prototypical networks [12] have been shown to yield excellent performance for audio [15,18,19]. However, few-shot methods do not maintain the training data class vocabulary, requiring manual labeling of all novel classes for deployment, which can be overwhelming for large vocabulary problems.…”

Section: Introduction (mentioning)

confidence: 99%

Few-Shot Continual Learning for Audio Classification

Wang

Bryan

Cartwright

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Supervised learning for audio classification typically imposes a fixed class vocabulary, which can be limiting for real-world applications where the target class vocabulary is not known a priori or changes dynamically. In this work, we introduce a few-shot continual learning framework for audio classification, where we can continuously expand a trained base classifier to recognize novel classes based on only few labeled data at inference time. This enables fast and interactive model updates by end-users with minimal human effort. To do so, we leverage the dynamic few-shot learning technique and adapt it to a challenging multi-label audio classification scenario. We incorporate a recent state-of-the-art audio feature extraction model as a backbone and perform a comparative analysis of our approach on two popular audio datasets (ESC-50 and AudioSet). We conduct an in-depth evaluation to illustrate the complexities of the problem and show that, while there is still room for improvement, our method outperforms three baselines on novel class detection while maintaining its performance on base classes.
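The citation statement for this entry refers to metric-based prototypical networks. As a rough illustration of the general idea only (not the code of either paper), the sketch below averages the support embeddings of each class into a prototype and assigns a query to the nearest prototype by squared Euclidean distance. The embeddings are assumed to come from some already trained audio encoder, and the names `class_prototypes`, `squared_distance` and `classify` are hypothetical.

```python
from typing import Dict, List, Sequence


def class_prototypes(
        support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Average the few labeled support embeddings of each class."""
    prototypes = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        prototypes[label] = [
            sum(e[i] for e in embeddings) / len(embeddings)
            for i in range(dim)
        ]
    return prototypes


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query: Sequence[float],
             prototypes: Dict[str, List[float]]) -> str:
    """Assign the query embedding to the class with the nearest prototype."""
    return min(prototypes,
               key=lambda label: squared_distance(query, prototypes[label]))


# Toy 2-D embeddings (made up for illustration), two shots per class.
support_set = {
    "dog": [[0.0, 0.0], [2.0, 0.0]],
    "siren": [[10.0, 10.0], [12.0, 10.0]],
}
protos = class_prototypes(support_set)
prediction = classify([1.5, 0.5], protos)
```

Because a new class only needs a new prototype, adding it at inference time requires a handful of labeled examples and no retraining of the encoder, which is the property the quoted passage highlights.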

“…There are some conventional methods of SED in the case of imbalanced data [16,17,18]. For example, Chen and Jin have proposed a method of detecting rare sound events using data augmentation [16].…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, Chen and Jin have proposed a method of detecting rare sound events using data augmentation [16]. Wang et al have proposed a method of few-shot SED based on metric learning [17]. Dinkel and Yu have proposed a method of SED using a temporal subsampling method within a CRNN [18].…”

Section: Introduction (mentioning)

confidence: 99%

Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance

Imoto

Mishima

Arai

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In many methods of sound event detection (SED), a segmented time frame is regarded as one data sample to model training. The durations of sound events greatly depend on the sound event class, e.g., the sound event "fan" has a long duration, whereas the sound event "mouse clicking" is instantaneous. Thus, the difference in the duration between sound event classes results in a serious data imbalance in SED. Moreover, most sound events tend to occur occasionally; therefore, there are many more inactive time frames of sound events than active frames. This also causes a severe data imbalance between active and inactive frames. In this paper, we investigate the impact of sound duration and inactive frames on SED performance by introducing four loss functions, such as simple reweighting loss, inverse frequency loss, asymmetric focal loss, and focal batch Tversky loss. Then, we provide insights into how we tackle this imbalance problem.
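To make the imbalance problem in this abstract concrete, here is a minimal sketch of a generic inverse-frequency reweighting of binary cross-entropy. It is an assumption-laden illustration, not necessarily the formulation used in the paper: each class's active-frame term is scaled by `total_frames / active_frames`, and the names `inverse_frequency_weights` and `weighted_bce` are hypothetical.

```python
import math
from typing import List


def inverse_frequency_weights(active_counts: List[int],
                              total_frames: int) -> List[float]:
    """Weight each class by the inverse of its active-frame frequency.

    max(count, 1) avoids division by zero for classes with no active frames.
    """
    return [total_frames / max(count, 1) for count in active_counts]


def weighted_bce(targets: List[int], probs: List[float],
                 pos_weights: List[float], eps: float = 1e-7) -> float:
    """Binary cross-entropy for one frame, averaged over classes.

    Only the active (target == 1) term is scaled by the class weight.
    """
    loss = 0.0
    for target, prob, weight in zip(targets, probs, pos_weights):
        prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
        loss += -(weight * target * math.log(prob)
                  + (1 - target) * math.log(1.0 - prob))
    return loss / len(targets)


# Hypothetical counts: a long event active in 900 of 1000 frames and a
# short event active in only 10 of them.
weights = inverse_frequency_weights([900, 10], 1000)
miss_rare = weighted_bce([0, 1], [0.1, 0.1], weights)
miss_common = weighted_bce([1, 0], [0.1, 0.1], weights)
```

Under these made-up counts, missing an active frame of the rare short event is penalized far more heavily than missing one of the long event, which is the effect such reweighting is meant to achieve.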