Content-Based Representations of Audio Using Siamese Neural Networks

Manocha, Pranay; Badlani, Rohan; Kumar, Anurag; Shah, Ankit; Elizalde, Benjamin; Raj, Bhiksha

doi:10.1109/icassp.2018.8461524

Cited by 31 publications

(22 citation statements)

References 19 publications

“…In the inference step, by using the feature space, the input is classified to one of the target classes. Regarding deep metric learning in acoustic signal processing [17][18][19][20][21][22][23][24][25][26], we summarize an overview of tasks, loss functions, and sampling strategies, in Table 1. Manocha et al have worked on sound clip search task and used contrastive loss, where a feature space is learned based on a pair type that consists of the same class or different classes and a feature space distance [19].…”

Section: Related Work (mentioning)

confidence: 99%

Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events

Shimada, Koyama, Inoue, 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Few-shot learning systems for sound event recognition are gaining interest because they require only a few examples to adapt to new target classes without fine-tuning. However, such systems have only been applied to chunks of sounds for classification or verification. In this paper, we aim to achieve few-shot detection of rare sound events from long query sequences that contain not only the target events but also other events and background noise. It is therefore necessary to prevent false positive reactions to both the other events and the background noise. We propose metric learning with a background noise class for few-shot detection. The contribution is the explicit inclusion of background noise as an independent class, a suitable loss function that emphasizes this additional class, and a corresponding sampling strategy that assists training. This provides a feature space where the event classes and the background noise class are sufficiently separated. Evaluations on few-shot detection tasks, using DCASE 2017 task 2 and ESC-50, show that our proposed method outperforms metric learning that does not consider the background noise class. The few-shot detection performance is also comparable to that of the DCASE 2017 task 2 baseline system, which requires a huge amount of annotated audio data.
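The citation statement above describes the contrastive loss as depending on the pair type (same class or different classes) and on the distance in the feature space. A minimal sketch of one common formulation follows, assuming Euclidean distance and a margin hyperparameter. The function names, the 0.5 scaling, and the margin value are illustrative assumptions and are not taken from the cited papers.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Contrastive loss for a single pair of embeddings.

    same_class: True if both clips share a label (pair is pulled together),
                False otherwise (pair is pushed apart until `margin`).
    margin:     hypothetical hyperparameter, not specified in the source.
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        # Similar pairs are penalized by their squared distance.
        return 0.5 * d ** 2
    # Dissimilar pairs are penalized only while closer than the margin.
    return 0.5 * max(0.0, margin - d) ** 2
```

Under this formulation a dissimilar pair that is already farther apart than the margin contributes zero loss, so training effort concentrates on pairs that are still confusable.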

“…The superiority of combining the deep learning approach with fingerprinting is demonstrated in [37], where a Siamese Neural Network (SNN) produced semantic representations of audio signals. SNNs have been applied to sound classification in [37], [38], and [39] and have the advantage over the canonical CNN in their ability to generalize.…”

Section: Introduction (mentioning)

confidence: 99%

Animal Sound Classification Using Dissimilarity Spaces

Nanni, Brahnam, Lumini, et al. 2020

Preprint

The classifier system proposed in this work combines the dissimilarity spaces produced by a set of Siamese neural networks (SNNs) designed using 4 different backbones, with different clustering techniques for training SVMs for automated animal audio classification. The system is evaluated on two animal audio datasets: one for cat and another for bird vocalizations. Different clustering methods reduce the spectrograms in the dataset to a set of centroids that generate (in both a supervised and unsupervised fashion) the dissimilarity space through the Siamese networks. In addition to feeding the SNNs with spectrograms, additional experiments process the spectrograms using the Heterogeneous Auto-Similarities of Characteristics. Once the dissimilarity spaces are computed, a vector space representation of each pattern is generated and then used to train a Support Vector Machine (SVM) that classifies a spectrogram by its dissimilarity vector. Results demonstrate that the proposed approach performs competitively (without ad-hoc optimization of the clustering methods) on both animal vocalization datasets. To further demonstrate the power of the proposed system, the best stand-alone approach is also evaluated on the challenging ESC-50 dataset for environmental sound classification. The MATLAB code used in this study is available at https://github.com/LorisNanni.
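The dissimilarity-space idea in the abstract above can be illustrated with a small sketch: each pattern is represented by its vector of dissimilarities to a fixed set of centroids, and that vector is what a downstream classifier such as an SVM would consume. In the abstract the dissimilarity comes from trained Siamese networks and the centroids from clustering. Here plain Euclidean distance and hand-picked centroids stand in for both, purely as assumptions, and all names are hypothetical. The authors' own code is in MATLAB, so this Python version is not their implementation.

```python
import math


def euclidean(a, b):
    """Stand-in dissimilarity; the abstract uses Siamese network outputs instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_vector(pattern, centroids, dissimilarity=euclidean):
    """Project one pattern into the dissimilarity space defined by `centroids`.

    The i-th feature is the dissimilarity between `pattern` and centroid i.
    """
    return [dissimilarity(pattern, c) for c in centroids]


# Hypothetical 2-D "spectrogram features" and centroids, for illustration only.
centroids = [[0.0, 0.0], [3.0, 4.0]]
patterns = [[0.0, 0.0], [3.0, 0.0]]

# Each row is the feature vector that would be passed to the classifier.
features = [dissimilarity_vector(p, centroids) for p in patterns]
```

The dimensionality of the resulting feature vector equals the number of centroids, which is why the choice of clustering method matters for the classifier that follows.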

“…Previous studies mainly focus on sound event detection (SED), investigating which sound events happen in an audio recording and when they occur [2]. In contrast, Sound event retrieval (SER) is retrieving audio recordings that are similar to a given input audio query [3,4]. This similarity can be based on acoustic and/or semantic (symbolic) characterization [5].…”

Section: Introduction (mentioning)

confidence: 99%

“…Previous audio retrieval research mainly focuses on either acoustic similarity or categorization [4][5][6][7][8]. We neither simply use SED techniques to classify sound and retrieve the label, nor simply adopt audio fingerprinting to measure similarity.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Label Sound Event Retrieval Using A Deep Learning-Based Siamese Structure With A Pairwise Presence Matrix

Fan, Nichols, Tompkins, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Realistic recordings of soundscapes often have multiple sound events co-occurring, such as car horns, engines, and human voices. Sound event retrieval is a type of content-based search that aims to find audio samples similar to an audio query based on their acoustic or semantic content. State-of-the-art sound event retrieval models have focused on single-label audio recordings, with only one sound event occurring, rather than on multi-label audio recordings (i.e., multiple sound events occur in one recording). To address this latter problem, we propose different deep learning architectures with a Siamese structure and a Pairwise Presence Matrix. The networks are trained and evaluated using the SONYC-UST dataset, which contains both single- and multi-label soundscape recordings. The performance results show the effectiveness of our proposed model.
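Content-based retrieval as described above amounts to ranking stored recordings by how close they are to the query in some learned representation. The sketch below shows only that ranking step, assuming embeddings have already been computed and that Euclidean distance is the similarity measure. The clip identifiers, embedding values, and function names are hypothetical, and the Pairwise Presence Matrix from the abstract is not modeled here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_emb, database, k=2):
    """Return the ids of the k database clips closest to the query embedding.

    database: dict mapping a clip id to its (precomputed) embedding.
    """
    ranked = sorted(database.items(),
                    key=lambda item: euclidean(query_emb, item[1]))
    return [clip_id for clip_id, _ in ranked[:k]]


# Hypothetical precomputed embeddings, for illustration only.
database = {
    "siren": [0.9, 0.1],
    "dog_bark": [0.1, 0.9],
    "car_horn": [0.8, 0.3],
}

top = retrieve([1.0, 0.0], database, k=2)  # ["siren", "car_horn"]
```

A linear scan like this is adequate for small collections; larger ones would typically need an approximate nearest-neighbor index, which is outside the scope of this sketch.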
