Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data

Kong, Qiuqiang; Xu, Yong; Sobieraj, Iwona; Wang, Wenwu; Plumbley, Mark D.

doi:10.1109/taslp.2019.2895254

Cited by 97 publications

(91 citation statements)

References 44 publications

“…[Table 4 rows; columns: feature name, window (ms), shift (ms), dimension] Lms128: 40, 20, 128; Fbank64: 25, 10, 64. Regarding the AED features, 128 dimensional logmelspectra [14,15,16,17] were extracted. Here, a single frame is extracted every 20ms with a window size of 40ms (Table 4).…”

Section: Feature name, Window, Shift, Dimension (mentioning)

confidence: 99%

Audio Caption: Listen and Tell

Dinkel

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Increasing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying audio sounds, but also summarizing the relationship between different audio events. Comparable research such as image caption has been conducted, yet the audio field is still quite barren. This paper introduces a manually-annotated dataset for audio caption. The purpose is to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and image. The whole dataset is labelled in Mandarin and we also include translated English annotations. A baseline encoder-decoder model is provided for both English and Mandarin. Similar BLEU scores are derived for both languages: our model can generate understandable and data-related captions based on the dataset.
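The framing parameters quoted in the citation statement for this entry (a 40 ms window with a 20 ms shift for Lms128, and 25 ms with 10 ms for Fbank64) determine how many feature frames a clip yields. The following is a minimal sketch of that arithmetic only. The 16 kHz sample rate and the 10-second clip length are assumptions not given in the excerpt, and the function names are hypothetical.

```python
def frame_params(sample_rate_hz, window_ms, shift_ms):
    """Convert window and shift durations in milliseconds to sample counts."""
    win = int(round(sample_rate_hz * window_ms / 1000))
    hop = int(round(sample_rate_hz * shift_ms / 1000))
    return win, hop


def num_frames(num_samples, win, hop):
    """Count full frames that fit in the signal, with no padding."""
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


SAMPLE_RATE_HZ = 16000               # assumed, not stated in the excerpt
CLIP_SAMPLES = 10 * SAMPLE_RATE_HZ   # assumed 10-second clip

# Lms128: 40 ms window, 20 ms shift
lms_win, lms_hop = frame_params(SAMPLE_RATE_HZ, 40, 20)
lms_frames = num_frames(CLIP_SAMPLES, lms_win, lms_hop)

# Fbank64: 25 ms window, 10 ms shift
fb_win, fb_hop = frame_params(SAMPLE_RATE_HZ, 25, 10)
fb_frames = num_frames(CLIP_SAMPLES, fb_win, fb_hop)

print(lms_win, lms_hop, lms_frames)  # 640 320 499
print(fb_win, fb_hop, fb_frames)     # 400 160 998
```

Under these assumptions the Lms128 configuration produces roughly half as many frames per clip as Fbank64, each with twice as many feature dimensions (128 versus 64).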

“…To evaluate the results of audio tagging, we follow the metrics proposed in [17]. The results are evaluated by precision, recall, F-score [19] and Area Under Curve (AUC) [20].…”

Section: Dataset, Experiments Setup and Evaluation Metrics (mentioning)

confidence: 99%

Audio Tagging With Connectionist Temporal Classification Model Using Sequentially Labelled Data

Hou

Kong

2019

Lecture Notes in Electrical Engineering

Self Cite

Audio tagging aims to predict one or several labels in an audio clip. Many previous works use weakly labelled data (WLD) for audio tagging, where only presence or absence of sound events is known, but the order of sound events is unknown. To use the order information of sound events, we propose sequential labelled data (SLD), where both the presence or absence and the order information of sound events are known. To utilize SLD in audio tagging, we propose a Convolutional Recurrent Neural Network followed by a Connectionist Temporal Classification (CRNN-CTC) objective function to map from an audio clip spectrogram to SLD. Experiments show that CRNN-CTC obtains an Area Under Curve (AUC) score of 0.986 in audio tagging, outperforming the baseline CRNN of 0.908 and 0.815 with Max Pooling and Average Pooling, respectively. In addition, we show CRNN-CTC has the ability to predict the order of sound events in an audio clip.
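The citation statement for this entry lists precision, recall, F-score and Area Under Curve (AUC) as the audio tagging metrics. As a rough illustration, the sketch below computes them for a single tag using the standard binary definitions. The example labels, scores, the 0.5 decision threshold and the function names are all hypothetical, and the excerpt does not say how the cited work averages across classes, so this should not be read as its exact protocol.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is quadratic in the number of
    clips, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical clip-level ground truth and model scores for one tag.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(auc, 3))
```

Precision, recall and F-score depend on the chosen threshold, whereas AUC is computed from the raw scores and is threshold-free.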

“…Humans have an inherent ability to match sound events based on acoustic similarity and the relationship between them [1]. Previous studies mainly focus on sound event detection (SED), investigating which sound events happen in an audio recording and when they occur [2]. In contrast, Sound event retrieval (SER) is retrieving audio recordings that are similar to a given input audio query [3,4].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Label Sound Event Retrieval Using A Deep Learning-Based Siamese Structure With A Pairwise Presence Matrix

Fan

Nichols

Tompkins

et al.

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Realistic recordings of soundscapes often have multiple sound events co-occurring, such as car horns, engine and human voices. Sound event retrieval is a type of content-based search aiming at finding audio samples, similar to an audio query based on their acoustic or semantic content. State of the art sound event retrieval models have focused on single-label audio recordings, with only one sound event occurring, rather than on multi-label audio recordings (i.e., multiple sound events occur in one recording). To address this latter problem, we propose different Deep Learning architectures with a Siamese structure and a Pairwise Presence Matrix. The networks are trained and evaluated using the SONYC-UST dataset containing both single- and multi-label soundscape recordings. The performance results show the effectiveness of our proposed model.
