Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection

Li, Yanxiong; Xue, Zhang; Jin, Hai; Li, Xianku; Wang, Qin; He, Qianhua; Huang, Qian

doi:10.1007/s11042-016-4332-z

Cited by 21 publications

(4 citation statements)

References 35 publications

“…Chiba et al [61] recently proposed multi-stream attention-based BiLSTM network for speech emotion recognition. Li et al [62] extracted deep features by training multi-stream hierarchical DNN for acoustic event detection. Moreover Sheikh et al [29] found that settings like context frame size optimized for one stuttering class are not good for other stuttering types.…”

Section: Multi-contextual StutterNet (mentioning)

confidence: 99%

Advancing Stuttering Detection via Data Augmentation, Class-Balanced Loss and Multi-Contextual Deep Learning

Sheikh, Sahidullah, Hirsch et al. 2023

IEEE J. Biomed. Health Inform.

Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances (interjections) and core behaviors (blocks, repetitions, and prolongations), and is caused by the failure of speech sensorimotors. Due to its complex nature, stuttering detection (SD) is a difficult task. If detected at an early stage, it could facilitate speech therapists to observe and rectify the speech patterns of persons who stutter (PWS). The stuttered speech of PWS is usually available in limited amounts and is highly imbalanced. To this end, we address the class imbalance problem in the SD domain via a multibranching (MB) scheme and by weighting the contribution of classes in the overall loss function, resulting in a huge improvement in stuttering classes on the SEP-28k dataset over the baseline (StutterNet). To tackle data scarcity, we investigate the effectiveness of data augmentation on top of a multi-branched training scheme. The augmented training outperforms the MB StutterNet (clean) by a relative margin of 4.18% in macro F1-score (F1). In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech, resulting in an overall improvement of 4.48% in F1 over the single context based MB StutterNet. Finally, we have shown that applying data augmentation in the cross-corpora scenario can improve the overall SD performance by a relative margin of 13.23% in F1 over the clean training.

“…Because of the development of artificial intelligence (AI) and deep learning (DL), deep features of audio data are widely studied and used in many audio-based applications, such as acoustic scene classification [77][78], audio/video analysis [79], and speaker recognition [80], since 2010.…”

Section: Evolution Of Audio Features (mentioning)

confidence: 99%

A Large-Scale UAV Audio Dataset and Audio-Based UAV Classification Using CNN

Wang, Chu et al. 2022

2022 Sixth IEEE International Conference on Robotic Computing (IRC)

“…In addition, the common hand-crafted features used for acoustic scene classification (or clustering) include the logarithm mel-band energy, mel frequency cepstral coefficient (MFCC), spectral flux, spectrogram, Gabor filterbank, cochleograms, I-vector, histogram of gradients features [12]-[15], the histogram of gradients of time-frequency representations (HGTR) [14], hash features [16], and local binary patterns [17], [18]. In recent years, some transformed features using matrix factorization [19], [20] and deep neural network [6], [11], [21], are used to address the lack of flexibility of hand-crafted features. Hand-crafted or shallow features did not effectively represent the property differences among various classes of acoustic scenes, and thus their performance was inferior to that of deep transformed features learned by deep neural networks, such as convolutional neural network (CNN) [11], [22]-[25].…”

Section: Introduction (mentioning)

confidence: 99%

Domestic Activities Clustering From Audio Recordings Using Convolutional Capsule Autoencoder Network

Lin, Huang et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self-citation

Recent efforts have been made on domestic activities classification from audio recordings, especially the works submitted to the challenge of DCASE (Detection and Classification of Acoustic Scenes and Events) since 2018. In contrast, few studies were done on domestic activities clustering, which is a newly emerging problem. Domestic activities clustering from audio recordings aims at merging audio clips which belong to the same class of domestic activity into a single cluster. Domestic activities clustering is an effective way for unsupervised estimation of daily activities performed in home environment. In this study, we propose a method for domestic activities clustering using a convolutional capsule autoencoder network (CCAN). In the method, the deep embeddings are learned by the autoencoder in the CCAN, while the deep embeddings which belong to the same class of domestic activities are merged into a single cluster by a clustering layer in the CCAN. Evaluated on a public dataset adopted in DCASE-2018 Task 5, the results show that the proposed method outperforms state-of-the-art methods in terms of the metrics of clustering accuracy and normalized mutual information.
