2017
DOI: 10.1007/978-3-319-53753-5_4
Exploring Convolutional Neural Networks for Voice Activity Detection

Cited by 12 publications (9 citation statements) · References 22 publications
“…We compared to the following standard untrained techniques: advanced front-end [2] (ETSI), ITU-T G.729 Annex B [3] (G729B), Likelihood Ratio test [4] (Sohn) and Long Term Spectral Divergence [5] (LTSD). Recent trained methods, producing a different model for each noise level, are also compared: a GMM trained on Mel Frequency Cepstral Coefficients [6] (GMM-MFCC), Complete Linkage Clustering [7] (CLC), Multilayer Perceptrons (Segbroeck, Neurogram) [28,29] and a Convolutional Neural Network [8] (CNN). Table 1 compares models trained on all noise levels and evaluated on the Detection Cost Function, DCF = 0.25·FAR + 0.75·MR, where MR and FAR are the miss and false-alarm rates, extending the table of results from Neurogram [29] and showing substantial improvements.…”
Section: Results
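The cost function quoted above is a fixed weighted sum of the two error rates. A minimal sketch (the function name is ours):

```python
def detection_cost(miss_rate, false_alarm_rate):
    """Detection Cost Function from the excerpt above:
    DCF = 0.25*FAR + 0.75*MR (misses weighted three times as heavily)."""
    return 0.25 * false_alarm_rate + 0.75 * miss_rate

# A VAD missing 10% of speech frames with a 20% false-alarm rate:
# detection_cost(0.10, 0.20) -> 0.125
```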
“…The results shown in Fig. 1 compare our SNNs, under the protocol previously mentioned, evaluated with the Half Total Error Rate: HTER = 0.5·MR + 0.5·FAR. The proposed one-hidden-layer model, SNN h1, scores 4.6%, 12.4% and 25.2% for low, medium and high noise levels, respectively, being able to compete with trained models and outperforming the standard untrained models. The second model proposed, with two hidden layers and different membrane time constants, SNN h2, scores 6.7%, 12.0% and 22.7%, achieving slightly better performance in medium- and high-noise scenarios but losing performance at low noise levels.…”
Section: Results
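HTER, unlike the DCF, weights misses and false alarms equally. A hedged sketch computing MR, FAR and HTER from per-frame labels (the function name and toy data are ours):

```python
def hter(reference, decisions):
    """Half Total Error Rate from the excerpt above: HTER = 0.5*MR + 0.5*FAR.
    reference/decisions are parallel per-frame booleans (True = speech)."""
    speech = [d for r, d in zip(reference, decisions) if r]
    nonspeech = [d for r, d in zip(reference, decisions) if not r]
    mr = sum(1 for d in speech if not d) / len(speech)      # missed speech frames
    far = sum(1 for d in nonspeech if d) / len(nonspeech)   # false alarms
    return 0.5 * mr + 0.5 * far

# 4 speech + 4 non-speech frames, one error of each kind:
# hter([True]*4 + [False]*4,
#      [True, True, True, False, False, False, False, True]) -> 0.25
```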
“…It is important to remark that Fourier analysis has been applied together with ANNs for decades and across different applications, such as brain electroencephalogram processing [9], cardiovascular analysis [10], speech processing [11], seismic analysis [12] and face spoofing detection [8]. Besides, it has also been explored with deep learning, for example in the speech processing field [13], [14], where an audio signal is converted into a 2D spectrogram image.…”
Section: Introduction
“…The tasks of detecting human speech and the speaker position are referred to, respectively, as Voice Activity Detection (VAD) and Speaker LOCalization (SLOC). Both receive much attention in the research community, finding applications in audio surveillance, human hearing modelling, speech enhancement, human and robot interaction and so forth (Hughes & Mierle, 2013; Silva, Stuchi, Violato & Cuozzo, 2017; Tachioka, Narita, Watanabe & Le Roux, 2014; Taghizadeh, Garner, Bourlard, Abutalebi & Asaei, 2011). In the literature, speaker detection and its localization are generally treated as two separate problems.…”
Section: Introduction
“…For the multi-room domestic scenario, numerous DNN-based VADs are discussed in (Ferroni, Bonfigli, Principi, Squartini & Piazza, 2015), where a Deep Belief Network achieves the highest accuracy compared to a Multi Layer Perceptron (MLP) and a Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. Furthermore, convolutional neural networks (CNNs) directly process the audio spectrogram in (Silva, Stuchi, Violato & Cuozzo, 2017), outperforming the state-of-the-art VADs. Similarly, the magnitude of the audio spectrogram is employed as an input feature in (Tashev & Mirsamadi, 2016), where an MLP-based VAD is proposed.…”
Section: Introduction
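The CNN- and MLP-based VADs cited above all consume a time-frequency image of the audio. A minimal sketch of the magnitude-spectrogram feature they describe; the frame and hop sizes are our assumptions (roughly 25 ms and 10 ms at 16 kHz), not values from the cited papers:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=400, hop=160):
    """Frame the signal, apply a Hann window, and take |FFT| per frame,
    yielding the 2D image a spectrogram-based VAD would consume.
    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# e.g. 1 s of audio at 16 kHz -> a (98, 201) time-frequency image:
spec = magnitude_spectrogram(np.random.default_rng(0).standard_normal(16000))
```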