2017
DOI: 10.1007/978-3-319-53753-5_4
Exploring Convolutional Neural Networks for Voice Activity Detection

Cited by 12 publications (9 citation statements) · References 22 publications
“…We compared to the following standard untrained techniques: advanced front-end [2] (ETSI), ITU-T G.729 Annex B [3] (G729B), Likelihood Ratio test [4] (Sohn) and Long Term Spectral Divergence [5] (LTSD). Recent trained methods, producing a different model for each noise level, are also compared: a GMM trained on Mel Frequency Cepstral Coefficients [6] (GMM-MFCC), Complete Linkage Clustering [7] (CLC), Multilayer Perceptrons (Segbroeck, Neurogram) [28,29] and a Convolutional Neural Network [8] (CNN). Table 1 compares models trained on all noise levels and evaluated on the Detection Cost Function, DCF = 0.25·FAR + 0.75·MR, where MR and FAR are the miss and false-alarm rates, extending the table of results from Neurogram [29] and showing substantial improvements.…”
Section: Results
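The cost function quoted above is a fixed weighted sum of the two error rates. A minimal sketch (the function name is ours):

```python
def detection_cost(miss_rate, false_alarm_rate):
    """Detection Cost Function from the excerpt above:
    DCF = 0.25*FAR + 0.75*MR (misses weighted three times as heavily)."""
    return 0.25 * false_alarm_rate + 0.75 * miss_rate

# A VAD missing 10% of speech frames with a 20% false-alarm rate:
# detection_cost(0.10, 0.20) -> 0.125
```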
“…The results shown in Fig. 1 compare our SNNs, under the protocol previously mentioned, evaluated with the Half Total Error Rate: HTER = 0.5·MR + 0.5·FAR. The proposed one-hidden-layer model, SNN h1, scores 4.6%, 12.4% and 25.2% for low, medium and high noise levels, respectively, being able to compete with trained models and outperforming the standard untrained models. The second model proposed, with two hidden layers and different membrane time constants, SNN h2, scores 6.7%, 12.0% and 22.7%, achieving slightly better performance in medium- and high-noise scenarios but losing performance at low noise levels.…”
Section: Results
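HTER, unlike the DCF, weights misses and false alarms equally. A hedged sketch computing MR, FAR and HTER from per-frame labels (the function name and toy data are ours):

```python
def hter(reference, decisions):
    """Half Total Error Rate from the excerpt above: HTER = 0.5*MR + 0.5*FAR.
    reference/decisions are parallel per-frame booleans (True = speech)."""
    speech = [d for r, d in zip(reference, decisions) if r]
    nonspeech = [d for r, d in zip(reference, decisions) if not r]
    mr = sum(1 for d in speech if not d) / len(speech)      # missed speech frames
    far = sum(1 for d in nonspeech if d) / len(nonspeech)   # false alarms
    return 0.5 * mr + 0.5 * far

# 4 speech + 4 non-speech frames, one error of each kind:
# hter([True]*4 + [False]*4,
#      [True, True, True, False, False, False, False, True]) -> 0.25
```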
“…It is important to remark that Fourier analysis has been applied together with ANNs for decades and across different applications, such as brain electroencephalogram processing [9], cardiovascular analysis [10], speech processing [11], seismic analysis [12] and face spoofing detection [8]. Besides, it has also been explored with deep learning, for example in the speech processing field [13], [14], where an audio signal is converted into a 2D spectrogram image.…”
Section: Introduction
“…The tasks of detecting human speech and the speaker position are referred to, respectively, as Voice Activity Detection (VAD) and Speaker LOCalization (SLOC). Both receive much attention in the research community, finding applications in audio surveillance, human hearing modelling, speech enhancement, human and robot interaction and so forth (Hughes & Mierle, 2013; Silva, Stuchi, Violato & Cuozzo, 2017; Tachioka, Narita, Watanabe & Le Roux, 2014; Taghizadeh, Garner, Bourlard, Abutalebi & Asaei, 2011). In the literature, speaker detection and its localization are generally treated as two separate problems.…”
Section: Introduction
“…For the multi-room domestic scenario, numerous DNN-based VADs are discussed in (Ferroni, Bonfigli, Principi, Squartini & Piazza, 2015), where a Deep Belief Network achieves the highest accuracy compared to a Multi Layer Perceptron (MLP) and a Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. Furthermore, convolutional neural networks (CNNs) directly process the audio spectrogram in (Silva, Stuchi, Violato & Cuozzo, 2017), outperforming the state-of-the-art VADs. Similarly, the magnitude of the audio spectrogram is employed as an input feature in (Tashev & Mirsamadi, 2016), where an MLP-based VAD is proposed.…”
Section: Introduction
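The CNN- and MLP-based VADs cited above all consume a time-frequency image of the audio. A minimal sketch of the magnitude-spectrogram feature they describe; the frame and hop sizes are our assumptions (roughly 25 ms and 10 ms at 16 kHz), not values from the cited papers:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=400, hop=160):
    """Frame the signal, apply a Hann window, and take |FFT| per frame,
    yielding the 2D image a spectrogram-based VAD would consume.
    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# e.g. 1 s of audio at 16 kHz -> a (98, 201) time-frequency image:
spec = magnitude_spectrogram(np.random.default_rng(0).standard_normal(16000))
```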