Interspeech 2018
DOI: 10.21437/interspeech.2018-1151

Joint Learning Using Denoising Variational Autoencoders for Voice Activity Detection

Abstract: Voice activity detection (VAD) is a challenging task in very low signal-to-noise ratio (SNR) environments. To address this issue, a promising approach is to map noisy speech features to corresponding clean features and to perform VAD using the generated clean features. This can be implemented by concatenating a speech enhancement (SE) and a VAD network, whose parameters are jointly updated. In this paper, we propose denoising variational autoencoder-based (DVAE) speech enhancement in the joint learning framework…
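The abstract describes a DVAE speech-enhancement front end (mapping noisy features to clean ones) concatenated with a VAD classifier, with both parts updated under one objective. The following is a minimal numpy sketch of that joint forward pass and loss under stated assumptions: single-hidden-layer MLPs, a squared-error reconstruction term, and the loss weight `alpha` are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's configuration).
D_FEAT, D_HID, D_LAT = 40, 128, 16   # feature, hidden, and latent sizes

def init(d_in, d_out):
    """Small random weight matrix and zero bias."""
    return rng.normal(0, 0.1, (d_in, d_out)), np.zeros(d_out)

# DVAE speech-enhancement front end: noisy features -> latent -> "clean" features.
W_enc, b_enc = init(D_FEAT, D_HID)
W_mu,  b_mu  = init(D_HID, D_LAT)
W_lv,  b_lv  = init(D_HID, D_LAT)
W_dec, b_dec = init(D_LAT, D_FEAT)

# VAD back end: enhanced features -> speech/non-speech probability.
W_vad, b_vad = init(D_FEAT, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(noisy):
    """One joint forward pass for a batch of noisy frames."""
    h      = np.tanh(noisy @ W_enc + b_enc)                             # encoder
    mu     = h @ W_mu + b_mu                                            # latent mean
    logvar = h @ W_lv + b_lv                                            # latent log-variance
    z      = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterization
    clean_hat = z @ W_dec + b_dec                                       # decoder = enhanced features
    p_speech  = sigmoid(clean_hat @ W_vad + b_vad).squeeze(-1)          # VAD on enhanced features
    return clean_hat, mu, logvar, p_speech

def joint_loss(noisy, clean, labels, alpha=1.0):
    """Joint objective: DVAE term (reconstruct clean + KL) plus VAD cross-entropy."""
    clean_hat, mu, logvar, p = forward(noisy)
    recon = np.mean(np.sum((clean_hat - clean) ** 2, axis=1))
    kl    = -0.5 * np.mean(np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1))
    bce   = -np.mean(labels * np.log(p + 1e-8) + (1 - labels) * np.log(1 - p + 1e-8))
    return (recon + kl) + alpha * bce

# Toy batch of 8 frames: parallel noisy/clean features and frame-level labels.
noisy  = rng.normal(size=(8, D_FEAT))
clean  = rng.normal(size=(8, D_FEAT))
labels = rng.integers(0, 2, size=8)
print("joint loss:", joint_loss(noisy, clean, labels))
```

In this setup the gradient of the VAD term also reaches the enhancement parameters, which is the point of joint learning: the SE front end is pushed toward features that are useful for detection, not only toward low reconstruction error.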

Cited by 31 publications (23 citation statements)
References 14 publications
“…The second one is to use forced-alignment automatic speech recognition (ASR) [57]. The third one is to apply unsupervised VAD to the clean data and use the results as the labels of the corresponding noisy data [58]–[60]. Note that the last method requires parallel clean and noisy data.…”
Section: B. Self-Adaptive VAD
confidence: 99%
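The third labeling strategy in the statement above, running an unsupervised VAD on the clean signal and transferring the frame labels to the parallel noisy signal, can be sketched as follows. This is a minimal energy-threshold illustration; the 16 kHz framing, the threshold, and the helper name frame_energy_labels are assumptions for illustration, and the cited works use their own unsupervised detectors.

```python
import numpy as np

def frame_energy_labels(clean, frame_len=400, hop=160, threshold_db=-40.0):
    """Unsupervised energy-threshold VAD on a clean waveform.

    Returns one speech/non-speech label per frame; the labels are then
    reused for the time-aligned noisy version of the same utterance.
    """
    n_frames = 1 + max(0, (len(clean) - frame_len) // hop)
    labels = np.zeros(n_frames, dtype=np.int64)
    peak = np.max(np.abs(clean)) + 1e-12
    for i in range(n_frames):
        frame = clean[i * hop : i * hop + frame_len]
        rms_db = 20.0 * np.log10(np.sqrt(np.mean(frame**2)) / peak + 1e-12)
        labels[i] = int(rms_db > threshold_db)
    return labels

# Toy parallel data: a "clean" utterance and the same utterance plus noise.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(8000), rng.normal(0, 0.3, 16000), np.zeros(8000)])
noisy = clean + rng.normal(0, 0.1, clean.shape)

labels = frame_energy_labels(clean)   # labels come from the clean signal...
# ...and are paired frame by frame with features extracted from `noisy`.
print(labels.sum(), "speech frames out of", len(labels))
```

The parallel-data requirement mentioned in the statement is visible here: the frame indexing only carries over because `noisy` is a time-aligned copy of `clean`.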
“…It has been shown that this class imbalance in training can degrade the performance of deep learning-based classifiers in various domains [62]. To address the problem, many VAD studies insert silence at the beginning and end of each utterance to increase the ratio of non-speech frames [58]–[60], [63]. Unlike this heuristic approach, in [64], we proposed to use the focal loss, which was originally designed to address class imbalance in the object detection task.…”
Section: B. Self-Adaptive VAD
confidence: 99%
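The statement above contrasts silence padding with the focal loss as a remedy for the speech/non-speech imbalance. Below is a minimal numpy sketch of the standard binary focal loss applied to frame-level VAD posteriors; the gamma and alpha values are the common defaults from the original object-detection formulation, and the toy frames are illustrative, not the settings of [64].

```python
import numpy as np

def focal_loss(p_speech, labels, gamma=2.0, alpha=0.25, eps=1e-8):
    """Binary focal loss for frame-level VAD.

    Down-weights easy, well-classified frames by the factor (1 - p_t)**gamma,
    so abundant easy non-speech frames contribute less than hard frames.
    """
    p_t = np.where(labels == 1, p_speech, 1.0 - p_speech)   # probability of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)     # optional class weighting
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)))

# Mostly easy non-speech frames plus two speech frames: the focal loss is much
# smaller here than plain cross-entropy on the same predictions.
p = np.array([0.05, 0.02, 0.10, 0.03, 0.95, 0.40])   # predicted speech probabilities
y = np.array([0,    0,    0,    0,    1,    1   ])   # frame labels
ce = float(np.mean(-(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))))
print("cross-entropy:", round(ce, 4), " focal:", round(focal_loss(p, y), 4))
```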
“…For VAD, we use the same data setup as in [11]. To construct the 35-hour training set, the clean training set of the Aurora4 database [25] is used.…”
Section: Experimental Setups for VAD
confidence: 99%
“…Typically, SAD methods extract various features from the waveform that are, for example, related to energy or zero-crossing rate [20,21,39,40], harmonicity and pitch [41][42][43], formant structure [20,24,44,45], degree of stationarity of speech and noise [46][47][48], modulation [49][50][51], or Mel-frequency cepstral coefficients (MFCCs) [24]. Feature extraction is subsequently followed by traditional statistical modeling or, more recently, by deep learning-based classifiers, for example, deep neural networks (DNNs) [52,53], recurrent ones [54,55], or convolutional neural networks (CNNs) [56][57][58], often in conjunction with autoencoders [59]. Further, end-to-end deep learning approaches applied directly to the raw signal have also been proposed [60].…”
Section: Related Work
confidence: 99%
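As a companion to the feature list in the statement above, here is a minimal numpy sketch that computes two of the classic SAD features per frame, short-time log energy and zero-crossing rate; the 25 ms/10 ms framing and the helper name frame_features are illustrative assumptions, and pitch, formant, modulation, or MFCC features would be appended analogously before a DNN/RNN/CNN classifier.

```python
import numpy as np

def frame_features(x, sr=16000, frame_ms=25, hop_ms=10):
    """Per-frame short-time log energy and zero-crossing rate for a waveform x."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    feats = np.zeros((n_frames, 2))
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        feats[i, 0] = np.log(np.sum(frame**2) + 1e-12)              # log energy
        feats[i, 1] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
    return feats

# Toy usage: 1 s of noise-like signal at 16 kHz -> (frames, 2) feature matrix
# that a frame-level classifier would consume.
x = np.random.default_rng(0).normal(size=16000)
print(frame_features(x).shape)
```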