Interspeech 2018
DOI: 10.21437/interspeech.2018-1151

Joint Learning Using Denoising Variational Autoencoders for Voice Activity Detection

Abstract: Voice activity detection (VAD) is a challenging task in very low signal-to-noise ratio (SNR) environments. To address this issue, a promising approach is to map noisy speech features to corresponding clean features and to perform VAD using the generated clean features. This can be implemented by concatenating a speech enhancement (SE) and a VAD network, whose parameters are jointly updated. In this paper, we propose denoising variational autoencoder-based (DVAE) speech enhancement in the joint learning framework…
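The abstract describes a DVAE speech-enhancement front end (mapping noisy features to clean ones) concatenated with a VAD classifier, with both parts updated under one objective. The following is a minimal numpy sketch of that joint forward pass and loss under stated assumptions: single-hidden-layer MLPs, a squared-error reconstruction term, and the loss weight `alpha` are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's configuration).
D_FEAT, D_HID, D_LAT = 40, 128, 16   # feature, hidden, and latent sizes

def init(d_in, d_out):
    """Small random weight matrix and zero bias."""
    return rng.normal(0, 0.1, (d_in, d_out)), np.zeros(d_out)

# DVAE speech-enhancement front end: noisy features -> latent -> "clean" features.
W_enc, b_enc = init(D_FEAT, D_HID)
W_mu,  b_mu  = init(D_HID, D_LAT)
W_lv,  b_lv  = init(D_HID, D_LAT)
W_dec, b_dec = init(D_LAT, D_FEAT)

# VAD back end: enhanced features -> speech/non-speech probability.
W_vad, b_vad = init(D_FEAT, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(noisy):
    """One joint forward pass for a batch of noisy frames."""
    h      = np.tanh(noisy @ W_enc + b_enc)                             # encoder
    mu     = h @ W_mu + b_mu                                            # latent mean
    logvar = h @ W_lv + b_lv                                            # latent log-variance
    z      = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterization
    clean_hat = z @ W_dec + b_dec                                       # decoder = enhanced features
    p_speech  = sigmoid(clean_hat @ W_vad + b_vad).squeeze(-1)          # VAD on enhanced features
    return clean_hat, mu, logvar, p_speech

def joint_loss(noisy, clean, labels, alpha=1.0):
    """Joint objective: DVAE term (reconstruct clean + KL) plus VAD cross-entropy."""
    clean_hat, mu, logvar, p = forward(noisy)
    recon = np.mean(np.sum((clean_hat - clean) ** 2, axis=1))
    kl    = -0.5 * np.mean(np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1))
    bce   = -np.mean(labels * np.log(p + 1e-8) + (1 - labels) * np.log(1 - p + 1e-8))
    return (recon + kl) + alpha * bce

# Toy batch of 8 frames: parallel noisy/clean features and frame-level labels.
noisy  = rng.normal(size=(8, D_FEAT))
clean  = rng.normal(size=(8, D_FEAT))
labels = rng.integers(0, 2, size=8)
print("joint loss:", joint_loss(noisy, clean, labels))
```

In this setup the gradient of the VAD term also reaches the enhancement parameters, which is the point of joint learning: the SE front end is pushed toward features that are useful for detection, not only toward low reconstruction error.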

Cited by 31 publications (23 citation statements)
References 14 publications
“…The second one is to use forced-alignment automatic speech recognition (ASR) [57]. The third one is to apply unsupervised VAD to the clean data and use the results as the labels of the corresponding noisy data [58]–[60]. Note that the last method requires parallel clean and noisy data.…”
Section: B. Self-Adaptive VAD
confidence: 99%
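The third labeling strategy in the statement above, running an unsupervised VAD on the clean signal and transferring the frame labels to the parallel noisy signal, can be sketched as follows. This is a minimal energy-threshold illustration; the 16 kHz framing, the threshold, and the helper name frame_energy_labels are assumptions for illustration, and the cited works use their own unsupervised detectors.

```python
import numpy as np

def frame_energy_labels(clean, frame_len=400, hop=160, threshold_db=-40.0):
    """Unsupervised energy-threshold VAD on a clean waveform.

    Returns one speech/non-speech label per frame; the labels are then
    reused for the time-aligned noisy version of the same utterance.
    """
    n_frames = 1 + max(0, (len(clean) - frame_len) // hop)
    labels = np.zeros(n_frames, dtype=np.int64)
    peak = np.max(np.abs(clean)) + 1e-12
    for i in range(n_frames):
        frame = clean[i * hop : i * hop + frame_len]
        rms_db = 20.0 * np.log10(np.sqrt(np.mean(frame**2)) / peak + 1e-12)
        labels[i] = int(rms_db > threshold_db)
    return labels

# Toy parallel data: a "clean" utterance and the same utterance plus noise.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(8000), rng.normal(0, 0.3, 16000), np.zeros(8000)])
noisy = clean + rng.normal(0, 0.1, clean.shape)

labels = frame_energy_labels(clean)   # labels come from the clean signal...
# ...and are paired frame by frame with features extracted from `noisy`.
print(labels.sum(), "speech frames out of", len(labels))
```

The parallel-data requirement mentioned in the statement is visible here: the frame indexing only carries over because `noisy` is a time-aligned copy of `clean`.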
“…It has been shown that this class imbalance in training can degrade the performance of deep learning-based classifiers in various domains [62]. To address the problem, many VAD studies insert silence at the beginning and end of each utterance to increase the ratio of non-speech frames [58]–[60], [63]. Unlike this heuristic approach, in [64], we proposed to use the focal loss, which was originally designed to address class imbalance in the object detection task.…”
Section: B. Self-Adaptive VAD
confidence: 99%
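The statement above contrasts silence padding with the focal loss as a remedy for the speech/non-speech imbalance. Below is a minimal numpy sketch of the standard binary focal loss applied to frame-level VAD posteriors; the gamma and alpha values are the common defaults from the original object-detection formulation, and the toy frames are illustrative, not the settings of [64].

```python
import numpy as np

def focal_loss(p_speech, labels, gamma=2.0, alpha=0.25, eps=1e-8):
    """Binary focal loss for frame-level VAD.

    Down-weights easy, well-classified frames by the factor (1 - p_t)**gamma,
    so abundant easy non-speech frames contribute less than hard frames.
    """
    p_t = np.where(labels == 1, p_speech, 1.0 - p_speech)   # probability of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)     # optional class weighting
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)))

# Mostly easy non-speech frames plus two speech frames: the focal loss is much
# smaller here than plain cross-entropy on the same predictions.
p = np.array([0.05, 0.02, 0.10, 0.03, 0.95, 0.40])   # predicted speech probabilities
y = np.array([0,    0,    0,    0,    1,    1   ])   # frame labels
ce = float(np.mean(-(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))))
print("cross-entropy:", round(ce, 4), " focal:", round(focal_loss(p, y), 4))
```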
“…For VAD, we use the same data setup as in [11]. To construct the 35-hour training set, the clean training set of the Aurora4 database [25] is used.…”
Section: Experimental Setups for VAD
confidence: 99%
“…Typically, SAD methods extract various features from the waveform that are, for example, related to energy or zero-crossing rate [20,21,39,40], harmonicity and pitch [41][42][43], formant structure [20,24,44,45], degree of stationarity of speech and noise [46][47][48], modulation [49][50][51], or Mel-frequency cepstral coefficients (MFCCs) [24]. Feature extraction is subsequently followed by traditional statistical modeling or, more recently, by deep learning-based classifiers, for example, deep neural networks (DNNs) [52,53], recurrent ones [54,55], or convolutional neural networks (CNNs) [56][57][58], often in conjunction with autoencoders [59]. Further, end-to-end deep learning approaches applied directly to the raw signal have also been proposed [60].…”
Section: Related Work
confidence: 99%
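As a companion to the feature list in the statement above, here is a minimal numpy sketch that computes two of the classic SAD features per frame, short-time log energy and zero-crossing rate; the 25 ms/10 ms framing and the helper name frame_features are illustrative assumptions, and pitch, formant, modulation, or MFCC features would be appended analogously before a DNN/RNN/CNN classifier.

```python
import numpy as np

def frame_features(x, sr=16000, frame_ms=25, hop_ms=10):
    """Per-frame short-time log energy and zero-crossing rate for a waveform x."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    feats = np.zeros((n_frames, 2))
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        feats[i, 0] = np.log(np.sum(frame**2) + 1e-12)              # log energy
        feats[i, 1] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
    return feats

# Toy usage: 1 s of noise-like signal at 16 kHz -> (frames, 2) feature matrix
# that a frame-level classifier would consume.
x = np.random.default_rng(0).normal(size=16000)
print(frame_features(x).shape)
```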