Self-Adaptive Soft Voice Activity Detection Using Deep Neural Networks for Robust Speaker Verification

Jung, Yongju; Choi, Yeunju; Kim, Hoirin

doi:10.1109/asru46091.2019.9003935

Cited by 16 publications (19 citation statements)

References 22 publications (43 reference statements)


“…The pooled vector is passed to one or few fully-connected (FC) layers to generate the deep speaker embedding z. The works in [14], [19], [20], [33] are examples of this approach.…”

Section: Deep Speaker Embedding Learning (mentioning)

confidence: 99%
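The statement above describes a pooled vector being passed through fully-connected layers to obtain the embedding z. A minimal sketch of that data flow follows, assuming plain mean pooling and a single FC layer without a nonlinearity. The function names, weights, and dimensions are hypothetical and are not taken from the cited papers.

```python
def mean_pool(frames):
    """Average frame-level feature vectors over time (assumed pooling choice)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def fc_layer(x, weight, bias):
    """One fully-connected layer: each output is a weighted sum of x plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


# Two frames of 2-dimensional features (toy values).
frames = [[1.0, 2.0], [3.0, 4.0]]
pooled = mean_pool(frames)

# Hypothetical 3x2 weight matrix and bias mapping the pooled vector to z.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 1.0]
z = fc_layer(pooled, weight, bias)
```

In a trained system the pooling may be attentive rather than a plain mean, and the FC weights are learned jointly with the frame-level feature extractor.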

“…To improve the robustness of the SV model to long nonspeech segments, we proposed self-adaptive soft VAD (SAS-VAD) [33], which is the combination of soft VAD and self-adaptive VAD. Here, we introduce the advanced version of SAS-VAD which shows better performance than the original one and can be combined with the MSA to achieve our ultimate goal.…”

Section: Self-adaptive Soft Voice Activity Detection (mentioning)

confidence: 99%

“…In this subsection, we explain our previous soft VAD [33] first and its advanced version later. Unlike typical VADs that make a hard decision on acoustic features with a predefined threshold, the soft VAD makes a soft decision on speaker feature vectors when the self-attentive pooling (SAP) [30] is applied.…”

Section: A. Soft VAD (mentioning)

confidence: 99%
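The contrast drawn above, a hard threshold on acoustic features versus a soft decision applied to speaker feature vectors during self-attentive pooling, can be illustrated with a small sketch. It assumes that the soft decision scales each frame's attention weight by a per-frame speech posterior and then renormalizes. That weighting rule, the function names, and the toy values are illustrative assumptions, not the formulation from the cited paper.

```python
import math


def hard_vad_mean(frames, energies, threshold):
    """Hard decision: keep frames whose energy exceeds a fixed threshold, then average."""
    kept = [f for f, e in zip(frames, energies) if e > threshold]
    if not kept:
        raise ValueError("no frame passed the threshold")
    dim = len(kept[0])
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def soft_vad_pool(frames, attention_scores, speech_posteriors):
    """Soft decision (assumed form): attention weight times speech posterior, renormalized."""
    attn = softmax(attention_scores)
    weights = [a * p for a, p in zip(attn, speech_posteriors)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all frames have zero speech posterior")
    weights = [w / total for w in weights]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy input: the third frame stands in for non-speech.
frames = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pooled, weights = soft_vad_pool(frames, [0.0, 0.0, 0.0], [1.0, 1.0, 0.0])
hard = hard_vad_mean(frames, [1.0, 1.0, 0.1], 0.5)
```

With posteriors strictly between 0 and 1, the soft variant down-weights uncertain frames instead of discarding them, so no predefined threshold is needed.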

“…(2) They contain audio recordings which already have small portion of non-speech. However, our previous work [33] shows the need of the robust VAD for SV in real-world environments, where the input audio contains long non-speech segments in noisy and reverberant environments. In these adverse environments, the energy-based VAD produces unreliable speech frames, which degrades the performance of SV systems [34].…”

Section: Introduction (mentioning)

confidence: 99%

“…To satisfy these two requirements for TI-SV, which is our ultimate goal in this paper, we present our methods: feature pyramid module (FPM)-based multi-scale aggregation (MSA) [22] and self-adaptive soft VAD (SAS-VAD) [33]. We employ the FPM-based MSA to deal with short speech segments.…”

Section: Introduction (mentioning)

confidence: 99%


A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

et al. 2020

Self Cite


Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, there is an increasing requirement for an SV system: it should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system should be robust to an audio stream containing long non-speech segments, where voice activity detection (VAD) is not applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present the FPM-based MSA to deal with short speech segments in noisy and reverberant environments. Also, we use the SAS-VAD to increase the robustness to long non-speech segments. To further improve the robustness to acoustic distortions (i.e., noise and reverberation), we apply a masking-based speech enhancement (SE) method. We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work combining these three models in a deep learning framework. We conduct experiments on Korean indoor (KID) and VoxCeleb datasets, which are corrupted by noise and reverberation. The results show that the proposed method is effective for SV in the challenging conditions and performs better than the baseline i-vector and deep speaker embedding systems.
