Interspeech 2019
DOI: 10.21437/interspeech.2019-2177
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification

Abstract: In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract speaker embeddings from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only generates a fixed-dimensional speaker em…
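The pooling scheme the abstract describes can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: the real SPE layer learns the codebook and assignment weights end-to-end inside the network, whereas here the learnable dictionary encoding (LDE) step is approximated by soft-assigning residuals to a fixed codebook, and the pyramid levels `(1, 2)` and all shapes are illustrative.

```python
import numpy as np

def lde_encode(feats, codebook, smoothing=1.0):
    # Simplified LDE: soft-assign each frame's residual to K codewords and
    # aggregate. feats: (N, D) positions from one sub-region; codebook: (K, D).
    resid = feats[:, None, :] - codebook[None, :, :]        # (N, K, D)
    dist2 = (resid ** 2).sum(-1)                            # (N, K)
    # subtract per-row min before exp for numerical stability
    w = np.exp(-smoothing * (dist2 - dist2.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)                       # soft assignment
    enc = (w[:, :, None] * resid).sum(axis=0)               # (K, D)
    return enc.ravel()                                      # (K*D,)

def spatial_pyramid_encode(fmap, codebook, levels=(1, 2)):
    # fmap: (C, H, W) feature map from the ResNet. Partition it into an
    # s x s grid per pyramid level, encode each sub-region, and concatenate.
    C, H, W = fmap.shape
    embs = []
    for s in levels:
        for i in range(s):
            for j in range(s):
                sub = fmap[:, i*H//s:(i+1)*H//s, j*W//s:(j+1)*W//s]
                feats = sub.reshape(C, -1).T                # (positions, C)
                embs.append(lde_encode(feats, codebook))
    return np.concatenate(embs)  # fixed-dimensional regardless of H, W

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))       # C=8 channels, 4x4 spatial map
codebook = rng.standard_normal((3, 8))      # K=3 codewords of dimension D=C
emb = spatial_pyramid_encode(fmap, codebook)
# levels (1, 2) give 1 + 4 = 5 sub-regions, each K*D = 24 dims -> 120 total
```

Note the key property: because each sub-region is pooled down to a fixed K*D vector, the concatenated embedding has the same dimensionality whatever the input utterance length produced for H and W.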

Cited by 28 publications (28 citation statements) · References 35 publications
“…The TDNN is composed of 1D convolution and fully connected layers. Several later studies [16,19,21,22,25,27] suggest replacing the TDNN with variants of ResNet34, composed of 2D convolutions, as the encoder network. We used TDNN and ResNet34 for x-vector and LDE embeddings, respectively.…”
Section: Encoder Network
confidence: 99%
“…Villalba et al. summarized several state-of-the-art speaker recognition systems for the NIST SRE18 Challenge [16], where x-vector based systems [17] consistently outperformed i-vector based systems [18]. There has also been a surge of interest in new encoding methods and end-to-end loss functions for speaker recognition [19,20,21,22,23,24,25]. One prominent advancement is the use of learnable dictionary encoding (LDE) [19] and angular softmax [20] for speaker recognition, which are reported to boost speaker recognition performance on open-source corpora such as the VoxCelebs [26,27].…”
Section: Introduction
confidence: 99%
“…The first module is the frame-level feature extractor, which takes a sequence of acoustic features x_t and outputs corresponding speaker features h_t (t = 1, …, T). In our system, ResNet [19] is used as the feature extractor, which has been widely used in previous studies [20,21,22]. The architecture is described in Table 1.…”
Section: Attention-Based Soft VAD With the SV System
confidence: 99%
“…The minibatch size is 64, and the weight decay parameter is 0.0001. We use the same learning rate schedule as in [22] with the initial learning rate of 0.1.…”
Section: Experimental Setups for Speaker Verification
confidence: 99%
“…This assumes that voice activity detection (VAD) has not been applied to the audio to remove non-speech frames, which may degrade SV performance. Most SV studies still rely on a traditional energy-based VAD [13], [20], [30], and some do not apply VAD at all [31], [32]. This is because most SV databases fall into the following two cases, minimizing the need for robust VAD: (1) They were recorded in relatively clean conditions, where the naive energy-based VAD performs reasonably well.…”
Section: Introduction
confidence: 99%