2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)
DOI: 10.1109/iscslp.2018.8706687
End-to-end Language Identification using NetFV and NetVLAD

Abstract: In this paper, we apply the NetFV and NetVLAD layers to the end-to-end language identification task. NetFV and NetVLAD layers are differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, respectively. Both encode a sequence of feature vectors into a fixed-dimensional vector, which is essential for processing variable-length utterances. We first present the relevances and differences between the classical i-vector and the af…
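As a rough illustration of the fixed-dimensional encoding idea described in the abstract, the following is a minimal NumPy sketch of a NetVLAD-style layer. The cluster count, feature dimension, and the randomly initialised centres and assignment weights here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def netvlad(features, centers, weights, bias):
    """Encode a variable-length (T, D) frame sequence into a fixed
    (K*D,) vector: soft-assign each frame to K learned clusters,
    aggregate the residuals per cluster, then normalize."""
    logits = features @ weights + bias           # (T, K) assignment scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # soft assignment, (T, K)
    # residual of each frame from each centre, weighted by its assignment
    resid = features[:, None, :] - centers[None, :, :]      # (T, K, D)
    V = (a[:, :, None] * resid).sum(axis=0)                 # (K, D)
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalize
    v = V.reshape(-1)                                       # flatten to (K*D,)
    return v / (np.linalg.norm(v) + 1e-12)                  # final L2 norm

rng = np.random.default_rng(0)
D, K = 4, 3
centers = rng.normal(size=(K, D))
weights = rng.normal(size=(D, K))
for T in (10, 25):  # two different "utterance" lengths
    out = netvlad(rng.normal(size=(T, D)), centers, weights, np.zeros(K))
    print(out.shape)  # always (12,), independent of T
```

The key property, which the abstract relies on, is that the output dimension K*D does not depend on the sequence length T, so utterances of any duration map to vectors a fixed-size classifier can consume.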

Cited by 14 publications (11 citation statements). References 32 publications.
“…The final pooling strategy of interest is the Learnable Dictionary Encoding (LDE) proposed by [14,15]. This method is closely based on the NetVLAD layer [16,23] designed for image retrieval.…”
Section: Related Work
confidence: 99%
“…For example, Geng et al [14] investigate the use of RNNs for temporal aggregation in language identification. Cai et al [15] explore the encoder and loss function for LID and propose some efficient temporal aggregation strategies, while Chen et al [16] use NetVLAD [17] for temporal aggregation. In more recent work, [18] uses a 2D CNN as a feature extractor with a BLSTM backend for temporal modelling and a self-attentive pooling layer for utterance-level aggregation.…”
Section: Related Work
confidence: 99%
“…NetVLAD. We also consider NetVLAD [17], which has been successfully used for temporally aggregating features in speech models for LID [16] and speaker verification [42]. NetVLAD mimics the BoW-derived VLAD [43] descriptor by learning a feature vocabulary from the input representations, then soft-quantising them over this dictionary, and finally aggregating the results (in our case temporally).…”
Section: Self-attentive Pooling (SAP)
confidence: 99%
“…The most common one is the average pooling layer, which aggregates statistics (i.e., the mean, or the mean and standard deviation) over the whole utterance [3,4]. Self-attentive pooling layers [31], learnable dictionary encoding (LDE) layers [32], and dictionary-based NetVLAD layers [33,34] have also been proposed to serve as encoding layers. The utterance-level representation after the encoding layer is further processed through a fully connected layer followed by a speaker classifier.…”
Section: Deep Speaker Embedding
confidence: 99%
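The average (statistics) pooling layer mentioned in that last citation statement is the simplest of these aggregation strategies. As a minimal NumPy illustration (the feature dimension and sequence lengths below are arbitrary, not taken from any of the cited systems):

```python
import numpy as np

def stats_pool(frames):
    """Mean + standard-deviation pooling: aggregate a (T, D) sequence of
    frame-level features into a fixed (2*D,) utterance-level vector."""
    mu = frames.mean(axis=0)      # (D,) per-dimension mean over time
    sigma = frames.std(axis=0)    # (D,) per-dimension std over time
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(1)
for T in (50, 200):  # two different utterance lengths
    v = stats_pool(rng.normal(size=(T, 8)))
    print(v.shape)  # (16,) regardless of T
```

Like NetVLAD and LDE, this maps any utterance length T to a fixed-size vector, but with no learnable parameters in the pooling itself; the dictionary-based layers trade that simplicity for a richer, cluster-wise summary of the frame distribution.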