2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)
DOI: 10.1109/iscslp.2018.8706687
End-to-end Language Identification using NetFV and NetVLAD

Abstract: In this paper, we apply the NetFV and NetVLAD layers to the end-to-end language identification task. NetFV and NetVLAD layers are differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, respectively. Both encode a sequence of feature vectors into a fixed-dimensional vector, which is essential for processing variable-length utterances. We first present the relevances and differences between the classical i-vector and the af…
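As a rough illustration of the fixed-dimensional encoding idea described in the abstract, the following is a minimal NumPy sketch of a NetVLAD-style layer. The cluster count, feature dimension, and the randomly initialised centres and assignment weights here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def netvlad(features, centers, weights, bias):
    """Encode a variable-length (T, D) frame sequence into a fixed
    (K*D,) vector: soft-assign each frame to K learned clusters,
    aggregate the residuals per cluster, then normalize."""
    logits = features @ weights + bias           # (T, K) assignment scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # soft assignment, (T, K)
    # residual of each frame from each centre, weighted by its assignment
    resid = features[:, None, :] - centers[None, :, :]      # (T, K, D)
    V = (a[:, :, None] * resid).sum(axis=0)                 # (K, D)
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-normalize
    v = V.reshape(-1)                                       # flatten to (K*D,)
    return v / (np.linalg.norm(v) + 1e-12)                  # final L2 norm

rng = np.random.default_rng(0)
D, K = 4, 3
centers = rng.normal(size=(K, D))
weights = rng.normal(size=(D, K))
for T in (10, 25):  # two different "utterance" lengths
    out = netvlad(rng.normal(size=(T, D)), centers, weights, np.zeros(K))
    print(out.shape)  # always (12,), independent of T
```

The key property, which the abstract relies on, is that the output dimension K*D does not depend on the sequence length T, so utterances of any duration map to vectors a fixed-size classifier can consume.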

Cited by 14 publications (11 citation statements). References 32 publications.
“…The final pooling strategy of interest is the Learnable Dictionary Encoding (LDE) proposed by [14,15]. This method is closely based on the NetVLAD layer [16,23] designed for image retrieval.…”
Section: Related Work
confidence: 99%
“…For example, Geng et al [14] investigate the use of RNNs for temporal aggregation in language identification. Cai et al [15] explore the encoder and loss function for LID and propose some efficient temporal aggregation strategies, while Chen et al [16] use NetVLAD [17] for temporal aggregation. In more recent work, [18] uses a 2D CNN as a feature extractor with a BLSTM backend for temporal modelling and a self-attentive pooling layer for utterance-level aggregation.…”
Section: Related Work
confidence: 99%
“…NetVLAD. We also consider NetVLAD [17], which has been successfully used for temporally aggregating features in speech models for LID [16] and speaker verification [42]. NetVLAD mimics the BoW-derived VLAD [43] descriptor by learning a feature vocabulary from the input representations, then soft-quantising them over this dictionary, and finally aggregating the results (in our case temporally).…”
Section: Self-attentive Pooling (SAP)
confidence: 99%
“…The most common one is the average pooling layer, which aggregates statistics (i.e., the mean, or the mean and standard deviation) over the whole utterance [3,4]. Self-attentive pooling layers [31], learnable dictionary encoding (LDE) layers [32], and dictionary-based NetVLAD layers [33,34] have also been proposed to serve as encoding layers. The utterance-level representation after the encoding layer is further processed through a fully connected layer followed by a speaker classifier.…”
Section: Deep Speaker Embedding
confidence: 99%
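The average (statistics) pooling layer mentioned in that last citation statement is the simplest of these aggregation strategies. As a minimal NumPy illustration (the feature dimension and sequence lengths below are arbitrary, not taken from any of the cited systems):

```python
import numpy as np

def stats_pool(frames):
    """Mean + standard-deviation pooling: aggregate a (T, D) sequence of
    frame-level features into a fixed (2*D,) utterance-level vector."""
    mu = frames.mean(axis=0)      # (D,) per-dimension mean over time
    sigma = frames.std(axis=0)    # (D,) per-dimension std over time
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(1)
for T in (50, 200):  # two different utterance lengths
    v = stats_pool(rng.normal(size=(T, 8)))
    print(v.shape)  # (16,) regardless of T
```

Like NetVLAD and LDE, this maps any utterance length T to a fixed-size vector, but with no learnable parameters in the pooling itself; the dictionary-based layers trade that simplicity for a richer, cluster-wise summary of the frame distribution.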