2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018
DOI: 10.1109/icassp.2018.8462025

A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification

Abstract: A novel learnable dictionary encoding layer is proposed in this paper for end-to-end language identification. It is in line with the conventional GMM i-vector approach both theoretically and practically. We imitate the mechanism of traditional GMM training and the supervector encoding procedure on top of a CNN. The proposed layer can accumulate high-order statistics from a variable-length input sequence and generate an utterance-level fixed-dimensional vector representation. Unlike the conventional methods, our new…
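
The pooling idea in the abstract, turning a variable-length sequence of frame features into one fixed-dimensional utterance vector, can be illustrated with a minimal sketch. The abstract does not give the layer's equations, so everything here is an assumption made for illustration: the function name `lde_pool`, the softmax soft-assignment over scaled squared distances, and the use of first-order residuals only. In a trained layer the centers and scales would be learned parameters; here they are plain inputs.

```python
import math


def lde_pool(frames, centers, scales):
    """Pool a variable-length list of D-dim frames into a fixed C*D vector.

    Hypothetical sketch (not the paper's exact formulation): soft-assign each
    frame to C dictionary centers, then average the weighted residuals.
    """
    num_centers, dim = len(centers), len(centers[0])
    weight_sums = [0.0] * num_centers
    residual_sums = [[0.0] * dim for _ in range(num_centers)]

    for x in frames:
        # Negative scaled squared distance to each center acts as a logit.
        logits = [
            -scales[c] * sum((x[d] - centers[c][d]) ** 2 for d in range(dim))
            for c in range(num_centers)
        ]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        for c in range(num_centers):
            w = exps[c] / total
            weight_sums[c] += w
            for d in range(dim):
                residual_sums[c][d] += w * (x[d] - centers[c][d])

    # Normalize per center and concatenate into one utterance-level vector.
    pooled = []
    for c in range(num_centers):
        denom = weight_sums[c] if weight_sums[c] > 0 else 1.0
        pooled.extend(r / denom for r in residual_sums[c])
    return pooled
```

Under these assumptions the output length is always C*D (number of centers times feature dimension), regardless of how many frames are passed in, which is the fixed-dimensional property the abstract describes.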

Cited by 64 publications (48 citation statements)
References 23 publications
“…Self-attentive pooling layer [20], learnable dictionary encoding layer [21], and dictionary-based NetVLAD layer [22,23] are other commonly used encoding layers. Once the utterance-level representation is extracted, a fully connected layer and a speaker classifier are employed to further abstract the speaker representation and classify the training speakers.…”
Section: Revisit: Deep Speaker Embedding (mentioning)
confidence: 99%
“…On the other hand, traditional methods for speaker and language identification such as i-vector systems have explored the use of statistical or dictionary-based methods for aggregation. A number of recent works have proposed to bring similar methods to deep speaker recognition [11,12,13,14,15] (described in Sec. 1.1).…”
Section: Introduction (mentioning)
confidence: 99%
“…We implemented two variants of SENets: SEnet34 with ResNet34 backbone, and SEnet50 with ResNet50 backbone. Mean-Std ResNet: Recent work in speaker recognition [29,30] has demonstrated that ResNet [27] with pooling achieves comparable results as x-vectors [31]. Therefore, we introduced ResNet with pooling for anti-spoofing.…”
Section: DNN Model (mentioning)
confidence: 99%