2021 29th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco54536.2021.9616048
Study On the Temporal Pooling Used In Deep Neural Networks For Speaker Verification

Cited by 3 publications (2 citation statements)
References 12 publications
“…where E_G[·] is the expectation over G, µ the empirical mean, and ∥ concatenation. That is, P(g) ∈ ℝ^(R·d) concatenates the first R moments of g. In speaker verification, [20] shows that the 3rd and 4th moments alone are not useful, while [21] uses R = 4 for auxiliary tasks. In our case, we feed R = 5 moments to the classifier.…”
Section: Our Methods (mentioning)
confidence: 99%
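The moment pooling described in the excerpt above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes frame-level features g of shape (T, d), takes the empirical mean as the first moment, and (an assumption, since the excerpt does not say) uses central moments for orders r ≥ 2 before concatenating all R of them into a single R·d vector.

```python
import numpy as np

def moment_pooling(g, R=5):
    """Concatenate the first R temporal moments of g (T x d) into one (R*d,) vector.

    Sketch of the pooling in the quoted excerpt. Assumption: the mean is the
    first moment and orders r >= 2 are central moments E[(g - mu)^r].
    """
    mu = g.mean(axis=0)                               # first moment: empirical mean, shape (d,)
    moments = [mu]
    for r in range(2, R + 1):
        moments.append(((g - mu) ** r).mean(axis=0))  # r-th central moment, shape (d,)
    return np.concatenate(moments)                    # shape (R * d,)

T, d = 100, 4
g = np.random.default_rng(0).normal(size=(T, d))
p = moment_pooling(g, R=5)
```

With R = 5 and d = 4 the pooled vector has 20 components, matching the ℝ^(R·d) dimensionality stated in the excerpt.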
“…It is useful to note that the statistics pooling layer computes the mean and standard-deviation vectors of its input and concatenates them to form the output vector. Both mean and standard-deviation pooling were found to outperform max pooling in speaker identification and verification [86]. Since speaker identification is similar to language identification in terms of NN-based model configuration, training, and evaluation, the statistics pooling layer is expected to achieve higher LID performance than the max pooling layer.…”
Section: X-vector Self-attention LID Model (mentioning)
confidence: 99%
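The statistics pooling operation described in the excerpt above reduces to a few lines. A minimal sketch, assuming frame-level features x of shape (T, d); the max-pooling baseline is included only for the shape comparison and is not drawn from the cited systems:

```python
import numpy as np

def statistics_pooling(x):
    """Mean + standard-deviation pooling over time: (T, d) -> (2*d,)."""
    return np.concatenate([x.mean(axis=0), x.std(axis=0)])

def max_pooling(x):
    """Per-dimension max over time, for comparison: (T, d) -> (d,)."""
    return x.max(axis=0)

x = np.random.default_rng(1).normal(size=(200, 3))
s = statistics_pooling(x)   # 2*d = 6 components
m = max_pooling(x)          # d = 3 components
```

Statistics pooling doubles the output dimensionality relative to max pooling because it keeps both a location (mean) and a spread (standard deviation) summary of each feature dimension.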