Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Xie, Xurong; Wang, Tianzi; Cui, Mingyu; Xue, Boyang; Jin, Zengrui; Li, Guinan; Hu, Shujie; Liu, Xunying

doi:10.1109/taslp.2023.3250842

Cited by 4 publications

(3 citation statements)

References 109 publications

“…These include, but not limited to: 1) auxiliary speaker embedding based approaches [14][15][16][17][18], e.g. iVector [14] and xVector [15]; 2) feature transformation based methods, e.g., feature-space MLLR [19]; and 3) model-based methods [20][21][22][23] that estimate speaker dependent (SD) adapter parameters implemented as, e.g. learning hidden unit contributions (LHUC) [21], during speaker adaptive training (SAT) and test-time unsupervised adaptation [22,23].…”

Section: Introduction

mentioning

confidence: 99%

“…3) This paper presents the first investigation of the complete incorporation of speaker features into all the components of a complete end-to-end audio-visual multichannel speech separation and recognition system. In contrast, prior researches consider speaker adaptation of either the speech separation front-end [12,[24][25][26][27][28][29][30][31][32] alone, or the speech recognition backend [14,[16][17][18][19][21][22][23] only.…”

Section: Introduction

mentioning

confidence: 99%

“…In contrast, the related prior studies only focus on less practical enrollment-based adaptation techniques [24][25][26][27][28][29][30][31][32] that require clean speech samples to be explicitly recorded at the onset of user personalization.…”

mentioning

confidence: 99%

Identity of university Chinese heritage language learners in Hong Kong

Li¹,

李蓁²

The core of out-of-distribution (OOD) detection is to learn the in-distribution (ID) representation, which is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find surprisingly that simply using reconstruction-based methods could boost the performance of OOD detection significantly. We deeply explore the main contributors of OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which benefits the model in learning intrinsic data distributions of the ID dataset. Specifically, we take Masked Image Modeling as a pretext task for our OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms previous SOTA of one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even defeats the 10-shot-per-class outlier exposure OOD detection, although we do not include any OOD samples for our detection.
