Interspeech 2021
DOI: 10.21437/interspeech.2021-277
Modeling and Training Strategies for Language Recognition Systems

Abstract: Automatic speech recognition is complementary to language recognition. Language recognition systems exploit this complementarity by using frame-level bottleneck features extracted from neural networks trained on a phone recognition task. Recent methods instead extract frame-level bottleneck features from an end-to-end sequence-to-sequence speech recognition model. In this work, we study an integrated approach to training the speech recognition feature extractor and the language recognition modules. We…

Cited by 8 publications (6 citation statements) | References 32 publications
“…1 in Table 4. We get the same conclusion as [16], that is, finetuning the LID task with the unfrozen encoder outperforms that with the frozen encoder. Note that although No.…”
Section: Experimental Results Under Different Training Strategies (supporting, confidence: 70%)
“…It can be observed that these methods differ in whether the ASR encoder is frozen or not during the second stage of training. It is shown in [16] that the unfrozen encoder is superior in the recognition accuracy. In our preliminary experiments, we tried these two training strategies mentioned above and extracted fixed-length embeddings from some cross channel test data.…”
Section: A Trade-off Between the Recognition Accuracy And The General... (mentioning, confidence: 99%)
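The frozen/unfrozen distinction these excerpts compare can be sketched in a few lines of PyTorch. The model below is a minimal stand-in, not the architecture from the cited papers: a hypothetical `encoder` (representing a pretrained ASR encoder) feeding a LID classification `head`, with a helper that toggles which part is trainable during the second-stage finetuning.

```python
import torch
import torch.nn as nn

class LidModel(nn.Module):
    """Toy stand-in: pretrained ASR encoder + language-ID head."""
    def __init__(self, feat_dim=80, hidden=256, n_langs=10):
        super().__init__()
        # GRU stands in for the pretrained ASR encoder.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # Linear layer stands in for the LID classifier head.
        self.head = nn.Linear(hidden, n_langs)

    def forward(self, x):
        out, _ = self.encoder(x)
        # Mean-pool frame-level outputs, then classify the utterance.
        return self.head(out.mean(dim=1))

def set_encoder_frozen(model: LidModel, frozen: bool) -> None:
    """Frozen: only the LID head trains; unfrozen: the encoder finetunes too."""
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

model = LidModel()
set_encoder_frozen(model, frozen=True)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the head's weight and bias remain trainable
```

With `frozen=False`, the optimizer also updates the encoder, which is the strategy the excerpts report as giving better recognition accuracy.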
“…Bottleneck features. An acoustic model trained for ASR can also be used for other tasks which rely on the phonetic content but do not require a word-level transcription, such as language identification [63] or keyword spotting [77]. In such cases, instead of using the acoustic model output (triphone posterior probabilities), a sequence of phonetic features called bottleneck (BN) features is extracted from an intermediate layer of the acoustic model [81] and used, possibly in combination with other features, as input to these tasks.…”
Section: Speech Processing (mentioning, confidence: 99%)
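The bottleneck-feature idea described above can be illustrated with a forward hook that taps an intermediate layer instead of the model's triphone-posterior output. The toy acoustic model, layer sizes, and index below are illustrative assumptions, not taken from the cited works:

```python
import torch
import torch.nn as nn

# Toy acoustic model: 40-dim acoustic frames -> 2000 triphone posterior logits,
# with a narrow 64-dim "bottleneck" layer in the middle.
acoustic_model = nn.Sequential(
    nn.Linear(40, 512), nn.ReLU(),
    nn.Linear(512, 64), nn.ReLU(),   # index 2 is the bottleneck layer
    nn.Linear(64, 2000),             # ASR output layer (unused downstream)
)

features = {}
def save_bn(module, inputs, output):
    # Capture the bottleneck activations as they flow through the model.
    features["bn"] = output.detach()

# Tap the bottleneck layer rather than the final posterior output.
acoustic_model[2].register_forward_hook(save_bn)

frames = torch.randn(100, 40)   # 100 frames of 40-dim acoustic features
_ = acoustic_model(frames)      # BN features are captured as a side effect
print(features["bn"].shape)     # frame-level 64-dim BN features: (100, 64)
```

The captured frame-level BN features would then feed the downstream task (e.g., a language-identification classifier), possibly concatenated with other features.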