5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018)
DOI: 10.21437/chime.2018-1

The STC System for the CHiME 2018 Challenge

Abstract: This paper describes the Speech Technology Center (STC) system for the 5th CHiME challenge, which considers the problem of distant multi-microphone conversational speech recognition in everyday home environments. Our efforts were focused on the single-array track; however, we participated in the multiple-array track as well. The system belongs to ranking A of the challenge: the acoustic models retain frame-level tied phonetic targets, and the lexicon and language model are unchanged from the conventional …

Cited by 29 publications (12 citation statements) | References 54 publications
“…i-vectors, is a widely used technique [35]. Our CHiME5 experience [36] revealed that speaker adaptation is extremely useful not only for telephone and microphone speech recognition tasks but for distant ASR as well. Based on this experience, we decided to apply modern x-vector speaker embeddings [16,17] instead of i-vectors for speaker adaptation.…”
Section: Acoustic Model Adaptation Using Speaker Embeddings and RIR E…
confidence: 99%
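The speaker-adaptation idea quoted above — conditioning the acoustic model on a fixed-length speaker embedding such as an x-vector — is commonly realized by tiling the embedding across time and concatenating it to every acoustic feature frame. The sketch below illustrates only that feature-level concatenation with hypothetical dimensions (40-dim filterbanks, 512-dim embedding); it is not the STC system's actual pipeline.

```python
import numpy as np

def append_speaker_embedding(features: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Tile a fixed-length speaker embedding (e.g. an x-vector) over all
    frames and concatenate it to the per-frame acoustic features, so a
    neural acoustic model can condition on speaker identity."""
    frames = features.shape[0]
    tiled = np.tile(embedding, (frames, 1))            # (T, emb_dim)
    return np.concatenate([features, tiled], axis=1)   # (T, feat_dim + emb_dim)

# toy example: 100 frames of 40-dim filterbanks, 512-dim x-vector
fbanks = np.zeros((100, 40), dtype=np.float32)
xvec = np.ones(512, dtype=np.float32)
adapted = append_speaker_embedding(fbanks, xvec)
print(adapted.shape)  # (100, 552)
```

The same concatenation scheme works for i-vectors; only the embedding extractor changes, which is why swapping i-vectors for x-vectors requires no change to the acoustic model's input interface beyond the embedding dimension.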
“…Although it is inferior to USTC-iFlytek's, our system performs separation only once and apparently has lower computational complexity and model size. (Overall WER, %: [32] 62.09; Toshiba [33] 63.30; STC [23] 63.30; RWTH-Paderborn [34] 68.40; Official [13] 80.28.) The details on each session and location over the official AM and ours are given in Table 6. Even based on the official backend, our SD separation frontend contributes a 10% WER reduction, which is a significant improvement on this challenging task.…”
Section: Speaker-aware Training
confidence: 99%
“…In [14], suffering from low-quality training targets, the system achieved only a 2% absolute WER reduction. Second, inspired by [21,22,23], we incorporate i-vectors as auxiliary features, which aim to extract the target speaker. With the speaker-aware training technique, we achieve much better results using only one mask-estimation model.…”
Section: Introduction
confidence: 99%
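The speaker-aware mask estimation described in this citation can be pictured as two steps: the target speaker's i-vector is appended to every frame of the mixture spectrogram to form the network input, and the network's output mask is applied elementwise to suppress interfering speech. The sketch below shows only those two framing steps with hypothetical dimensions (257 frequency bins, 100-dim i-vector); the mask network itself is omitted and the names are illustrative, not from the cited papers.

```python
import numpy as np

def speaker_aware_input(spec: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate the target speaker's i-vector to every time frame of
    the magnitude spectrogram, giving the mask-estimation network a cue
    for which speaker to extract."""
    frames = spec.shape[0]
    return np.concatenate([spec, np.tile(ivector, (frames, 1))], axis=1)

def apply_mask(spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Elementwise time-frequency mask in [0, 1]: keep the target
    speaker's bins, attenuate interference and noise."""
    return spec * np.clip(mask, 0.0, 1.0)

# toy example: 10 frames, 257 frequency bins, 100-dim i-vector
spec = np.ones((10, 257), dtype=np.float32)
ivec = np.zeros(100, dtype=np.float32)
net_in = speaker_aware_input(spec, ivec)
print(net_in.shape)  # (10, 357)
```

Because the speaker cue is part of the input rather than the architecture, one mask-estimation model can serve all target speakers, which matches the "only one mask estimation model" claim in the quote.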
“…overlapped sentences or unfinished utterances, noise from home appliances at a signal-to-noise ratio (SNR) between 5 and 20 dB, distant-microphone speech, and a small training dataset of 40 h (i.e., low resources). Most competitive systems in the fifth CHiME challenge, except for [12], employ conventional ASR methods with multichannel speech enhancement techniques [15], [16], [17], [18].…”
Section: Introduction
confidence: 99%
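The 5–20 dB SNR range quoted above refers to the standard power-ratio definition of signal-to-noise ratio. As a minimal illustration (not tied to the CHiME data pipeline), SNR in decibels can be computed from a clean signal and a noise signal as follows:

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise),
    where P is the mean squared amplitude (average power)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    return float(10.0 * np.log10(p_signal / p_noise))

# toy example: noise amplitude one tenth of the signal's -> 20 dB SNR
sig = np.full(16000, 1.0)
noise = np.full(16000, 0.1)
print(snr_db(sig, noise))  # 20.0
```

A 10x amplitude ratio gives 100x the power, hence 20 dB; the challenge's 5 dB lower bound corresponds to noise power only about 3x below the speech power, which is part of what makes the task hard.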