ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414928

Hierarchical Transformer-Based Large-Context End-To-End ASR with Large-Context Knowledge Distillation

Abstract: We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing, in which each utterance is transcribed independently. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utterance boundaries, can handle sequences of utterances such as discourses and conversations well. However, the transformer …
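To make the hierarchical large-context idea concrete, here is a minimal PyTorch sketch of a two-level encoder in which an utterance-level transformer encodes each utterance and a context-level transformer attends over pooled embeddings of the preceding utterances. The class name, mean pooling, causal masking, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class HierarchicalContextEncoder(nn.Module):
    """Two-level encoder: an utterance-level transformer encodes each
    utterance independently, then a context-level transformer runs over
    the pooled utterance embeddings so each utterance can attend to the
    ones that precede it in the discourse."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        utt_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        ctx_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.utterance_encoder = nn.TransformerEncoder(utt_layer, n_layers)
        self.context_encoder = nn.TransformerEncoder(ctx_layer, n_layers)

    def forward(self, utterances: torch.Tensor) -> torch.Tensor:
        # utterances: (num_utts, frames, d_model) features for one conversation
        encoded = self.utterance_encoder(utterances)   # frame-level encodings
        pooled = encoded.mean(dim=1)                   # one vector per utterance
        n = pooled.size(0)
        # causal mask: each utterance attends only to itself and earlier utterances
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        context = self.context_encoder(pooled.unsqueeze(0), mask=causal).squeeze(0)
        # inject discourse-level context back into the frame-level encodings
        return encoded + context.unsqueeze(1)


# Example: a "conversation" of 5 utterances, each 120 frames of 256-dim features
features = torch.randn(5, 120, 256)
outputs = HierarchicalContextEncoder()(features)  # -> (5, 120, 256)
```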

Cited by 16 publications (4 citation statements) | References 26 publications
“…As the shared task is given with a separate training data set, an effective model has to be created during training. Therefore, a hierarchical transformer-based model for large-context end-to-end ASR can be used (Masumura et al, 2021). In the recent era, the environment is changing with smart systems, and it has been identified that there is a need for ASR systems capable of handling the speech of elderly people spoken in their native languages.…”
Section: Related Work
confidence: 99%
“…An increase in WER occurs if the quality of the recorded speech is low (Iribe et al, 2015). An E2E-ASR transformer can encode and decode hierarchically by combining transformers to capture large context (Masumura et al, 2021). Using the hybrid LSTM-transformer, the WER is reduced by 25.4% through transfer learning.…”
Section: Related Work
confidence: 99%
“…Recent model compression works fall under three general classes: pruning, which forces some weights or activations to zero [20-22, 26, 32, 34, 40, 47], combined with "zero-aware" memory encoding; knowledge distillation, which distills a larger "teacher" model into a smaller "student" model [1,17,24,27,29,37,43]; and quantization, where the parameters and/or activations are quantized to shorter bit-widths [6,19,39,50,51,54].…”
Section: Introduction
confidence: 99%
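As a point of reference for the knowledge-distillation class mentioned in the excerpt above, below is a minimal sketch of the standard soft-target distillation loss that interpolates a teacher's softened outputs with the ground-truth labels. The temperature and interpolation weight are illustrative defaults, not values taken from the cited paper or from the large-context distillation method itself.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Combine cross-entropy on ground-truth labels with a KL term that
    pulls the student's softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1.0 - alpha) * soft


# Example: a batch of 8 frames/tokens over a 500-symbol output vocabulary
student = torch.randn(8, 500)
teacher = torch.randn(8, 500)
labels = torch.randint(0, 500, (8,))
loss = distillation_loss(student, teacher, labels)
```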