ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414941

Partially Overlapped Inference for Long-Form Speech Recognition

Abstract: While end-to-end speech recognition models show impressive performance in many domains, they have difficulty decoding long-form utterances. The overlapped inference algorithm, which breaks ties between two parallel hypotheses, has been proposed for long-form speech recognition and shows dramatic performance improvements at the expense of doubled computational cost. In this paper, we propose a more effective way of performing overlapped inference by aligning partially matched hypotheses. Through the experiment on L…
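To make the cost trade-off in the abstract concrete, the following is a minimal sketch of how a long recording might be cut into partially overlapped decoding windows. It is not the paper's algorithm: the function name `overlapped_windows`, its parameters, and the frame-based units are all assumptions introduced for illustration. It only shows that the number of windows to decode, and hence the compute, grows with the overlap fraction.

```python
from typing import List, Tuple


def overlapped_windows(total_len: int, window: int, overlap: float) -> List[Tuple[int, int]]:
    """Return (start, end) ranges that cover [0, total_len).

    `overlap` is the fraction of each window shared with the next one
    (0.0 = no overlap, 0.5 = half overlap). Hypothetical helper, not
    taken from the paper.
    """
    if window <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("window must be positive and 0 <= overlap < 1")
    if total_len <= 0:
        return []
    # Hop size shrinks as the overlap fraction grows.
    hop = max(1, int(round(window * (1.0 - overlap))))
    spans = []
    start = 0
    while True:
        end = min(start + window, total_len)
        spans.append((start, end))
        if end >= total_len:
            break
        start += hop
    return spans


# Count the windows that would need decoding at several overlap fractions.
for frac in (0.0, 0.25, 0.5):
    print(frac, len(overlapped_windows(1000, 100, frac)))
```

Under these assumptions, half overlap needs roughly twice as many windows as no overlap, which is consistent with the doubled cost the abstract attributes to fully overlapped inference; smaller overlap fractions fall in between.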

Cited by 7 publications (4 citation statements)
References 15 publications
“…However, even though computation cost reduces when using lower overlapping percentage, WER degrades monotonically. For algorithm details, please refer to [15].…”
Section: Partial Overlapping Inference (mentioning)
confidence: 99%
“…where d_i(j) denotes the j-th word in the i-th segment, w_sub and w_match are substitution cost and matching reward, respectively, and e_sub is the operation cost [15]. A substitution error is omitted no matter how similar two words are.…”
Section: Soft-match (mentioning)
confidence: 99%
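The matching reward and substitution cost mentioned in the statement above can be illustrated with a generic dynamic-programming word alignment. This is a sketch under stated assumptions, not the formula from the cited work: the function `alignment_score`, the gap cost `w_gap`, and the default weights are hypothetical, and the e_sub term from the quote is not modelled.

```python
from typing import List


def alignment_score(a: List[str], b: List[str],
                    w_match: float = 1.0, w_sub: float = 1.0,
                    w_gap: float = 1.0) -> float:
    """Best global alignment score between two word lists.

    Matching words earn `w_match`; substitutions cost `w_sub`;
    insertions and deletions cost `w_gap` (an assumed parameter).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -w_gap * i
    for j in range(1, m + 1):
        score[0][j] = -w_gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Exact string equality: similar words still count as substitutions.
            diag = w_match if a[i - 1] == b[j - 1] else -w_sub
            score[i][j] = max(score[i - 1][j - 1] + diag,
                              score[i - 1][j] - w_gap,
                              score[i][j - 1] - w_gap)
    return score[n][m]


print(alignment_score(["the", "cat", "sat"], ["the", "hat", "sat"]))
```

Because the comparison is exact, near-identical words are penalised as heavily as unrelated ones, which appears to be the limitation that the "Soft-match" section of the citing paper addresses.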
“…However, the generalization capability of the AED models to long-form speech is poor [4,15], and how to mitigate this problem is still an open question. Several methods have tackled this problem by incorporating alignment information to the training as supervision [14,16,17], window-based overlapped offline inference [4,18], modifying LSTM encoder states [3], and adopting new architecture [12,15]. It is also a common practice to segment long-form audio with a separate voice activity detection (VAD) model in advance [19].…”
Section: Introduction (mentioning)
confidence: 99%