Partially Overlapped Inference for Long-Form Speech Recognition

Kang, Tae Gyoon; Kim, Ho-Gyeong; Lee, Min-Joong; Lee, Ji‐Hyun; Lee, Hoshik

doi:10.1109/icassp39728.2021.9414941

Cited by 7 publications

(4 citation statements)

References 15 publications


“…However, even though computation cost reduces when using lower overlapping percentage, WER degrades monotonically. For algorithm details, please refer to [15].…”

Section: Partial Overlapping Inference (mentioning)

confidence: 99%

“…where $d_i(j)$ denotes the $j$-th word in the $i$-th segment, $w_{\text{sub}}$ and $w_{\text{match}}$ are substitution cost and matching reward, respectively, and $e_{\text{sub}}$ is the operation cost [15]. A substitution error is omitted no matter how similar two words are.…”

Section: Soft-match (mentioning)

confidence: 99%
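The statement above describes aligning the word sequences of overlapping segments with a matching reward and a substitution cost. The following is a minimal sketch of such a hard-match alignment score, not the cited papers' implementation: the function name, the default weights, and the gap cost `w_gap` are assumptions introduced for illustration, and the operation cost mentioned in the statement is not modeled.

```python
def alignment_score(a, b, w_match=1.0, w_sub=1.0, w_gap=1.0):
    """Best global alignment score between word lists a and b.

    Exact matches earn w_match; any mismatch pays w_sub, however similar
    the two words are. w_gap (insertion/deletion cost) is an assumption
    added for this sketch and does not come from the cited text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per word.
    for i in range(1, rows):
        score[i][0] = -w_gap * i
    for j in range(1, cols):
        score[0][j] = -w_gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = w_match if a[i - 1] == b[j - 1] else -w_sub
            score[i][j] = max(
                score[i - 1][j - 1] + diag,  # match or substitution
                score[i - 1][j] - w_gap,     # word only in a
                score[i][j - 1] - w_gap,     # word only in b
            )
    return score[-1][-1]
```

Under this hard match, "colour" against "color" is penalized exactly like any unrelated word pair, which is the behavior a soft match is meant to relax.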

“…But there is a large performance gap using a shorter interval. Thus, partial overlapping inference (POI) is introduced to solve the nonoverlapped region problem using different margin conditions [15]. But POI still degrades recognition accuracy due to the lack of common words under lower overlapping percentages.…”

Section: Introduction (mentioning)

confidence: 99%


VADOI: Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition

Wang, Tong, Guo et al. 2022

Preprint


While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, word error rate (WER) decreases monotonically as the overlapping percentage increases. Setting aside computational cost, the setup with 50% overlapping during inference achieves the best performance. However, a lower overlapping percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computation cost. Results show that the proposed method can achieve a 20% relative computation cost reduction on the LibriSpeech and Microsoft Speech Language Translation long-form corpora while maintaining WER, compared with the best-performing overlapping inference algorithm. We also propose Soft-Match to compensate for the problem of similar words being misaligned.
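To make the trade-off between overlapping percentage and computation cost concrete, here is a minimal sketch of generic fixed-length sliding windows. It is not the algorithm from either paper: the function names and the window length are hypothetical, and the step that merges hypotheses from overlapping windows is left out.

```python
import math


def overlapping_windows(total_sec, window_sec, overlap):
    """Return (start, end) times of fixed-length windows covering the audio.

    overlap is the fraction of each window shared with its successor
    (0.0 = no overlap, 0.5 = 50% overlap). Names are illustrative only.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if total_sec <= window_sec:
        return [(0.0, total_sec)]
    hop = window_sec * (1.0 - overlap)  # distance between window starts
    n = math.ceil((total_sec - window_sec) / hop) + 1
    # The final window is clipped to the end of the recording.
    return [(i * hop, min(i * hop + window_sec, total_sec)) for i in range(n)]


def relative_cost(total_sec, window_sec, overlap):
    """Total decoded duration divided by the original duration."""
    windows = overlapping_windows(total_sec, window_sec, overlap)
    return sum(end - start for start, end in windows) / total_sec
```

In this simplified model, a 100-second recording split into 10-second windows is decoded roughly 1.9 times over at 50% overlap and 1.3 times over at 25% overlap, which is why lower overlapping percentages are cheaper.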

“…However, the generalization capability of the AED models to long-form speech is poor [4,15], and how to mitigate this problem is still an open question. Several methods have tackled this problem by incorporating alignment information to the training as supervision [14,16,17], window-based overlapped offline inference [4,18], modifying LSTM encoder states [3], and adopting new architecture [12,15]. It is also a common practice to segment long-form audio with a separate voice activity detection (VAD) model in advance [19].…”

Section: Introduction (mentioning)

confidence: 99%

VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

Inaguma, Kawahara 2021

Preprint


In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
