Interspeech 2017
DOI: 10.21437/interspeech.2017-1320
NMT-Based Segmentation and Punctuation Insertion for Real-Time Spoken Language Translation

Abstract: Insertion of proper segmentation and punctuation into an ASR transcript is crucial not only for the performance of subsequent applications but also for the readability of the text. In a simultaneous spoken language translation system, the segmentation model has to fulfill real-time constraints and minimize latency as well. In this paper, we show the successful integration of an attentional encoder-decoder-based segmentation and punctuation insertion model into a real-time spoken language translation system. Th…
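The abstract frames segmentation and punctuation insertion as an attentional encoder-decoder problem. As a rough sketch of that idea (not the paper's actual architecture; the label inventory, layer sizes, and the PunctuationTagger name below are assumptions), punctuation insertion can be cast as predicting one label per ASR word with an attentional recurrent encoder:

```python
# Minimal sketch: attentional recurrent tagger that predicts a punctuation
# label after every input word. Hyperparameters and labels are illustrative.
import torch
import torch.nn as nn

LABELS = ["O", "COMMA", "PERIOD"]  # assumed label inventory

class PunctuationTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128, n_labels=len(LABELS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, time, emb)
        enc, _ = self.encoder(x)           # (batch, time, 2*hidden)
        ctx, _ = self.attn(enc, enc, enc)  # self-attention over encoder states
        return self.out(ctx)               # (batch, time, n_labels)

# Toy usage on a fake 12-word ASR chunk.
model = PunctuationTagger(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 12))
print(model(tokens).argmax(-1))            # one label index per word
```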

Cited by 41 publications (44 citation statements)
References 13 publications

“…The ones with plain text outperform the ones with […] To explore the impact of the min_words_cut value on the quality of the result, we performed the experiment on a sequence-to-sequence LSTM model with an overlap of 15 words and min_words_cut ranging from 0 to 15. The outcome shown in Figure 5 indicates that F1-scores peak in the middle range of chunk size (4–10). It demonstrates that predictions of uppercase and lowercase are stable and independent of min_words_cut.…”
Section: Evaluation on Plain-text Model and Encoded-text Model (mentioning)
confidence: 83%
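The overlapping-chunk setup described in the excerpt above can be sketched as follows. The excerpt does not define min_words_cut precisely, so its role here (protecting that many boundary words of each overlap from being overwritten when merging predictions) is an assumption, as are the function names:

```python
# Sketch of overlapped chunking and merging for long or streaming input.
# The merging rule driven by min_words_cut is assumed, not taken from [9].

def split_with_overlap(words, chunk_size=30, overlap=15):
    """Yield (start_index, chunk) pairs; consecutive chunks share `overlap` words."""
    step = chunk_size - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield start, words[start:start + chunk_size]

def merge_predictions(chunk_preds, min_words_cut=7):
    """Merge per-chunk label lists, keeping earlier predictions for the first
    `min_words_cut` words of each overlap (assumed interpretation)."""
    merged = []
    for start, labels in chunk_preds:
        for i, label in enumerate(labels):
            pos = start + i
            if pos < len(merged):
                if i >= min_words_cut:      # overwrite only past the protected prefix
                    merged[pos] = label
            else:
                merged.append(label)
    return merged
```

Sweeping min_words_cut from 0 to 15, as in the excerpt, then simply means re-running the merge step with different values of that parameter.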
“…An example is shown in Figure 4. We prepared two formats of training data: plain text and encoded text [9]. Both formats take the lowercase text without punctuation as input.…”
Section: Data Preparation (mentioning)
confidence: 99%
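A small sketch of the two data formats mentioned above: both share the same lowercase, unpunctuated input, while the targets differ. The concrete encoded-text scheme of [9] is not shown in the excerpt, so the per-word case/punctuation tags below are an assumption:

```python
# Sketch: build plain-text and encoded-text training pairs from one sentence.
# The encoded tags (case flag + following punctuation) are assumed, not [9]'s exact scheme.

def make_examples(sentence):
    src, plain_tgt, encoded_tgt = [], [], []
    for token in sentence.split():
        word, punct = (token[:-1], token[-1]) if token[-1] in ",.?!" else (token, "")
        src.append(word.lower())                      # input: lowercase, no punctuation
        plain_tgt.append(word + punct)                # plain-text target: original token
        case = "U" if word[:1].isupper() else "L"
        encoded_tgt.append(f"{case}_{punct or 'O'}")  # encoded-text target: tags only
    return " ".join(src), " ".join(plain_tgt), " ".join(encoded_tgt)

print(make_examples("Hello there, how are you today?"))
# ('hello there how are you today',
#  'Hello there, how are you today?',
#  'U_O L_, L_O L_O L_O L_?')
```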
“…Previous work studied different decoding strategies to reduce latency in real-time spoken language processing systems, including overlapping windows [7], a streaming input scheme [8], and an overlapped-chunk split-and-merge strategy [9]. However, the input text for inference in these decoding strategies does not always begin with the first word of a sentence.…”
Section: Introduction (mentioning)
confidence: 99%
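The latency concern raised in this excerpt comes from the fact that a real-time system must emit output before a sentence is complete, so inference windows rarely start at a sentence boundary. A minimal sketch of such a streaming loop (the window/commit sizes and the predict_labels placeholder are assumptions, not any cited system's actual scheme):

```python
# Sketch of a streaming decoding loop: label only the oldest words of a
# sliding window, whose left context is already fixed.
from collections import deque

def predict_labels(words):
    # Placeholder for a trained punctuation model: one label per word.
    return ["O"] * len(words)

def stream_decode(word_stream, window=20, commit=10):
    """Emit (word, label) pairs with a latency of roughly `window` words."""
    buffer = deque()
    for word in word_stream:
        buffer.append(word)
        if len(buffer) >= window:
            labels = predict_labels(list(buffer))
            for _ in range(commit):          # commit the stable prefix only
                yield buffer.popleft(), labels.pop(0)
    for pair in zip(list(buffer), predict_labels(list(buffer))):  # flush the tail
        yield pair

for word, label in stream_decode("so this is a short test of the loop".split(), window=4, commit=2):
    print(word, label)
```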
“…Punctuation is essential for grammaticality, readability, and, in the case of a number of different tasks, subsequent processing. Thus, correct sentence segmentation and punctuation of recognized speech improve the quality of machine translation [6,7,24,26], and missing periods and commas in machine-generated text result in suboptimal information extraction from speech [13,15]. Also, most data-driven parsing models use punctuation as features.…”
Section: Introduction (mentioning)
confidence: 99%