Paragraph-based prosodic cues for speech synthesis applications

Statistical Language and Speech Processing

Wanner

2017


Abstract. Until very recently, punctuation marks for automatic speech recognition (ASR) output have mostly been generated by looking at the syntactic structure of the recognized utterances. Prosodic cues such as breaks, speech rate, and pitch intonation, which influence the placement of punctuation marks in speech transcripts, have seldom been used. We propose a method that uses recurrent neural networks, taking prosodic and lexical information into account in order to predict punctuation marks for raw ASR output. Our experiments show that an attention mechanism over parallel sequences of prosodic cues aligned with transcribed speech improves the accuracy of punctuation generation.

Section: Data (mentioning)

confidence: 99%


Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech

Öktem

Statistical Language and Speech Processing

Wanner

2017


“…Based on previous analysis of paragraph prosody [18], we calculated aggregate statistics for each sentence: mean, standard deviation, maximum, minimum, median, slope, range (99th-1st quantiles). We also record the values for the previous and next sentences, as well as their differences to the target, and the difference between the first and last word of the target.…”

Section: Prosodic Features (mentioning)

confidence: 99%
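The sentence-level aggregation described in the quoted passage can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: it assumes each sentence is a non-empty list of per-word values for a single prosodic feature (for example mean F0 per word), and the names `percentile`, `slope`, `sentence_stats`, and `paragraph_features` are hypothetical. The linear interpolation used for the percentiles, the population standard deviation, the last-minus-first sign convention, and the use of NaN for a missing neighbour are all assumptions.

```python
from statistics import mean, median, pstdev

# The seven aggregates named in the quoted passage.
STAT_KEYS = ("mean", "std", "max", "min", "median", "slope", "range")


def percentile(sorted_vals, q):
    """Linearly interpolated percentile q (0-100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def slope(vals):
    """Least-squares slope of vals against word index; 0.0 for a single value."""
    n = len(vals)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2.0
    y_bar = mean(vals)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(vals))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def sentence_stats(values):
    """Aggregate one prosodic feature over the words of a single sentence."""
    v = [float(x) for x in values]
    s = sorted(v)
    return {
        "mean": mean(v),
        "std": pstdev(v),  # population standard deviation (assumption)
        "max": s[-1],
        "min": s[0],
        "median": median(v),
        "slope": slope(v),
        "range": percentile(s, 99) - percentile(s, 1),
        # Last word minus first word; the sign convention is an assumption.
        "first_last_diff": v[-1] - v[0],
    }


def paragraph_features(sentences):
    """Add previous/next sentence statistics and their differences to the target."""
    stats = [sentence_stats(s) for s in sentences]
    rows = []
    for i, cur in enumerate(stats):
        row = dict(cur)
        for name, j in (("prev", i - 1), ("next", i + 1)):
            other = stats[j] if 0 <= j < len(stats) else None
            for key in STAT_KEYS:
                # NaN marks a neighbour that does not exist (first/last sentence).
                val = other[key] if other is not None else float("nan")
                row[f"{name}_{key}"] = val
                row[f"{name}_{key}_diff"] = val - cur[key]
        rows.append(row)
    return rows


rows = paragraph_features([[100, 110, 120], [200, 190, 180, 170], [150]])
```

Each row then holds the target sentence's own statistics plus `prev_*` and `next_*` columns, which could be fed to any of the classifiers mentioned in the cited work.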

“…Similarly, prosodic features based on pitch, energy and timing have been used to perform topic segmentation on their own [13,14,15] or in conjunction with lexical features [8,12,16,17]. While pause duration appears to be the most robust segmentation cue, paragraphs also seem to follow general prosodic declination and reset patterns [18]. So, we expect prosody to be informative of paragraph breaks.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic Paragraph Segmentation with Lexical and Prosodic Features

Lai

Moore

2016

Interspeech 2016


As long-form spoken documents become increasingly ubiquitous in everyday life, so does the need for automatic discourse segmentation in spoken language processing tasks. Although previous work has focused on broad topic segmentation, detection of finer-grained discourse units, such as paragraphs, is highly desirable for presenting and analyzing spoken content. To better understand how different aspects of speech cue these subtle discourse transitions, we investigate automatic paragraph segmentation of TED talks. We build lexical and prosodic paragraph segmenters using Support Vector Machines, AdaBoost, and Long Short-Term Memory (LSTM) recurrent neural networks. In general, we find that induced cue words and supra-sentential prosodic features outperform features based on topical coherence, syntactic form and complexity. However, our best performance is achieved by combining a wide range of individually weak lexical and prosodic features, with the sequence-modelling LSTM generally outperforming the other classifiers by a large margin. Moreover, we find that models that allow lower-level interactions between different feature types produce better results than treating lexical and prosodic contributions as separate, independent information sources.

Corpora compilation for prosody-informed speech processing

Öktem

Lang Resources & Evaluation

Bonafonte

2021


Research on speech technologies requires spoken data, which is usually obtained through recordings of read speech and specifically adapted to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural and conversational speech, which is usually costly and difficult to obtain. This paper presents a machine learning-oriented toolkit for collecting, handling, and visualizing speech data using prosodic heuristics. We present two corpora resulting from these methodologies: the PANTED corpus, containing 250 h of English speech from TED Talks, and the Heroes corpus, containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.