Interspeech 2020
DOI: 10.21437/interspeech.2020-3074
Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech

Cited by 17 publications (12 citation statements). References: 0 publications.
“…Moreover, in Reference [96], sentiment was predicted with the help of a multimodal approach. Punctuation prediction from conversational speech, using semi-supervised multimodal fusion techniques, is presented in Reference [97]. A hierarchical fusion technique was used for sentiment analysis on TAF data and social images [98][99][100].…”
Section: Discussion (mentioning)
confidence: 99%
“…We instead use a single prediction for each token, and we find that we can achieve superior performance using much smaller context windows than [1]. Finally, [17,18] apply transformers to punctuation prediction using lexical and prosodic features, which are aligned using pre-trained feature extractors and alignment networks. In contrast to [17,18], we use forced alignment from ASR and learn acoustic features from scratch from the spectrogram segments corresponding to each text token.…”
Section: Related Work (mentioning)
confidence: 99%
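
The excerpt above describes slicing the spectrogram into per-token segments using forced-alignment timestamps from ASR and learning acoustic features from scratch, then combining them with lexical features for punctuation prediction. A minimal PyTorch sketch of that idea follows; the module, its dimensions, and the fusion by concatenation are illustrative assumptions, not the cited papers' actual implementation.

```python
# Hedged sketch: per-token acoustic features learned from scratch out of
# spectrogram segments delimited by ASR forced-alignment timestamps.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class TokenAcousticEncoder(nn.Module):
    """Learns an acoustic embedding for one token's spectrogram segment."""
    def __init__(self, n_mels: int = 80, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (n_mels, n_frames) for a single token
        h = self.conv(segment.unsqueeze(0))   # (1, out_dim, n_frames)
        return h.mean(dim=-1).squeeze(0)      # mean-pool over time -> (out_dim,)

def token_segments(spectrogram, alignments, frame_rate=100):
    """Slice a (n_mels, T) spectrogram into per-token segments using
    (start_sec, end_sec) word timings from ASR forced alignment."""
    for start, end in alignments:
        s = int(start * frame_rate)
        e = max(int(end * frame_rate), s + 1)  # guarantee a non-empty segment
        yield spectrogram[:, s:e]

# Usage: fuse each token's learned acoustic embedding with its lexical
# embedding (e.g., GloVe) before a punctuation classifier; simple
# concatenation is one possible fusion choice.
encoder = TokenAcousticEncoder()
spec = torch.randn(80, 500)                    # toy log-mel spectrogram
aligns = [(0.0, 0.4), (0.4, 1.1), (1.1, 1.5)]  # toy ASR word timings (sec)
acoustic = torch.stack([encoder(seg) for seg in token_segments(spec, aligns)])
lexical = torch.randn(3, 300)                  # toy lexical embeddings
fused = torch.cat([lexical, acoustic], dim=-1)  # (n_tokens, 300 + 128)
```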
“…We seek to understand the effects of a multimodal approach on punctuation prediction with varying amounts of future information. While multimodal approaches are common for punctuation prediction [10,18,16], we are the first to incorporate acoustic features learned from scratch using forced alignment from ASR rather than relying on other data to pre-train or hand-select acoustic features.…”
Section: Podcast Task (mentioning)
confidence: 99%
“…The speech signal holds cues, such as pauses and intonation patterns, for predicting punctuation marks [14]. Incorporating speech cues into text-based models is explored in [15,16] and has shown improvements in punctuation prediction. The distribution mismatch between the text and conversational domains can be mitigated by retrofitting word embeddings to the target domain [17] when GloVe [18] embeddings are used in the model.…”
Section: Related Work (mentioning)
confidence: 99%
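
The last excerpt mentions retrofitting GloVe embeddings to the conversational target domain [17]. Below is a minimal sketch of a retrofitting-style update: each vector is pulled toward its in-domain neighbors while staying close to its original embedding. The toy lexicon, vectors, and hyperparameters are illustrative assumptions, not the cited paper's setup.

```python
# Hedged sketch of retrofitting word embeddings to a target domain.
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0, beta=1.0):
    """Iteratively move each embedding toward the mean of its in-domain
    neighbors, anchored by its original (e.g., GloVe) vector.

    vectors: dict word -> np.ndarray (original embeddings)
    lexicon: dict word -> list of related in-domain words
    """
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in lexicon.items():
            nbrs = [n for n in neighbors if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            # Weighted mean of the original vector and current neighbor vectors.
            total = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = total / (alpha + beta * len(nbrs))
    return new_vecs

# Toy usage with 4-d vectors standing in for GloVe embeddings and a tiny
# lexicon of conversational fillers (both purely illustrative).
vecs = {"uh": np.ones(4), "um": np.zeros(4), "yeah": np.full(4, 0.5)}
lex = {"uh": ["um"], "um": ["uh", "yeah"]}
domain_vecs = retrofit(vecs, lex)
```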