Improving ASR by integrating lecture audio and slides

Miranda, J. M.; Neto, João Paulo; Black, Alan W.

doi:10.1109/icassp.2013.6639249

Cited by 7 publications

(3 citation statements)

References 11 publications

“…This approach was examined for several types of target contents: a pair of the lecture speech transcription and its lecture slide [7], and a pair of the discussion speech transcription and its target newspaper article [8]. The other approach, a machine translation based method, was employed to align a lecture speech signal and its lecture slide, in order to improve automatic speech recognition performance [11]. This paper focuses into the alignment problem between lecture utterances and lecture slide components, thus, obviously belongs to the second category.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Alignment Between Classroom Lecture Utterances and Slide Components

Tsuchiya, Minamiguchi 2017

Interspeech 2017

Multimodal alignment between classroom lecture utterances and lecture slide components is one of the crucial problems to realize a multimodal e-Learning application. This paper proposes the new method for the automatic alignment, and formulates the alignment as the integer linear programming (ILP) problem to maximize the score function which consists of three factors: the similarity score between utterances and slide components, the consistency of the explanation order, and the explanation coverage of slide components. The experimental result on the Corpus of Japanese classroom Lecture Contents (CJLC) shows that the automatic alignment information acquired by the proposed method is effective to improve the performance of the automatic extraction of important utterances.

“…If only ASR technology is used, it may lead to the wrong recognition of proprietary entities in the current slide. In the field of ASR assisted by slides, some early papers use slides to build language models [22,23], while others use complete static slides to extract rare words and improve results using a contextual bias ASR model [24].…”

Section: Introduction (mentioning)

confidence: 99%

Assembling Alibaba:

Lin 2021

Engaging Social Media in China

Masked Language Modeling (MLM) is widely used to pretrain language models. The standard random masking strategy in MLM causes the pre-trained language models (PLMs) to be biased towards high-frequency tokens. Representation learning of rare tokens is poor and PLMs have limited performance on downstream tasks. To alleviate this frequency bias issue, we propose two simple and effective Weighted Sampling strategies for masking tokens based on token frequency and training loss. We apply these two strategies to BERT and obtain Weighted-Sampled BERT (WSBERT). Experiments on the Semantic Textual Similarity benchmark (STS) show that WSBERT significantly improves sentence embeddings over BERT. Combining WSBERT with calibration methods and prompt learning further improves sentence embeddings. We also investigate fine-tuning WSBERT on the GLUE benchmark and show that Weighted Sampling also improves the transfer learning capability of the backbone PLM. We further analyze and provide insights into how WSBERT improves token embeddings.

“…Assuming that the Word Error Rate (WER) metric is not relevant enough to compare the ASR system performance for such specific tasks [1,2], we explore the use of more relevant evaluation metrics to analyse the effects of the ASR language model adaptation. Language Model (LM) adaptation of spoken lectures is a well-known issue in the literature [3,4,5,6,7,8,9]. In 2002, [10] authors already demonstrated that the use of a topic-related vocabulary improves speech recognition and indexing for video lectures.…”

Section: Introduction (mentioning)

confidence: 99%

Qualitative Evaluation of ASR Adaptation in a Lecture Context: Application to the PASTEL Corpus

Mdhaffar, Estève, Hernández et al. 2019

Interspeech 2019

Lectures are usually known to be highly specialised in that they deal with multiple and domain specific topics. This context is challenging for Automatic Speech Recognition (ASR) systems since they are sensitive to topic variability. Language Model (LM) adaptation is a commonly used technique to address the mismatch problem between training and test data. In this paper, we are interested in a qualitative analysis in order to relevantly compare the accuracy of the LM adaptation. While word error rate is the most common metric used to evaluate ASR systems, we consider that this metric cannot provide accurate information. Consequently, we explore the use of other metrics based on individual word error rate, indexability, and capability of building relevant requests for information retrieval from the ASR outputs. Experiments are carried out on the PASTEL corpus, a new dataset in French language, composed of lecture recordings, manual chaptering, manual transcriptions, and slides. While an adapted LM allows us to reduce the global classical word error rate by 15.62% in relative, we show that this reduction reaches 44.2% when computed on relevant words only. These observations are confirmed with the high LM adaptation gains obtained with indexability and information retrieval metrics.
