2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003857
Generalized Large-Context Language Models Based on Forward-Backward Hierarchical Recurrent Encoder-Decoder Models

Cited by 5 publications (5 citation statements) · References 29 publications
“…This section details our proposed large-context knowledge distillation method as an effective training method of large-context E2E-ASR models. Our key idea is to mimic the behavior of a large-context language model [9][10][11][12][13] pre-trained from the same training datasets. A large-context language model defines the generation probability of a sequence of utterance-level texts W = {W_1, ..., W_T} as…”
Section: Large-context Knowledge Distillation (mentioning)
confidence: 99%
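The quotation above is truncated before the defining equation. As a rough illustration only, the sketch below shows one common way a large-context language model can factorize P(W) over utterance-level texts W = {W_1, ..., W_T}: each utterance is generated conditioned on a recurrent summary of all preceding utterances. The GRU-based architecture, module names, and layer sizes are assumptions for illustration, not the exact model of the cited work.

```python
import torch
import torch.nn as nn

class LargeContextLM(nn.Module):
    """Toy hierarchical recurrent LM: P(W) = prod_t P(W_t | W_1, ..., W_{t-1})."""

    def __init__(self, vocab_size, emb_dim=256, utt_dim=512, ctx_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utt_encoder = nn.GRU(emb_dim, utt_dim, batch_first=True)   # word-level encoder
        self.ctx_encoder = nn.GRU(utt_dim, ctx_dim, batch_first=True)   # utterance-level encoder
        self.decoder = nn.GRU(emb_dim + ctx_dim, utt_dim, batch_first=True)
        self.out = nn.Linear(utt_dim, vocab_size)

    def forward(self, utterances):
        """utterances: list of T LongTensors, each of shape (1, len_t)."""
        log_prob = 0.0
        utt_summaries = []                      # encodings of W_1 .. W_{t-1}
        for W_t in utterances:
            emb = self.embed(W_t)               # (1, len_t, emb_dim)
            if utt_summaries:                   # context vector over preceding utterances
                ctx_seq = torch.stack(utt_summaries, dim=1)
                _, h_ctx = self.ctx_encoder(ctx_seq)
                ctx = h_ctx[-1]                 # (1, ctx_dim)
            else:
                ctx = emb.new_zeros(1, self.ctx_encoder.hidden_size)
            # predict each word of W_t from its previous words plus the document context
            dec_in = torch.cat(
                [emb, ctx.unsqueeze(1).expand(-1, emb.size(1), -1)], dim=-1)
            dec_out, _ = self.decoder(dec_in)
            logits = self.out(dec_out[:, :-1])  # predict tokens 2..len_t
            log_p = torch.log_softmax(logits, dim=-1)
            log_prob = log_prob + log_p.gather(-1, W_t[:, 1:].unsqueeze(-1)).sum()
            _, h_utt = self.utt_encoder(emb)    # summarize W_t for later contexts
            utt_summaries.append(h_utt[-1])
        return log_prob                         # log P(W) under this toy model
```

In the knowledge-distillation setting described in the quotation, such a model would be pre-trained on the same training transcripts and its output distributions used as soft targets for the large-context E2E-ASR model.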
“…In the decoder, both the continuous representations produced by the hierarchical transformer and the input speech contexts are simultaneously taken into consideration using two multi-head source-target attention layers. Moreover, since it is difficult to effectively exploit large contexts beyond utterance boundaries, we also propose a large-context knowledge distillation method using a large-context language model [9][10][11][12][13]. This method enables our large-context E2E-ASR model to use large contexts beyond utterance boundaries by mimicking the behavior of the pre-trained large-context language model.…”
Section: Introduction (mentioning)
confidence: 99%
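To make the two source-target attention blocks mentioned in the quotation concrete, here is a minimal, hedged sketch of a decoder layer that attends separately to the speech-encoder outputs and to the hierarchical text-context representations. The attention ordering, normalization placement, and layer sizes are assumptions for illustration, not the cited architecture.

```python
import torch.nn as nn

class DualAttentionDecoderLayer(nn.Module):
    """Decoder layer with two source-target attention blocks (illustrative)."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.speech_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.context_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, speech_memory, context_memory, tgt_mask=None):
        # masked self-attention over previously emitted tokens
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # source-target attention #1: input speech contexts (speech-encoder outputs)
        x = self.norms[1](x + self.speech_attn(x, speech_memory, speech_memory)[0])
        # source-target attention #2: hierarchical-transformer text representations
        x = self.norms[2](x + self.context_attn(x, context_memory, context_memory)[0])
        return self.norms[3](x + self.ffn(x))
```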
“…Other studies have shown that large-context end-to-end methods offer a superior performance to utterance-level or sentence-level end-to-end methods in automatic speech recognition [25][26][27], machine translation [28][29][30], and response generation for dialogue systems [31,32]. Furthermore, large-context language models that can consider not only past but also future contexts have been presented [19]. In this paper, we utilize large-context language models for self-supervised learning specialized to conversational documents.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our concept is to estimate an utterance by using all the surrounding utterances. To this end, we introduce a novel large-context language model, which is an extended model of the forward-backward hierarchical recurrent encoder-decoder [19], so that we can estimate not only linguistic information but also speaker information. After performing the self-supervised learning, we utilize the pre-trained network for building state-of-the-art utterance-level sequential labeling based on hierarchical bidirectional long short-term memory recurrent neural network conditional random fields (H-BLSTM-CRF) [6,7].…”
Section: Introduction (mentioning)
confidence: 99%
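As a rough illustration of the forward-backward idea in the quotation above (predicting a target utterance from both past and future utterances, with an added speaker estimate), here is a hedged sketch. The bag-of-words word head, GRU encoders, and all names and sizes are simplifying assumptions, not the extended forward-backward hierarchical recurrent encoder-decoder of the cited work.

```python
import torch
import torch.nn as nn

class ForwardBackwardUtterancePredictor(nn.Module):
    """Predict a target utterance and its speaker from surrounding utterances (illustrative)."""

    def __init__(self, vocab_size, n_speakers, emb_dim=256, utt_dim=512, ctx_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utt_encoder = nn.GRU(emb_dim, utt_dim, batch_first=True)
        # separate utterance-level RNNs over past (forward) and future (backward) context
        self.fwd_ctx = nn.GRU(utt_dim, ctx_dim, batch_first=True)
        self.bwd_ctx = nn.GRU(utt_dim, ctx_dim, batch_first=True)
        self.word_head = nn.Linear(2 * ctx_dim, vocab_size)      # bag-of-words stand-in for a decoder
        self.speaker_head = nn.Linear(2 * ctx_dim, n_speakers)   # added speaker estimation

    def encode_utt(self, W):
        _, h = self.utt_encoder(self.embed(W))    # (1, len) -> (1, utt_dim)
        return h[-1]

    def forward(self, past_utts, future_utts):
        """past_utts / future_utts: non-empty lists of (1, len) LongTensors around the target."""
        fwd_in = torch.stack([self.encode_utt(W) for W in past_utts], dim=1)
        bwd_in = torch.stack([self.encode_utt(W) for W in reversed(future_utts)], dim=1)
        _, h_f = self.fwd_ctx(fwd_in)
        _, h_b = self.bwd_ctx(bwd_in)
        ctx = torch.cat([h_f[-1], h_b[-1]], dim=-1)   # (1, 2*ctx_dim)
        return self.word_head(ctx), self.speaker_head(ctx)
```

After such self-supervised pre-training, the quotation states that the pre-trained network is reused to build utterance-level sequential labeling with an H-BLSTM-CRF; that downstream model is not sketched here.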
“…Representative methods are used to distill knowledge from an external language model to improve the capturing of linguistic contexts [24,25]. Our proposed large-context knowledge distillation method is regarded as an extension of the latter methods to enable the capturing of all preceding linguistic contexts beyond utterance boundaries using large-context language models [9][10][11][12][13].…”
Section: Related Work (mentioning)
confidence: 99%