A Generative Model for Joint Natural Language Understanding and Generation

Tseng, Bo-Hsiang; Cheng, Jianpeng; Fang, Yimai; Vandyke, David

doi:10.18653/v1/2020.acl-main.163

Cited by 22 publications (35 citation statements)

References 33 publications (41 reference statements)

“…Task-Oriented Dialogue Models: Most task-oriented dialogue systems break down the task into three components: belief tracking (Henderson et al, 2013; Mrkšić et al, 2016; Rastogi et al, 2017; Nouri and Hosseini-Asl, 2018; Wu et al, 2019a; Zhou and Small, 2019; Heck et al, 2020), dialogue act prediction (Wen et al, 2017a; Tanaka et al, 2019), and response generation (Budzianowski et al, 2018; Lippe et al, 2020). Traditionally, a modular approach is adopted, where these components are optimized independently (i.e., a pipeline design) or learned via multi-task learning (i.e., some parameters are shared among the components) (Wen et al, 2017b; Zhao et al, 2019; Mehri et al, 2019; Tseng et al, 2020). However, it is known that improvements in one component do not necessarily lead to overall performance improvements (Ham et al, 2020), and the modular approach suffers from error propagation in practice (Liu and Lane, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

Pretraining the Noisy Channel Model for Task-Oriented Dialogue

Liu, Rimell, et al. 2021

Transactions of the Association for Computational Linguistics

Direct decoding for task-oriented dialogue is known to suffer from the explaining-away effect, manifested in models that prefer short and generic responses. Here we argue for the use of Bayes’ theorem to factorize the dialogue task into two models, the distribution of the context given the response, and the prior for the response itself. This approach, an instantiation of the noisy channel model, both mitigates the explaining-away effect and allows the principled incorporation of large pretrained models for the response prior. We present extensive experiments showing that a noisy channel model decodes better responses compared to direct decoding and that a two-stage pretraining strategy, employing both open-domain and task-oriented dialogue data, improves over randomly initialized models.
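
The factorization described in this abstract can be written out as a short derivation. The symbols below are assumptions introduced here for illustration (c for the dialogue context, r for the response); the abstract does not fix a notation, and the paper's actual decoding objective may include additional terms or weights that the abstract does not state.

```latex
% Direct decoding: choose the response that maximizes the direct model.
\hat{r}_{\text{direct}} = \arg\max_{r} \; p(r \mid c)

% Bayes' theorem rewrites the same posterior in terms of a channel model
% p(c | r) and a response prior p(r):
p(r \mid c) = \frac{p(c \mid r)\, p(r)}{p(c)}

% p(c) does not depend on r, so it drops out of the maximization:
\hat{r}_{\text{channel}} = \arg\max_{r} \; p(c \mid r)\, p(r)
```

Because the prior p(r) is a distribution over responses alone, it is the term for which a large pretrained language model can be used, as the abstract notes.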

“…Learning with Semi-Supervision. Work on semi-supervised learning considers settings with some labeled data and a much larger set of unlabeled data, and then leverages both labeled and unlabeled data as in machine translation (Artetxe et al, 2017; Lample et al, 2017), data-to-text generation (Schmitt and Schütze, 2019; Qader et al, 2019) or more relevantly the joint learning framework for training NLU and NLG (Tseng et al, 2020). Nonetheless, these approaches all assume that a large collection of text is available, which is an unrealistic assumption for the task due to the need for expert curation.…”

Section: Related Work (mentioning)

confidence: 99%

“…Natural language generation (NLG) is the task that transforms meaning representations (MR) into natural language descriptions (Reiter and Dale, 2000; Barzilay and Lapata, 2005); while natural language understanding (NLU) is the opposite process where text is converted into MR (Zhang and Wang, 2016). These two processes can thus constrain each other - recent exploration of the duality of neural natural language generation (NLG) and understanding (NLU) has led to successful semi-supervised learning techniques where both labeled and unlabeled data can be used for training (Tseng et al, 2020; Schmitt and Schütze, 2019; Qader et al, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

“…On the other hand, learning with weak supervision from noisy labels offers a potential solution as it automatically builds imperfect training sets from low-cost labeling rules or pretrained models (Zhou, 2018; Fries et al, 2020). Further, labeled data and large unlabeled data can be utilized in semi-supervised learning (Lample et al, 2017; Tseng et al, 2020), as a way to jointly improve both NLU and NLG models.…”

Section: Introduction (mentioning)

confidence: 99%

“…These noisy text labels are generated by a weak annotator, which is built upon recent works that directly fine-tune GPT-2 (Radford et al, 2019) on joint meaning representation (MR) and text (Mager et al, 2020; Harkous et al, 2020). Then, we jointly train the NLG and NLU models in a two-step process with semi-supervised learning objectives (Tseng et al, 2020). First, we use pretrained models to estimate quality scores for each sample.…”

Section: Introduction (mentioning)

confidence: 99%

Jointly Improving Language Understanding and Generation with Quality-Weighted Weak Supervision of Automatic Labeling

Chang, Demberg, Marin 2021

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Neural natural language generation (NLG) and understanding (NLU) models are data-hungry and require massive amounts of annotated data to be competitive. Recent frameworks address this bottleneck with generative models that synthesize weak labels at scale, where a small amount of training labels are expert-curated and the rest of the data is automatically annotated. We follow that approach by automatically constructing a large-scale weakly-labeled dataset with a fine-tuned GPT-2, and employ a semi-supervised framework to jointly train the NLG and NLU models. The proposed framework adapts the parameter updates to the models according to the estimated label quality. On both the E2E and Weather benchmarks, we show that this weakly supervised training paradigm is an effective approach under low-resource scenarios with as few as 10 data instances, and that it outperforms benchmark systems on both datasets when 100% of the training data is used.
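
One simple way to picture "adapting parameter updates according to estimated label quality" is to scale each sample's loss by its quality score before averaging. The sketch below is only an illustration under that assumption: the function name `quality_weighted_loss`, the normalization by the sum of scores, and the example numbers are all hypothetical and are not taken from the paper, whose actual weighting scheme is not given in the abstract.

```python
from typing import Sequence


def quality_weighted_loss(losses: Sequence[float],
                          qualities: Sequence[float]) -> float:
    """Average per-sample losses, weighting each by a quality score.

    Hypothetical illustration: samples with a higher estimated label
    quality contribute more to the loss (and hence to the update), and
    samples with quality 0 are ignored entirely.
    """
    if len(losses) != len(qualities):
        raise ValueError("losses and qualities must have the same length")
    total_quality = sum(qualities)
    if total_quality == 0:
        # No sample is trusted, so there is nothing to learn from.
        return 0.0
    weighted_sum = sum(l * q for l, q in zip(losses, qualities))
    return weighted_sum / total_quality


# Illustrative values only: the second sample's weak label is judged
# unreliable, so it contributes little to the combined loss.
example = quality_weighted_loss([1.0, 3.0], [0.9, 0.1])
```

In a real training loop the per-sample losses would come from the NLU and NLG models and the quality scores from whatever estimator is used; here both are plain floats so the weighting itself is easy to inspect.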
