Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.327

Human-centric dialog training via offline reinforcement learning

Abstract: How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online, and gather human feedback from real-time, open-ended conversations, which we then use to train and improve the models using offline reinforcement learning (RL). We identify implicit conversational cues including language similarity, elicitation of laughter, sentiment, and more, which indicate positive human feedback, a…

Cited by 23 publications (28 citation statements)
References 43 publications

“…The key advantage of off-policy updates is that samples from other sources, such as human-written text, can be used, making them more data efficient than on-policy methods. Previous work has used either importance weighted PG [42,68,25] or Q-learning based algorithms [16,23,38]. However, the off-policy methods have been considered to be less stable.…”
Section: (Static) Training Data
confidence: 99%
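The importance-weighted policy-gradient update this statement refers to can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch tensors of per-sample log-probabilities and human-feedback returns; the weight clipping is a common variance-reduction trick added here for the stability issue noted above, not something taken from the cited papers.

```python
import torch

def off_policy_pg_loss(logp_pi, logp_behavior, returns, clip=5.0):
    """Importance-weighted REINFORCE loss on logged (off-policy) data.

    logp_pi       -- log-probs of the logged actions under the current policy (requires grad)
    logp_behavior -- log-probs of the same actions under the behavior policy
                     that generated the data (e.g., the model hosted online)
    returns       -- per-sample returns derived from human feedback
    clip          -- cap on the importance weights to limit variance
    """
    # rho = pi(a|s) / mu(a|s); detached so the weight acts as a constant and
    # gradients flow only through log pi(a|s).
    rho = torch.exp(logp_pi.detach() - logp_behavior).clamp(max=clip)
    # Minimizing this loss ascends E_mu[ rho * R * grad log pi(a|s) ],
    # i.e., the importance-weighted policy gradient.
    return -(rho * returns * logp_pi).mean()
```

Q-learning based alternatives instead regress a learned Q-function toward Bellman targets computed from the same logged data, trading the variance of importance weights for the bias of value estimation.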
“…For example, the Q-learning performance relies heavily on how accurately the learned Q-function assesses the quality of intermediate subsequences, a challenging task due to the sparse reward signals (e.g., reward is received only after the whole sequence is generated). Further, previous work has largely focused on the extreme of using only off-policy data, mostly for offline training of chatbots [23]. As a result, the opportunity of directly improving the reward (as in on-policy updates) for other rich tasks is missed.…”
Section: (Static) Training Data
confidence: 99%
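To make the sparse-reward issue concrete, here is a minimal sketch of what the Bellman targets look like when the only reward arrives after the final token of a generated sequence; the function name and tensor layout are illustrative assumptions, not taken from the cited work.

```python
import torch

def q_learning_targets(q_next_max, terminal_reward, gamma=0.95):
    """Bellman targets for a length-T generated sequence with a sparse reward.

    q_next_max      -- tensor of shape (T,), max_a' Q(s_{t+1}, a') at each step
    terminal_reward -- scalar reward received only after the whole sequence
    gamma           -- discount factor
    """
    T = q_next_max.shape[0]
    rewards = torch.zeros(T)
    rewards[-1] = terminal_reward          # reward only at the end of the sequence
    # Target: r_t + gamma * max_a' Q(s_{t+1}, a'); no bootstrap past the final token.
    targets = rewards.clone()
    targets[:-1] += gamma * q_next_max[:-1]
    return targets
```

Every intermediate target is pure bootstrap, so the single end-of-sequence signal must propagate backward through the entire sequence, which is exactly why assessing the quality of partial subsequences is hard.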
“…Explicit human feedback has also been incorporated into reinforcement learning methods (Knox and Stone, 2009; Pilarski et al., 2011; Daniel et al., 2015; Mathewson and Pilarski, 2016; Warnell et al., 2018; MacGlashan et al., 2017; Arumugam et al., 2019), including in the context of dialogue system learning (Liu et al., 2018). Jaques et al. (2020) study forming a reward from implicit feedback for non-task-oriented dialogue language generation, by training multiple models to detect linguistic signals, such as sentiment and lexical overlap, that correlate with explicit user feedback. Learning from users has also been studied by asking users to rank system outputs (e.g., Wilson et al., 2012; Christiano et al., 2017), including for instruction following (Wang et al., 2016) and summarization (Stiennon et al., 2020).…”
Section: Related Work
confidence: 99%
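As a rough picture of the implicit-feedback reward described here, the sketch below combines a few detected conversational cues into a scalar. The specific cue set, the weights, and the word-level Jaccard proxy for language similarity are illustrative assumptions, not the actual reward construction of Jaques et al. (2020).

```python
def implicit_feedback_reward(user_turn, bot_turn,
                             sentiment_score, laughter_detected,
                             w_sentiment=1.0, w_laughter=1.0, w_overlap=0.5):
    """Combine implicit conversational cues into a scalar reward (illustrative).

    sentiment_score   -- e.g., a sentiment classifier's score on the user's reply, in [-1, 1]
    laughter_detected -- 1.0 if the user's reply contains laughter markers, else 0.0
    Language similarity is approximated with word-level Jaccard overlap
    between the user's and bot's turns.
    """
    user_words = set(user_turn.lower().split())
    bot_words = set(bot_turn.lower().split())
    overlap = len(user_words & bot_words) / max(len(user_words | bot_words), 1)
    return (w_sentiment * sentiment_score
            + w_laughter * laughter_detected
            + w_overlap * overlap)
```

A reward built this way can then stand in for explicit ratings when training the dialog policy offline on logged conversations.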
“…Prior works tackle this problem by ensuring that the learned policy stays "close" to the behavior policy via behavior regularization. This is achieved either by explicit constraints on the learned policy to only select actions where (s, a) has sufficient support under the behavior distribution (Fujimoto, Meger, and Precup 2019; Ghasemipour, Schuurmans, and Gu 2021); or by adding a regularization term that calculates some divergence metric between the learned policy and the behavior policy (Wu, Tucker, and Nachum 2019; Siegel et al. 2019; Zhang, Kuppannagari, and Prasanna 2020; Dadashi et al. 2021), e.g., KL divergence (Jaques et al. 2020) or Maximum Mean Discrepancy (MMD) (Kumar et al. 2019). While straightforward, these methods lack guaranteed performance improvement against the behavior policy.…”
Section: Introduction
confidence: 99%
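The KL-based behavior regularization attributed here to Jaques et al. (2020) can be pictured as a penalty on the divergence between the learned policy and the (frozen) behavior policy that produced the logged data. The sketch below is a generic per-batch version under assumed tensor shapes and weightings; it is not the exact objective of any cited paper.

```python
import torch
import torch.nn.functional as F

def kl_regularized_policy_loss(logits_pi, logits_behavior, q_values, kl_weight=0.1):
    """Policy loss with a KL(pi || behavior) penalty (behavior regularization, illustrative).

    logits_pi       -- (batch, num_actions) logits of the policy being learned
    logits_behavior -- (batch, num_actions) logits of the frozen behavior policy,
                       e.g., the pretrained language model that generated the data
    q_values        -- (batch, num_actions) learned Q-value estimates
    """
    logp_pi = F.log_softmax(logits_pi, dim=-1)
    logp_mu = F.log_softmax(logits_behavior.detach(), dim=-1)
    probs_pi = logp_pi.exp()

    # Expected Q-value under the learned policy (to be maximized).
    expected_q = (probs_pi * q_values.detach()).sum(dim=-1)
    # KL(pi || mu): penalizes putting probability mass on actions the
    # behavior policy would rarely take.
    kl = (probs_pi * (logp_pi - logp_mu)).sum(dim=-1)
    return (-expected_q + kl_weight * kl).mean()
```

Raising kl_weight keeps the learned policy closer to the logged behavior (safer, but less improvement); lowering it allows larger departures at the risk of exploiting poorly estimated Q-values, which is the trade-off this statement's critique is pointing at.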