Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.327

Human-centric dialog training via offline reinforcement learning

Abstract: How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online, and gather human feedback from real-time, open-ended conversations, which we then use to train and improve the models using offline reinforcement learning (RL). We identify implicit conversational cues including language similarity, elicitation of laughter, sentiment, and more, which indicate positive human feedback, a…

Cited by 23 publications (28 citation statements)
References 43 publications

“…The key advantage of off-policy updates is that samples from other sources, such as human-written text, can be used, making them more data efficient than on-policy methods. Previous work has used either importance weighted PG [42,68,25] or Q-learning based algorithms [16,23,38]. However, the off-policy methods have been considered to be less stable.…”
Section: (Static) Training Data
confidence: 99%
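The importance-weighted policy-gradient update this statement refers to can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch tensors of per-sample log-probabilities and human-feedback returns; the weight clipping is a common variance-reduction trick added here for the stability issue noted above, not something taken from the cited papers.

```python
import torch

def off_policy_pg_loss(logp_pi, logp_behavior, returns, clip=5.0):
    """Importance-weighted REINFORCE loss on logged (off-policy) data.

    logp_pi       -- log-probs of the logged actions under the current policy (requires grad)
    logp_behavior -- log-probs of the same actions under the behavior policy
                     that generated the data (e.g., the model hosted online)
    returns       -- per-sample returns derived from human feedback
    clip          -- cap on the importance weights to limit variance
    """
    # rho = pi(a|s) / mu(a|s); detached so the weight acts as a constant and
    # gradients flow only through log pi(a|s).
    rho = torch.exp(logp_pi.detach() - logp_behavior).clamp(max=clip)
    # Minimizing this loss ascends E_mu[ rho * R * grad log pi(a|s) ],
    # i.e., the importance-weighted policy gradient.
    return -(rho * returns * logp_pi).mean()
```

Q-learning based alternatives instead regress a learned Q-function toward Bellman targets computed from the same logged data, trading the variance of importance weights for the bias of value estimation.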
“…For example, the Q-learning performance relies heavily on how accurately the learned Q-function assesses the quality of intermediate subsequences, a challenging task due to the sparse reward signals (e.g., reward is received only after the whole sequence is generated). Further, previous work has largely focused on the extreme of using only off-policy data, mostly for offline training of chatbots [23]. As a result, the opportunity of directly improving the reward (as in on-policy updates) for other rich tasks is missed.…”
Section: (Static) Training Data
confidence: 99%
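To make the sparse-reward issue concrete, here is a minimal sketch of what the Bellman targets look like when the only reward arrives after the final token of a generated sequence; the function name and tensor layout are illustrative assumptions, not taken from the cited work.

```python
import torch

def q_learning_targets(q_next_max, terminal_reward, gamma=0.95):
    """Bellman targets for a length-T generated sequence with a sparse reward.

    q_next_max      -- tensor of shape (T,), max_a' Q(s_{t+1}, a') at each step
    terminal_reward -- scalar reward received only after the whole sequence
    gamma           -- discount factor
    """
    T = q_next_max.shape[0]
    rewards = torch.zeros(T)
    rewards[-1] = terminal_reward          # reward only at the end of the sequence
    # Target: r_t + gamma * max_a' Q(s_{t+1}, a'); no bootstrap past the final token.
    targets = rewards.clone()
    targets[:-1] += gamma * q_next_max[:-1]
    return targets
```

Every intermediate target is pure bootstrap, so the single end-of-sequence signal must propagate backward through the entire sequence, which is exactly why assessing the quality of partial subsequences is hard.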
“…Explicit human feedback has also been incorporated into reinforcement learning methods (Knox and Stone, 2009; Pilarski et al., 2011; Daniel et al., 2015; Mathewson and Pilarski, 2016; Warnell et al., 2018; MacGlashan et al., 2017; Arumugam et al., 2019), including in the context of dialogue system learning (Liu et al., 2018). Jaques et al. (2020) study forming a reward from implicit feedback for non-task-oriented dialogue language generation, by training multiple models to detect linguistic signals, such as sentiment and lexical overlap, that correlate with explicit user feedback. Learning from users has also been studied by asking users to rank system outputs (e.g., Wilson et al., 2012; Christiano et al., 2017), including for instruction following (Wang et al., 2016) and summarization (Stiennon et al., 2020).…”
Section: Related Work
confidence: 99%
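As a rough picture of the implicit-feedback reward described here, the sketch below combines a few detected conversational cues into a scalar. The specific cue set, the weights, and the word-level Jaccard proxy for language similarity are illustrative assumptions, not the actual reward construction of Jaques et al. (2020).

```python
def implicit_feedback_reward(user_turn, bot_turn,
                             sentiment_score, laughter_detected,
                             w_sentiment=1.0, w_laughter=1.0, w_overlap=0.5):
    """Combine implicit conversational cues into a scalar reward (illustrative).

    sentiment_score   -- e.g., a sentiment classifier's score on the user's reply, in [-1, 1]
    laughter_detected -- 1.0 if the user's reply contains laughter markers, else 0.0
    Language similarity is approximated with word-level Jaccard overlap
    between the user's and bot's turns.
    """
    user_words = set(user_turn.lower().split())
    bot_words = set(bot_turn.lower().split())
    overlap = len(user_words & bot_words) / max(len(user_words | bot_words), 1)
    return (w_sentiment * sentiment_score
            + w_laughter * laughter_detected
            + w_overlap * overlap)
```

A reward built this way can then stand in for explicit ratings when training the dialog policy offline on logged conversations.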
“…Prior works tackle this problem by ensuring that the learned policy stays "close" to the behavior policy via behavior regularization. This is achieved either by explicit constraints on the learned policy to only select actions where (s, a) has sufficient support under the behavior distribution (Fujimoto, Meger, and Precup 2019; Ghasemipour, Schuurmans, and Gu 2021); or by adding a regularization term that calculates some divergence metric between the learned policy and the behavior policy (Wu, Tucker, and Nachum 2019; Siegel et al. 2019; Zhang, Kuppannagari, and Prasanna 2020; Dadashi et al. 2021), e.g., KL divergence (Jaques et al. 2020) or Maximum Mean Discrepancy (MMD) (Kumar et al. 2019). While straightforward, these methods lack guaranteed performance improvement against the behavior policy.…”
Section: Introduction
confidence: 99%
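The KL-based behavior regularization attributed here to Jaques et al. (2020) can be pictured as a penalty on the divergence between the learned policy and the (frozen) behavior policy that produced the logged data. The sketch below is a generic per-batch version under assumed tensor shapes and weightings; it is not the exact objective of any cited paper.

```python
import torch
import torch.nn.functional as F

def kl_regularized_policy_loss(logits_pi, logits_behavior, q_values, kl_weight=0.1):
    """Policy loss with a KL(pi || behavior) penalty (behavior regularization, illustrative).

    logits_pi       -- (batch, num_actions) logits of the policy being learned
    logits_behavior -- (batch, num_actions) logits of the frozen behavior policy,
                       e.g., the pretrained language model that generated the data
    q_values        -- (batch, num_actions) learned Q-value estimates
    """
    logp_pi = F.log_softmax(logits_pi, dim=-1)
    logp_mu = F.log_softmax(logits_behavior.detach(), dim=-1)
    probs_pi = logp_pi.exp()

    # Expected Q-value under the learned policy (to be maximized).
    expected_q = (probs_pi * q_values.detach()).sum(dim=-1)
    # KL(pi || mu): penalizes putting probability mass on actions the
    # behavior policy would rarely take.
    kl = (probs_pi * (logp_pi - logp_mu)).sum(dim=-1)
    return (-expected_q + kl_weight * kl).mean()
```

Raising kl_weight keeps the learned policy closer to the logged behavior (safer, but less improvement); lowering it allows larger departures at the risk of exploiting poorly estimated Q-values, which is the trade-off this statement's critique is pointing at.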