Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1010

Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Abstract: Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations.

Cited by 57 publications (61 citation statements). References 27 publications.
“…Researchers generally hire human users on a crowdsourcing platform, and human evaluation can be conducted in two ways. One is indirect evaluation, in which annotators read the simulated dialogs between the dialog system and the user simulator, then rate a score [39] or state a preference among different systems [65] for each metric. The other is direct evaluation, in which participants interact with the system to complete a given task and rate their interaction experience.…”
Section: Human Evaluation
confidence: 99%
“…Instead of estimating the reward signals through annotated labels, inverse RL (IRL) aims to recover the reward function by observing expert demonstrations. Adversarial learning is often adopted for dialog reward estimation, distinguishing simulated from real user dialogs [64,65,95].…”
Section: User Goal Estimation
confidence: 99%
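To make the adversarial reward-estimation idea in the quote above concrete, here is a minimal PyTorch sketch, assuming a vectorized dialog state and action; the class name, architecture, and dimensions are illustrative, not taken from the cited papers. A discriminator is trained to tell real human dialog state-action pairs from simulated ones, and its confidence that a pair is real is reused as the learned reward for the policy.

    import torch
    import torch.nn as nn

    class RewardEstimator(nn.Module):
        """Discriminator over (state, action) pairs; its output doubles as a reward."""
        def __init__(self, state_dim, action_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state, action):
            # Raw logit: high when the pair looks like a real human dialog turn.
            return self.net(torch.cat([state, action], dim=-1))

        def reward(self, state, action):
            # AIRL-style reward: log-probability that the pair is "real".
            return torch.log(torch.sigmoid(self(state, action)) + 1e-8)

    def discriminator_loss(disc, real_s, real_a, sim_s, sim_a):
        # Standard GAN objective: real pairs labeled 1, simulated pairs labeled 0.
        bce = nn.BCEWithLogitsLoss()
        real_logits = disc(real_s, real_a)
        sim_logits = disc(sim_s, sim_a)
        return (bce(real_logits, torch.ones_like(real_logits))
                + bce(sim_logits, torch.zeros_like(sim_logits)))

In this family of methods, the discriminator and the dialog policy are updated alternately: the policy maximizes the estimated reward while the discriminator keeps refining what "real" looks like.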
“…apply RL to optimize dialogue systems; in particular, they optimize handcrafted reward signals such as ease of answering, information flow, and semantic coherence. A number of RL methods, including Q-learning (Peng et al., 2017; Lipton et al., 2018; Li et al., 2017a; Su et al., 2018) and policy gradient methods (Dhingra et al., 2016; Williams et al., 2017; Takanobu et al., 2019), have been applied to optimize dialogue policies by interacting with real users or user simulators. With the help of RL, the dialogue agent is able to explore contexts that may not exist in previously observed data.…”
Section: Optimizing Interactive Systems
confidence: 99%
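As a concrete instance of the policy-gradient family named in the quote above, the following REINFORCE-style sketch updates a dialog policy from one sampled dialog. It is a minimal example under stated assumptions, not the method of any cited paper: the state encoding and the reward source (a real user, a simulator, or a learned estimator) are assumed.

    import torch
    import torch.nn as nn

    class DialogPolicy(nn.Module):
        """Maps an encoded dialog state to a distribution over system dialog acts."""
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return torch.distributions.Categorical(logits=self.net(state))

    def reinforce_update(policy, optimizer, episode, gamma=0.99):
        """episode: list of (state, action, reward) tuples from one dialog."""
        # Discounted return from each turn to the end of the dialog.
        returns, g = [], 0.0
        for _, _, r in reversed(episode):
            g = r + gamma * g
            returns.insert(0, g)
        # REINFORCE: raise the log-probability of each action in proportion to its return.
        loss = sum(-policy(s).log_prob(a) * g
                   for (s, a, _), g in zip(episode, returns))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The same loop works whether the per-turn reward comes from a handcrafted signal or from a learned estimator such as the adversarial one sketched earlier.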
“…Monfort et al. (2015) use IRL to predict human motion when interacting with the environment. IRL has also been applied to dialogues to extract the reward function and model the user (Pietquin, 2013; Takanobu et al., 2019; Li et al., 2019, 2020). IRL is used to model user behavior in order to make predictions about it.…”
Section: Rewards For Interactive Systems
confidence: 99%
“…The core of SDS, dialogue management, can be formulated as an RL problem (Levin et al., 1997; Young et al., 2013; Williams, 2008). Great advancements can be achieved with deep RL algorithms (Dhingra et al., 2016; Chang et al., 2017; Takanobu et al., 2019; Wu et al., 2020). Yet, deep RL methods are notoriously expensive in terms of the number of interactions they require.…”
Section: Introduction
confidence: 99%
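For readers unfamiliar with the formulation in the quote above: treating dialogue management as RL means each turn is one MDP step, where the state is the tracked dialog (belief) state, the action is a system dialog act, and the reward reflects progress toward task success. The sample-cost concern arises because every transition in this loop costs one turn with a user or simulator. A minimal interaction loop might look like the sketch below; the policy.select_action and simulator.reset/step interfaces are hypothetical, not from the cited works.

    def run_dialog(policy, simulator, max_turns=20):
        """Roll out one simulated dialog; each turn is one MDP transition."""
        state = simulator.reset()                    # initial dialog/belief state
        trajectory = []
        for _ in range(max_turns):
            action = policy.select_action(state)     # pick a system dialog act
            next_state, reward, done = simulator.step(action)
            trajectory.append((state, action, reward))
            state = next_state
            if done:                                 # user goal reached or dialog abandoned
                break
        return trajectory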