Reinforcement learning (RL) can enable task-oriented dialogue systems to steer the conversation towards successful task completion. In an end-to-end setting, a response can be constructed in a word-level sequential decision-making process with the entire system vocabulary as the action space. Policies trained in such a fashion do not require expert-defined action spaces, but they have to deal with large action spaces and long trajectories, making RL impractical. Using the latent space of a variational model as the action space alleviates this problem. However, current approaches use an uninformed prior for training and optimize the latent distribution solely on the context. It is therefore unclear whether the latent representation truly encodes the characteristics of different actions. In this paper, we explore three ways of leveraging an auxiliary task to shape the latent variable distribution: through pre-training, through obtaining an informed prior, and through multitask learning. We choose response auto-encoding as the auxiliary task, as this captures the generative factors of dialogue responses while requiring low computational cost and neither additional data nor labels. Our approach yields more action-characterized latent representations, which support end-to-end dialogue policy optimization and achieve state-of-the-art success rates. These results warrant a more widespread use of RL in end-to-end dialogue models.
Task-oriented dialog systems rely on dialog state tracking (DST) to monitor the user's goal during the course of an interaction. Multidomain and open-vocabulary settings complicate the task considerably and demand scalable solutions. In this paper, we present a new approach to DST which makes use of various copy mechanisms to fill slots with values. Our model has no need to maintain a list of candidate values. Instead, all values are extracted from the dialog context on the fly. A slot is filled by one of three copy mechanisms: (1) span prediction may extract values directly from the user input; (2) a value may be copied from a system inform memory that keeps track of the system's inform operations; (3) a value may be copied over from a different slot that is already contained in the dialog state to resolve coreferences within and across domains. Our approach combines the advantages of span-based slot filling methods with memory methods to avoid the use of value picklists altogether. We argue that our strategy simplifies the DST task while at the same time achieving state-of-the-art performance on various popular evaluation sets including MultiWOZ 2.1, where we achieve a joint goal accuracy beyond 55%.
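The three copy mechanisms described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fill_slot`, the `mechanism` argument (standing in for whatever component selects a mechanism per slot), the inclusive token-index span, and the dictionary-based inform memory and dialog state are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def fill_slot(
    slot: str,
    mechanism: str,
    user_tokens: List[str],
    span: Optional[Tuple[int, int]],
    inform_memory: Dict[str, str],
    dialog_state: Dict[str, str],
    refer_slot: Optional[str],
) -> Dict[str, str]:
    """Return a copy of the dialog state with `slot` filled by one copy mechanism.

    All names and data layouts here are hypothetical, chosen for illustration.
    """
    new_state = dict(dialog_state)
    if mechanism == "span" and span is not None:
        # (1) Copy a token span directly from the user input (inclusive indices).
        start, end = span
        new_state[slot] = " ".join(user_tokens[start : end + 1])
    elif mechanism == "inform" and slot in inform_memory:
        # (2) Copy a value the system previously informed the user about.
        new_state[slot] = inform_memory[slot]
    elif mechanism == "refer" and refer_slot in dialog_state:
        # (3) Copy the value of another slot already in the dialog state.
        new_state[slot] = dialog_state[refer_slot]
    # Any other case leaves the slot unchanged.
    return new_state


# Hypothetical usage over three turns.
state: Dict[str, str] = {}
tokens = "i want food in the centre".split()
state = fill_slot("restaurant-area", "span", tokens, (5, 5), {}, state, None)
state = fill_slot("restaurant-name", "inform", [], None,
                  {"restaurant-name": "curry garden"}, state, None)
state = fill_slot("hotel-area", "refer", [], None, {}, state, "restaurant-area")
```

Because every value comes from the user input, the inform memory, or the existing state, no list of candidate values is needed, which is the property the abstract emphasizes.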
An emotionally competent computer agent could be a valuable assistive technology for various affective tasks, for example caring for the elderly, providing low-cost ubiquitous chat therapy, and offering emotional support in general by promoting a more positive emotional state through dialogue system interaction. However, despite the increase of interest in this task, existing works face a number of shortcomings: system scalability, restrictive modeling, and weak emphasis on maximizing user emotional experience. In this paper, we build a fully data-driven chat-oriented dialogue system that can dynamically mimic affective human interactions by utilizing a neural network architecture. In particular, we propose a sequence-to-sequence response generator that considers the emotional context of the dialogue. An emotion encoder is trained jointly with the entire network to encode and maintain the emotional context throughout the dialogue. The encoded emotion information is then incorporated in the response generation process. We train the network with a dialogue corpus that contains positive-emotion-eliciting responses, collected through crowd-sourcing. Objective evaluation shows that incorporating emotion into the training process helps reduce the perplexity of the generated responses, even when a small dataset is used. Subsequent subjective evaluation shows that the proposed method produces responses that are more natural and more likely to elicit a positive emotion.
In this paper, we present the Indonesian Emotional Speech Corpus (IDESC), the first corpus in Indonesian that contains varied emotional content. As human-computer interaction moves toward the most natural form possible, incorporating emotion becomes increasingly urgent. However, in Indonesian, this aspect is yet to be explored. The acquisition of an emotion corpus serves as a foundation for further research on the subject. In constructing IDESC, we aim at natural and real emotion that is applicable to human-computer interaction. The corpus consists of three episodes of Indonesian talk shows in different genres: politics, humanity, and entertainment. Each episode is carefully segmented and labeled based on its emotional content, resulting in 2179 segments totaling 1 hour, 34 minutes, and 49.7 seconds of speech. The corpus is still in its early stage of development, yielding exciting possibilities for future work.
Social-affective aspects of interaction play a vital role in making human communication a rich and dynamic experience. Observation of complex emotional phenomena requires rich sets of labeled data of natural interaction. Although there has been an increase of interest in constructing corpora containing social interactions, there is still a lack of spontaneous and emotionally rich corpora. This paper presents a corpus of social-affective interactions in English and Indonesian, constructed from various television talk shows, containing natural conversations and real emotion occurrences. We carefully annotate the corpus in terms of emotion and discourse structure to allow for the aforementioned observation. The corpus is still in its early stage of development, yielding wide-ranging possibilities for future work.