Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System

Chauhan, Hardik; Firdaus, Mauajama; Ekbal, Asif; Bhattacharyya, Pushpak

doi:10.18653/v1/p19-1540

Cited by 29 publications

(28 citation statements)

References 29 publications

(16 reference statements)

“…The user utterances play a significant role in building the dialogue context for generating coherent and appropriate responses, in accordance to the user demands. We focus on generating the textual responses only in a similar manner as [8,12,13]. Here, the task of multi-modal dialog generation is defined as follows: we consider both the modalities, i.e.…”

Section: Plos One (mentioning)

confidence: 99%

“…Earlier works on the MMD dataset reported in [12,13,44] used the hierarchical encoder-decoder model to generate responses by capturing information from text, images and the knowledge base. Recently, [8] proposed attribute and position-aware attention for generating textual responses. The authors in [45] used an hierarchical attention mechanism for generating responses on the MMD dataset.…”

Section: Multimodal Dialogue Systems (mentioning)

confidence: 99%

“…In this model, we concatenate pairwise text and image features U f and I f obtained after parallel co-attention mechanism as input to the MFB module for better interaction between the modalities as used in [8].…”

Section: Model 6 (Mhred+kb+a+pca(it)+mfb(it)) (mentioning)

confidence: 99%

“…In Table 2, we provide the results of our proposed framework in comparison to the existing state-of-the-art methods. For response generation in multi-modal systems, we compare our current work with the existing systems [8,13]. For a fair comparison with the existing systems, the data used for experimentation should be similar in terms of its structure, genre as well as annotation in order to draw correct conclusions on the results obtained.…”

Section: Comparisons To the Existing Systems (mentioning)

confidence: 99%

“…Lately, several works on multimodal dialogue systems [7][8][9] have encouraged research in this direction by combining information from the different modalities, such as texts, audios, videos and images. Multimodal conversational systems provide completeness to the existing dialogue systems by providing necessary information that lacks in unimodal systems.…”

Section: Introduction (mentioning)

confidence: 99%

More to diverse: Generating diversified responses in a task oriented multimodal dialog system

2020

Self Cite

Multimodal dialogue systems, owing to their many applications, have gained much attention from researchers and developers in recent times. With the release of the large-scale multimodal dialog dataset of Saha et al. (2018) in the fashion domain, it has become possible to investigate dialogue systems that have both textual and visual modalities. Response generation is an essential aspect of every dialogue system, and making the responses diverse is an important problem. For any goal-oriented conversational agent, the system’s responses must be informative, diverse and polite, which may lead to a better user experience. In this paper, we propose an end-to-end neural framework for generating varied responses in a multimodal dialogue setup, capturing information from both text and images. A multimodal encoder with co-attention between the text and image is used to focus on the different modalities and obtain better contextual information. For effective information sharing across the modalities, we combine the text and image information using the BLOCK fusion technique, which helps in learning an improved multimodal representation. We employ stochastic beam search with the Gumbel-Top-k trick to achieve diversified responses while preserving the content and politeness of the responses. Experimental results show that our proposed approach performs significantly better than the existing and baseline methods in terms of distinct metrics, and thereby generates more diverse responses that are informative, interesting and polite without any loss of information. Empirical evaluation also reveals that images, when used along with the text, improve the efficiency of the model in generating diversified responses.

Section: Plos One (mentioning)

confidence: 99%