Learning to Retrieve Videos by Asking Questions

Madasu, Avinash; Oliva, Junier B.; Bertasius, Gedas

doi:10.1145/3503161.3548361

Cited by 8 publications

(1 citation statement)

References 36 publications

“…With the progress in AI, the integration of information from different modalities, such as text, image, audio, and video, has been known to provide complete information for building effective end-to-end dialogue systems [58,27,39,37] by bringing the different areas of computer vision (CV) and natural language processing (NLP) together. Hence, a multimodal dialogue system bridges the gap between vision and language, ensuring interdisciplinary research.…”

Section: Introduction (mentioning)

confidence: 99%

Aspect-Aware Response Generation for Multimodal Dialogue System

Firdaus, Thakur, Ekbal. 2021. ACM Trans. Intell. Syst. Technol.

Multimodality in dialogue systems has opened up new frontiers for the creation of robust conversational agents. Any multimodal system aims at bridging the gap between language and vision by leveraging diverse and often complementary information from image, audio, and video, as well as text. For every task-oriented dialog system, different aspects of the product or service are crucial for satisfying the user’s demands. Based upon the aspect, the user decides upon selecting the product or service. The ability to generate responses with the specified aspects in a goal-oriented dialogue setup facilitates user satisfaction by fulfilling the user’s goals. Therefore, in our current work, we propose the task of aspect controlled response generation in a multimodal task-oriented dialog system. We employ a multimodal hierarchical memory network for generating responses that utilize information from both text and images. As there was no readily available data for building such multimodal systems, we create a Multi-Domain Multi-Modal Dialog (MDMMD++) dataset. The dataset comprises the conversations having both text and images belonging to the four different domains, such as hotels, restaurants, electronics, and furniture. Quantitative and qualitative analysis on the newly created MDMMD++ dataset shows that the proposed methodology outperforms the baseline models for the proposed task of aspect controlled response generation.

Dialogue-to-Video Retrieval

Lyu, Nguyen, Ninh, et al. 2023. Lecture Notes in Computer Science

Recent years have witnessed an increasing amount of dialogue/conversation on the web especially on social media. That inspires the development of dialogue-based retrieval, in which retrieving videos based on dialogue is of increasing interest for recommendation systems. Different from other video retrieval tasks, dialogue-to-video retrieval uses structured queries in the form of user-generated dialogue as the search descriptor. We present a novel dialogue-to-video retrieval system, incorporating structured conversational information. Experiments conducted on the AVSD dataset show that our proposed approach using plain-text queries improves over the previous counterpart model by 15.8% on R@1. Furthermore, our approach using dialogue as a query, improves retrieval performance by 4.
