Huda Alamri scite author profile

Audio Visual Scene-Aware Dialog

We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.

End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features

Hori, Alamri, Wang et al. 2019

Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on ∼9,000 videos. Using this new dataset, we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.

Dialog System Technology Challenge 7

Yoshino, Hori, Pérez et al. 2019

Preprint

This paper introduces the Seventh Dialog System Technology Challenges (DSTC), which use shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to various dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation and (3) audio visual scene aware dialog. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks and provided datasets. We also describe overall trends in the submitted systems and the key results. Each track introduced new datasets and participants achieved impressive results using state-of-the-art end-to-end technologies.

* Every author has equal contribution.

http://workshop.colips.org/dstc7

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Hori, Alamri, Wang et al. 2018

Preprint

A New Approach for Segmentation and Recognition of Arabic Handwritten Touching Numeral Pairs

Alamri, Suen 2009
