2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00207
On the hidden treasure of dialog in video question answering

Cited by 7 publications (5 citation statements)
References 28 publications
“…Our approach to collecting episodic experience for robot cognition draws inspiration from the method of Deniz Engin et al. [5] (which considers all modalities: vision, speech, and language) but takes a distinct route by treating speech recognition and video captioning as separate modalities. This allows us to leverage both speech and dialogue information independently, enhancing comprehension without requiring manual annotation or model training.…”
Section: Methods (mentioning)
confidence: 99%
“…These datasets contain multiple-choice questions, sourced from dialogues and visual scenes, requiring models to perform integrated reasoning to answer accurately. Studies by Deniz Engin et al. [5] and Noa Garcia et al. [6] have predominantly focused on understanding dialogues and visual scenes. Both report promising results on these datasets, with high accuracy on questions involving knowledge, visual understanding, and temporal reasoning.…”
Section: Episodic Memory (mentioning)
confidence: 99%
“…KnowIT is a knowledge-based VQA dataset with 24,282 human-annotated question-answer pairs. We compare i-Code with DiagSumQA (Engin et al. 2021) and variants of the knowledge-based VQA models ROCK and ROLL. As shown in Table 6, i-Code sets the new state of the art.…”
Section: Video Question and Answering (mentioning)
confidence: 99%
“…We compare i-Code with DiagSumQA (Engin et al., 2021) and variants of the knowledge-based VQA models ROCK and ROLL. Results are summarized in Table 6; i-Code achieves state-of-the-art results on the KnowIT dataset.…”
Section: Video Question and Answering (mentioning)
confidence: 99%