2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.00207
On the hidden treasure of dialog in video question answering

Cited by 7 publications (5 citation statements)
References 28 publications
“…Our approach to collecting episodic experience for robot cognition draws inspiration from the method of Deniz Engin et al. [5] (which considers all modalities: vision, speech, and language) but takes a distinct route by treating speech recognition and video captioning as separate modalities. This allows us to leverage both speech and dialogue information independently, enhancing comprehension without requiring manual annotation or model training.…”
Section: Methods (mentioning)
confidence: 99%
“…These datasets contain multiple-choice questions, sourced from dialogues and visual scenes, requiring models to perform integrated reasoning to answer accurately. Studies by Deniz Engin et al. [5] and Noa Garcia et al. [6] have predominantly focused on understanding dialogues and visual scenes. Both report promising results on these datasets, with high accuracy on questions involving knowledge, visual understanding, and temporal reasoning.…”
Section: Episodic Memory (mentioning)
confidence: 99%
“…KnowIT is a knowledge-based VQA dataset with 24,282 human-annotated question-answer pairs. We compare i-Code with DiagSumQA (Engin et al. 2021) and variants of the knowledge-based VQA models ROCK and ROLL. As shown in Table 6, i-Code sets the new state of the art.…”
Section: Video Question and Answering (mentioning)
confidence: 99%
“…We compare i-Code with DiagSumQA (Engin et al., 2021) and variants of the knowledge-based VQA models ROCK and ROLL. Results are summarized in Table 6; i-Code achieves state-of-the-art results on the KnowIT dataset.…”
Section: Video Question and Answering (mentioning)
confidence: 99%