2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv56688.2023.00442
Watching the News: Towards VideoQA Models that can Read

Abstract: Researchers have extensively studied vision and language, finding that both visual and textual content are crucial for understanding scenes effectively. In particular, comprehending text in videos is highly significant, as it requires both scene-text understanding and temporal reasoning. This paper explores two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer…

Cited by 5 publications (1 citation statement) · References 39 publications
“…Literature [2] and Literature [3] proposed integrating textual content into VQA, forming the tasks of TextVQA (Text Visual Question Answering) and ST-VQA (Scene Text Visual Question Answering), respectively, along with the construction of benchmark datasets. Fig 2 illustrates an example of the ST-VQA task, where the questions relate to the scene text in the image, requiring the model to establish unified collaborations between the question, visual targets, and scene text to generate correct answers.…”
Section: Introduction
confidence: 99%