2022
DOI: 10.48550/arxiv.2203.01225
Preprint

Video Question Answering: Datasets, Algorithms and Challenges

Abstract: Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos. It has earned increasing attention with recent research trends in joint vision and language understanding. Yet, compared with ImageQA, VideoQA is largely underexplored and progresses slowly. Although different algorithms have continually been proposed and shown success on different VideoQA datasets, we find that a meaningful survey to categorize them is lacking, which seriously impedes advancement…

Cited by 6 publications (9 citation statements)
References 9 publications
“…In recent years, the VideoQA paradigm has mainly followed a three-step process (Zhong et al 2022): first, extracting features using pre-trained models; second, performing feature interaction between videos and questions; and finally, classifying answers in an open-ended manner. Typically, pre-trained models from the computer vision field, such as ResNet (He et al 2016) and ResNeXt (Hara, Kataoka, and Satoh 2018), are used to extract video features, while word embedding vectors like GloVe (Pennington, Socher, and Manning 2014) or pre-trained models such as BERT (Devlin et al 2018) are used to extract question features in natural language processing.…”
Section: Related Work
confidence: 99%
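The three-step pipeline quoted above (pre-extracted features, cross-modal feature interaction, open-ended answer classification) maps directly onto a small model. Below is a minimal PyTorch sketch of that pattern; the module choices, feature dimensions, and answer-vocabulary size are illustrative assumptions, not taken from the surveyed paper or any citing work.

```python
# Minimal sketch of the three-step VideoQA pipeline: (1) features are
# pre-extracted offline, (2) question and video features interact via
# cross-attention, (3) answering is classification over a fixed answer
# vocabulary. All dimensions and module choices are illustrative.
import torch
import torch.nn as nn

class VideoQASketch(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Step 1 happens offline: video features (e.g., from a ResNet/ResNeXt
        # backbone) and question features (e.g., GloVe or BERT) are precomputed
        # and fed in; here we only project them to a shared space.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Step 2: cross-modal interaction, here question-to-video attention.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Step 3: open-ended answering as classification over an answer vocabulary.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, video_feats, question_feats):
        # video_feats: (batch, num_frames, video_dim)
        # question_feats: (batch, num_tokens, text_dim)
        v = self.video_proj(video_feats)
        q = self.text_proj(question_feats)
        # Question tokens attend over video frames.
        fused, _ = self.cross_attn(query=q, key=v, value=v)
        # Pool over question tokens, then score each candidate answer.
        return self.classifier(fused.mean(dim=1))

model = VideoQASketch()
logits = model(torch.randn(2, 16, 2048), torch.randn(2, 12, 768))
print(logits.shape)  # torch.Size([2, 1000])
```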
“…Video Question Answering (VideoQA) is a challenging task within the multimodal learning domain, aiming to understand videos and answer questions (Zhong et al 2022).…”
Section: Introduction
confidence: 99%
“…She is walking to the empty fridge because she mistakenly thinks that the food is in the fridge (belief), holding a false belief about the food. … significantly improve a model's comprehension of videos in complex scenes (Zhong et al 2022; Zellers et al 2019). ToM reasoning in VideoQA requires inferring hidden information and relationships related to human understanding, with the simultaneous verification of multiple skills as well as the integration of visual and auditory information.…”
Section: Introduction
confidence: 99%
“…This advancement stems in part from the success of multi-modal pretraining on web-scale vision-text data [8,21,31,34,38,44,52,53,54,63], and in part from unified deep neural networks that can model both vision and natural language data well, i.e., the transformer [55]. As a typical multi-disciplinary AI task, Video Question Answering (VideoQA) has benefited greatly from these developments, which have helped propel the field steadily forward over purely conventional techniques [14,16,20,23,28,60,71].…”
Section: Introduction
confidence: 99%