Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.417

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

Abstract: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA) while ensuring individual and combined processing of multiple input modalities. Our approach benefits from processing multimodal data (video and text) by adopting BERT encodings individually and using a novel transformer-based fusion method to fuse them together. Our method decomposes the different sources of modalities into different BERT instances with similar architectures but variable weights. Thi…
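To make the decomposition the abstract describes more concrete, here is a minimal PyTorch sketch of one encoder instance per input source with an identical architecture but independent weights, whose pooled outputs would later be fused. An embedding layer plus nn.TransformerEncoder stands in for a pre-trained BERT instance; all module names, sources, and sizes are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

def make_encoder(vocab=30522, dim=256, heads=4, layers=2):
    # One "BERT-style" encoder: embedding + transformer encoder (stand-in only).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.ModuleDict({
        "emb": nn.Embedding(vocab, dim),
        "enc": nn.TransformerEncoder(layer, num_layers=layers),
    })

class MultiStreamEncoder(nn.Module):
    def __init__(self, sources=("question", "video", "subtitles")):
        super().__init__()
        # Same architecture for every source, but separate (independent) weights.
        self.streams = nn.ModuleDict({s: make_encoder() for s in sources})

    def forward(self, inputs):                 # inputs: {source: (batch, seq) token ids}
        pooled = {}
        for name, ids in inputs.items():
            m = self.streams[name]
            h = m["enc"](m["emb"](ids))        # (batch, seq, dim)
            pooled[name] = h[:, 0]             # first-token summary, "[CLS]"-style
        return pooled                          # later fed to a fusion transformer

encoder = MultiStreamEncoder()
ids = {k: torch.randint(0, 30522, (2, 12)) for k in ("question", "video", "subtitles")}
print({k: v.shape for k, v in encoder(ids).items()})
```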

Cited by 15 publications (9 citation statements). References 38 publications.
“…Motivated by the success of the Transformer, Li et al. first introduce the Transformer architecture without pre-training to VideoQA (PSAC), which consists of two positional self-attention blocks to replace the LSTM and a video-question co-attention block to simultaneously attend to both visual and textual information. [Yang et al., 2020] and [Urooj et al., 2020] incorporate the pre-trained language-based Transformer (BERT) [Devlin et al., 2019] into movie and story understanding, which requires more modeling of language such as subtitles and dialogues. Both works process each of the input modalities, such as video and subtitles, with the question and a candidate answer, respectively, and later fuse the several streams for the final answer.…”
Section: Methods
Mentioning confidence: 99%
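The statement above describes pairing each modality (video, subtitles) with the question and a candidate answer to form one stream per modality, fused later. A purely illustrative sketch of that input construction follows; the field names, the [SEP] separator string, and the use of text to represent the video stream are assumptions for the sake of the example.

```python
def build_streams(question, candidates, subtitles, video_text):
    # One entry per candidate answer; each entry carries one sequence per modality.
    streams = []
    for ans in candidates:
        qa = f"{question} [SEP] {ans}"
        streams.append({
            "subtitle_stream": f"{qa} [SEP] {subtitles}",
            "video_stream":    f"{qa} [SEP] {video_text}",
        })
    return streams

streams = build_streams(
    question="Why did the character leave the room?",
    candidates=["They were angry", "They heard the doorbell"],
    subtitles="I can't believe you said that ...",
    video_text="person walking towards a door",
)
print(streams[0]["subtitle_stream"])
```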
“…VideoBERT [58], known as the first video-text pre-training model, extends the BERT model to process videos and texts simultaneously. VideoBERT uses a pre-trained ConvNet and S3D [133] to extract video features and concatenates them with textual word embeddings to feed into a transformer initialized with BERT. The ConvNet and S3D are frozen when training VideoBERT, which indicates the approach is not end-to-end.…”
Section: SOTA VLP Models
Mentioning confidence: 99%
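A rough sketch of the pipeline this statement describes: a frozen visual backbone produces clip features that are projected and concatenated with word embeddings before a joint transformer. The dummy FrozenBackbone and the nn.TransformerEncoder standing in for the BERT-initialized transformer are assumptions, not the actual VideoBERT or S3D code.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Placeholder for a pre-trained ConvNet/S3D feature extractor (kept frozen)."""
    def __init__(self, in_dim=2048, out_dim=768):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():        # frozen: never updated during training
            p.requires_grad = False

    def forward(self, clips):              # clips: (batch, num_clips, in_dim)
        return self.proj(clips)

class VideoTextModel(nn.Module):
    def __init__(self, vocab=30522, dim=768):
        super().__init__()
        self.backbone = FrozenBackbone(out_dim=dim)
        self.word_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # would be BERT-initialized

    def forward(self, token_ids, clips):
        text = self.word_emb(token_ids)                    # (batch, T, dim)
        video = self.backbone(clips)                       # (batch, V, dim)
        return self.encoder(torch.cat([text, video], 1))   # joint video-text sequence

model = VideoTextModel()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 8, 2048))
print(out.shape)   # torch.Size([2, 24, 768])
```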
“…In our case, K = 3 and h_{i1} = h_i, h_{i2} = ĥ_i, h_{i3} = h̃_i are embeddings related to the semantics (xSem), intents (xIntent) and reactions (xReact) of the i-th sentence, respectively. Drawing ideas from the multimodal analysis literature (Urooj et al., 2020), we treat the multiple latent vectors as a sequence of features by first concatenating them together. We introduce a special token [FUSE] that accumulates the latent features from the different sentence encodings.…”
Section: Transformer-based Fusion Layer
Mentioning confidence: 99%
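The fusion idea this statement borrows from Urooj et al. (2020) can be sketched as a transformer run over the stacked latent vectors with a learned [FUSE] token prepended; the token's output state serves as the fused representation. Dimensions and the three dummy inputs standing in for the xSem/xIntent/xReact embeddings are assumptions.

```python
import torch
import torch.nn as nn

class FuseTokenLayer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=1):
        super().__init__()
        self.fuse_token = nn.Parameter(torch.randn(1, 1, dim))   # learned [FUSE] embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, latents):                   # latents: list of K (batch, dim) vectors
        seq = torch.stack(latents, dim=1)         # (batch, K, dim) sequence of features
        tok = self.fuse_token.expand(seq.size(0), -1, -1)
        out = self.encoder(torch.cat([tok, seq], dim=1))
        return out[:, 0]                          # state at the [FUSE] position

# Dummy stand-ins for the three sentence-level embeddings (xSem, xIntent, xReact).
h_sem, h_intent, h_react = (torch.randn(2, 256) for _ in range(3))
fused = FuseTokenLayer()([h_sem, h_intent, h_react])
print(fused.shape)   # torch.Size([2, 256])
```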