Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.417

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

Abstract: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA) while ensuring individual and combined processing of multiple input modalities. Our approach benefits from processing multimodal data (video and text) by adopting BERT encodings individually and using a novel transformer-based fusion method to fuse them together. Our method decomposes the different sources of modalities into different BERT instances with similar architectures but variable weights. Thi…
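To make the decomposition the abstract describes more concrete, here is a minimal PyTorch sketch of one encoder instance per input source with an identical architecture but independent weights, whose pooled outputs would later be fused. An embedding layer plus nn.TransformerEncoder stands in for a pre-trained BERT instance; all module names, sources, and sizes are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

def make_encoder(vocab=30522, dim=256, heads=4, layers=2):
    # One "BERT-style" encoder: embedding + transformer encoder (stand-in only).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.ModuleDict({
        "emb": nn.Embedding(vocab, dim),
        "enc": nn.TransformerEncoder(layer, num_layers=layers),
    })

class MultiStreamEncoder(nn.Module):
    def __init__(self, sources=("question", "video", "subtitles")):
        super().__init__()
        # Same architecture for every source, but separate (independent) weights.
        self.streams = nn.ModuleDict({s: make_encoder() for s in sources})

    def forward(self, inputs):                 # inputs: {source: (batch, seq) token ids}
        pooled = {}
        for name, ids in inputs.items():
            m = self.streams[name]
            h = m["enc"](m["emb"](ids))        # (batch, seq, dim)
            pooled[name] = h[:, 0]             # first-token summary, "[CLS]"-style
        return pooled                          # later fed to a fusion transformer

encoder = MultiStreamEncoder()
ids = {k: torch.randint(0, 30522, (2, 12)) for k in ("question", "video", "subtitles")}
print({k: v.shape for k, v in encoder(ids).items()})
```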

Cited by 15 publications (9 citation statements). References 38 publications.
“…Motivated by the success of the Transformer, Li et al. first introduce the Transformer architecture without pre-training to VideoQA (PSAC), which consists of two positional self-attention blocks to replace the LSTM and a video-question co-attention block to simultaneously attend to both visual and textual information. [Yang et al., 2020] and [Urooj et al., 2020] incorporate the pre-trained language-based Transformer (BERT) [Devlin et al., 2019] into movie and story understanding, which requires more modeling of language such as subtitles and dialogues. Both works process each of the input modalities, such as video and subtitles, with the question and a candidate answer, respectively, and later fuse the several streams for the final answer.…”
Section: Methods
Mentioning confidence: 99%
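The statement above describes pairing each modality (video, subtitles) with the question and a candidate answer to form one stream per modality, fused later. A purely illustrative sketch of that input construction follows; the field names, the [SEP] separator string, and the use of text to represent the video stream are assumptions for the sake of the example.

```python
def build_streams(question, candidates, subtitles, video_text):
    # One entry per candidate answer; each entry carries one sequence per modality.
    streams = []
    for ans in candidates:
        qa = f"{question} [SEP] {ans}"
        streams.append({
            "subtitle_stream": f"{qa} [SEP] {subtitles}",
            "video_stream":    f"{qa} [SEP] {video_text}",
        })
    return streams

streams = build_streams(
    question="Why did the character leave the room?",
    candidates=["They were angry", "They heard the doorbell"],
    subtitles="I can't believe you said that ...",
    video_text="person walking towards a door",
)
print(streams[0]["subtitle_stream"])
```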
“…VideoBERT [58], known as the first video-text pre-training model, extends the BERT model to process videos and texts simultaneously. VideoBERT uses a pre-trained ConvNet and S3D [133] to extract video features and concatenates them with textual word embeddings to feed into a transformer initialized with BERT. The ConvNet and S3D are frozen when training VideoBERT, which indicates the approach is not end-to-end.…”
Section: SOTA VLP Models
Mentioning confidence: 99%
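A rough sketch of the pipeline this statement describes: a frozen visual backbone produces clip features that are projected and concatenated with word embeddings before a joint transformer. The dummy FrozenBackbone and the nn.TransformerEncoder standing in for the BERT-initialized transformer are assumptions, not the actual VideoBERT or S3D code.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Placeholder for a pre-trained ConvNet/S3D feature extractor (kept frozen)."""
    def __init__(self, in_dim=2048, out_dim=768):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():        # frozen: never updated during training
            p.requires_grad = False

    def forward(self, clips):              # clips: (batch, num_clips, in_dim)
        return self.proj(clips)

class VideoTextModel(nn.Module):
    def __init__(self, vocab=30522, dim=768):
        super().__init__()
        self.backbone = FrozenBackbone(out_dim=dim)
        self.word_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # would be BERT-initialized

    def forward(self, token_ids, clips):
        text = self.word_emb(token_ids)                    # (batch, T, dim)
        video = self.backbone(clips)                       # (batch, V, dim)
        return self.encoder(torch.cat([text, video], 1))   # joint video-text sequence

model = VideoTextModel()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 8, 2048))
print(out.shape)   # torch.Size([2, 24, 768])
```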
“…In our case, K = 3 and h_{i1} = h_i, h_{i2} = ĥ_i, h_{i3} = h̃_i are embeddings related to the semantics (xSem), intents (xIntent) and reactions (xReact) of the i-th sentence, respectively. Drawing ideas from the multimodal analysis literature (Urooj et al., 2020), we treat the multiple latent vectors as a sequence of features by first concatenating them together. We introduce a special token [FUSE] that accumulates the latent features from the different sentence encodings.…”
Section: Transformer-based Fusion Layer
Mentioning confidence: 99%
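The fusion idea this statement borrows from Urooj et al. (2020) can be sketched as a transformer run over the stacked latent vectors with a learned [FUSE] token prepended; the token's output state serves as the fused representation. Dimensions and the three dummy inputs standing in for the xSem/xIntent/xReact embeddings are assumptions.

```python
import torch
import torch.nn as nn

class FuseTokenLayer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=1):
        super().__init__()
        self.fuse_token = nn.Parameter(torch.randn(1, 1, dim))   # learned [FUSE] embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, latents):                   # latents: list of K (batch, dim) vectors
        seq = torch.stack(latents, dim=1)         # (batch, K, dim) sequence of features
        tok = self.fuse_token.expand(seq.size(0), -1, -1)
        out = self.encoder(torch.cat([tok, seq], dim=1))
        return out[:, 0]                          # state at the [FUSE] position

# Dummy stand-ins for the three sentence-level embeddings (xSem, xIntent, xReact).
h_sem, h_intent, h_react = (torch.randn(2, 256) for _ in range(3))
fused = FuseTokenLayer()([h_sem, h_intent, h_react])
print(fused.shape)   # torch.Size([2, 256])
```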