2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01939
Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

Cited by 76 publications (26 citation statements)
References 26 publications
“…Accordingly, more severe challenges are arising from analyzing the multi-modal signals, namely text, audio, facial micro-expressions, and body actions, to promote the effects of EFL teaching and learning. To this end, a large-scale Transformer (Shvetsova et al., 2022) can be pre-trained on a large amount of multi-modal signals in self-supervised learning of a multi-modal embedding space. For AI techniques in the EFL context, it is meaningful and challenging to adapt the large-scale multi-modal pre-training model to EFL teaching and learning.…”
Section: Discussion
confidence: 99%
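The passage above describes self-supervised pre-training of a joint multi-modal embedding space on large amounts of paired signals. As a minimal sketch of that idea (an illustration of the general contrastive recipe, not the actual training objective of Shvetsova et al., 2022), the PyTorch snippet below pulls video, audio, and text embeddings of the same clip together with a symmetric InfoNCE loss over every modality pair; the feature dimensions and the `infonce` helper are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce(a, b, temperature=0.05):
    """Symmetric InfoNCE between two batches of embeddings.

    a, b: (batch, dim) tensors where a[i] and b[i] come from the same clip,
    so matching pairs lie on the diagonal of the similarity matrix.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical pre-extracted features for one batch of clips.
batch, dim = 8, 256
video, audio, text = (torch.randn(batch, dim) for _ in range(3))
loss = infonce(video, text) + infonce(video, audio) + infonce(audio, text)
```

Summing the pairwise losses is only the simplest choice; fused combinations of modalities can also be contrasted against single modalities in the same framework.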
“…Multimodal learning with transformers has been tested in multiple areas, especially in the audiovisual field to join video, language, and audio features [34], [35], [36], [37], but also in deepfake detection [38], medical imagery synthesis [39], etc. Most of these proposed models extract embeddings from the modalities without transformers and then make the fusion in a single custom transformer.…”
Section: Multimodal Transformers Architectures
confidence: 99%
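The fusion pattern this statement describes, per-modality feature extractors feeding one joint transformer, can be sketched in a few lines. The snippet below is a hypothetical minimal version: the class name, dimensions, modality-type embeddings, and mean pooling are all assumptions rather than any cited model's architecture.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Toy single-transformer fusion over tokens from several modalities."""

    def __init__(self, dims, d_model=256):
        super().__init__()
        # One linear projection per modality into the shared token space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Learned modality-type embeddings mark which tokens belong to which modality.
        self.type_emb = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, d_model)) for m in dims})
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):
        # feats: dict of (batch, tokens, dim) tensors, one entry per modality.
        tokens = [self.proj[m](x) + self.type_emb[m] for m, x in feats.items()]
        fused = self.encoder(torch.cat(tokens, dim=1))  # joint attention over all tokens
        return fused.mean(dim=1)                        # pooled joint embedding

model = FusionTransformer({"video": 512, "audio": 128, "text": 300})
joint = model({"video": torch.randn(2, 16, 512),
               "audio": torch.randn(2, 10, 128),
               "text": torch.randn(2, 12, 300)})        # -> (2, 256)
```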
“…The MZSL algorithm can be used for modeling as long as a bridge of knowledge transfer can be built between the seen and unseen classes in the task scenarios, such as using the semantic attributes of the classes. Instead of an introduction that concentrates on the applications themselves, several datasets available in various scenarios are offered to readers as guidelines, such as cross‐modal classification and retrieval (Geigle et al., 2022; Mercea et al., 2022; Parida et al., 2020; Shvetsova et al., 2022; Wray et al., 2019), cross‐lingual retrieval (P.‐Y. Huang, Patrick, et al., 2021), sketch‐based image retrieval (Jing et al., 2022), code search (D. Guo, Lu, Duan, et al., 2022), visual question answering (Z. Chen, Chen, et al., 2021), event detection (Elhoseiny et al., 2016; S. Wu et al., 2014), visual grounding (Tziafas & Kasaei, 2021), natural language grounding (Sinha et al., 2019), semantic image manipulation (S. H. Lee et al., 2022), medical image segmentation (Bian et al., 2022), video object segmentation (Zhao et al., 2021), sign language recognition (Madapana, 2020), tactile object recognition (H. Liu et al., 2018), and driver behavior recognition (Reiß et al., 2020).…”
Section: Model Evaluation Metrics and Datasets for MZSL
confidence: 99%