2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01939
Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

Cited by 76 publications (26 citation statements)
References 26 publications
“…Accordingly, more severe challenges are arising from analyzing the multi-modal signals, namely text, audio, facial micro-expressions, and body actions, to promote the effects of EFL teaching and learning. To this end, a large-scale Transformer (Shvetsova et al., 2022) can be pre-trained on a large amount of multi-modal signals in self-supervised learning of a multi-modal embedding space. For AI techniques in the EFL context, it is meaningful and challenging to adapt the large-scale multi-modal pre-training model to EFL teaching and learning.…”
Section: Discussion
confidence: 99%
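The passage above describes self-supervised pre-training of a joint multi-modal embedding space on large amounts of paired signals. As a minimal sketch of that idea (an illustration of the general contrastive recipe, not the actual training objective of Shvetsova et al., 2022), the PyTorch snippet below pulls video, audio, and text embeddings of the same clip together with a symmetric InfoNCE loss over every modality pair; the feature dimensions and the `infonce` helper are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce(a, b, temperature=0.05):
    """Symmetric InfoNCE between two batches of embeddings.

    a, b: (batch, dim) tensors where a[i] and b[i] come from the same clip,
    so matching pairs lie on the diagonal of the similarity matrix.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical pre-extracted features for one batch of clips.
batch, dim = 8, 256
video, audio, text = (torch.randn(batch, dim) for _ in range(3))
loss = infonce(video, text) + infonce(video, audio) + infonce(audio, text)
```

Summing the pairwise losses is only the simplest choice; fused combinations of modalities can also be contrasted against single modalities in the same framework.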
“…Multimodal learning with transformers has been tested in multiple areas, especially in the audiovisual field to join video, language, and audio features [34], [35], [36], [37], but also in deepfake detection [38], medical imagery synthesis [39], etc. Most of these proposed models extract embeddings from the modalities without transformers and then make the fusion in a single custom transformer.…”
Section: Multimodal Transformers Architectures
confidence: 99%
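The fusion pattern this statement describes, per-modality feature extractors feeding one joint transformer, can be sketched in a few lines. The snippet below is a hypothetical minimal version: the class name, dimensions, modality-type embeddings, and mean pooling are all assumptions rather than any cited model's architecture.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Toy single-transformer fusion over tokens from several modalities."""

    def __init__(self, dims, d_model=256):
        super().__init__()
        # One linear projection per modality into the shared token space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Learned modality-type embeddings mark which tokens belong to which modality.
        self.type_emb = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, d_model)) for m in dims})
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):
        # feats: dict of (batch, tokens, dim) tensors, one entry per modality.
        tokens = [self.proj[m](x) + self.type_emb[m] for m, x in feats.items()]
        fused = self.encoder(torch.cat(tokens, dim=1))  # joint attention over all tokens
        return fused.mean(dim=1)                        # pooled joint embedding

model = FusionTransformer({"video": 512, "audio": 128, "text": 300})
joint = model({"video": torch.randn(2, 16, 512),
               "audio": torch.randn(2, 10, 128),
               "text": torch.randn(2, 12, 300)})        # -> (2, 256)
```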
“…The MZSL algorithm can be used for modeling as long as a bridge of knowledge transfer can be built between the seen and unseen classes in the task scenarios, such as using the semantic attributes of the classes. Instead of an introduction that concentrates on the applications themselves, several datasets available in various scenarios are offered to readers as guidelines, such as cross‐modal classification and retrieval (Geigle et al., 2022; Mercea et al., 2022; Parida et al., 2020; Shvetsova et al., 2022; Wray et al., 2019), cross‐lingual retrieval (P.‐Y. Huang, Patrick, et al., 2021), sketch‐based image retrieval (Jing et al., 2022), code search (D. Guo, Lu, Duan, et al., 2022), visual question answering (Z. Chen, Chen, et al., 2021), event detection (Elhoseiny et al., 2016; S. Wu et al., 2014), visual grounding (Tziafas & Kasaei, 2021), natural language grounding (Sinha et al., 2019), semantic image manipulation (S. H. Lee et al., 2022), medical image segmentation (Bian et al., 2022), video object segmentation (Zhao et al., 2021), sign language recognition (Madapana, 2020), tactile object recognition (H. Liu et al., 2018), and driver behavior recognition (Reiß et al., 2020).…”
Section: Model Evaluation Metrics and Datasets for MZSL
confidence: 99%