Video Transformers: A Survey
2023
DOI: 10.1109/tpami.2023.3243465
Cited by 43 publications (18 citation statements); references 141 publications.
“…However, limited training data and model complexity remained among the primary factors constraining model performance. Transformers have also been used for tasks beyond NLP, such as image and video processing [95], and they remain an active area of research in the deep learning community.…”
Section: Introduction (mentioning, confidence: 99%)
“…Interpretable spatio-temporal attention [48] used spatial and temporal attention via ConvLSTM. Recent self-attention mechanisms have also been introduced in STA-TSN [49] and GTA [50], as well as in Transformer-based video models [3]. Although some of these methods do not aim at visual explanation, the blurry-map issue remains for videos because temporal modeling, while useful for classification, can hinder the capture of sharp spatial attention maps.…”
Section: Related Work (mentioning, confidence: 99%)
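
To make the "Transformer-based video models" mentioned in the statement above concrete, here is a minimal sketch of divided space-time self-attention, a factorization common in such models: each spatial location first attends over time, then each frame's tokens attend within the frame. This is not code from the survey or the cited works; the class name, tensor shapes, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions.

```python
# Minimal sketch of divided space-time attention (illustrative, not from
# the surveyed paper). Tokens are arranged as (batch, time, patches, dim).
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal self-attention across frames, then spatial self-attention
    within each frame, each with a residual connection."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        # Temporal attention: each spatial location attends over the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3) + x
        # Spatial attention: each frame's N tokens attend within the frame.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial(xs, xs, xs)
        return xs.reshape(b, t, n, d) + x

# Example: 2 videos, 8 frames, 196 patch tokens of width 384.
tokens = torch.randn(2, 8, 196, 384)
out = DividedSpaceTimeAttention(384)(tokens)
print(out.shape)  # torch.Size([2, 8, 196, 384])
```

The factorization keeps attention cost linear in the number of frames times patches rather than quadratic in their product, which is one reason several video transformers adopt it.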
“…In multimodal learning, models process and integrate data from multiple modalities [5,6,45], with applications in visual and language learning [43], video understanding [46,47], and natural language understanding [29,30,35]. However, expensive human annotations are often required for effective training.…”
Section: Self-Supervised Multimodal Learning (mentioning, confidence: 99%)