2023
DOI: 10.3390/s23020734
|View full text |Cite
|
Sign up to set email alerts
|

Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

Abstract: Understanding actions in videos remains a significant challenge in computer vision, which has been the subject of several pieces of research in the last decades. Convolutional neural networks (CNN) are a significant component of this topic and play a crucial role in the renown of Deep Learning. Inspired by the human vision system, CNN has been applied to visual data exploitation and has solved various challenges in various computer vision tasks and video/image analysis, including action recognition (AR). Howev… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1
1

Citation Types

0
15
0

Year Published

2023
2023
2024
2024

Publication Types

Select...
7

Relationship

0
7

Authors

Journals

citations
Cited by 49 publications
(30 citation statements)
references
References 131 publications
0
15
0
Order By: Relevance
“…By considering the global information of the image, ViT is more competitive in some visual classification tasks [44,45]. Scholars have proved that ViT can be used in traffic sign classification [46], plant disease detection [47], and face recognition [48]. Vit-based transfer learning begins to attract more and more attention [49][50][51].…”
Section: Transfer Learning Overviewmentioning
confidence: 99%
“…By considering the global information of the image, ViT is more competitive in some visual classification tasks [44,45]. Scholars have proved that ViT can be used in traffic sign classification [46], plant disease detection [47], and face recognition [48]. Vit-based transfer learning begins to attract more and more attention [49][50][51].…”
Section: Transfer Learning Overviewmentioning
confidence: 99%
“…It is a subset of neural networks typically employed in contexts of three dimensional units, namely the height, width and depth used for image analysis and also applies to object classification [43]. Since CNNs can learn directly from the raw time series data, extract features from sequences of observations, and require neither domain expertise nor manually engineered input characteristics this current study plan to utilise it to detect the attributes associated to the activities of elderly people in determining their unusual behavior [44].…”
Section: The Convolutional Neural Network Architecturementioning
confidence: 99%
“…However, CNN has a few weaknesses, including a slowness that is brought on by the max pooling operation; additionally, in contrast to the Transformer, it does not consider several perspectives that can be gained by learning, [121] which leads to disregard for global knowledge. Because it offers solutions to CNN's numerous weaknesses, Transformer has quickly become CNN's most formidable opponent [122] . The capability of the Transformer to prioritize relevant content while minimizing the repetition of unimportant content is its strength [123] .…”
Section: The Roles Of Transformers In Predicting the Use Of Drug Comb...mentioning
confidence: 99%
“…However, it should be mentioned that both algorithms (i.e., CNN and Transformer) have their own shortcomings and benefits and it is still difficult to determine who will win this race. Nevertheless, the hybrid method, which is the most attractive formula because it enables us to take advantage of a model's strengths while simultaneously reducing the effects of that model's downsides, is more efficient and cost‐effective [122] . It combines CNN with transformers to provide a reliable model.…”
Section: The Roles Of Transformers In Predicting the Use Of Drug Comb...mentioning
confidence: 99%
See 1 more Smart Citation