Over the past few decades, human activity recognition (HAR) has become one of the most active research areas in computer vision. Researchers' focus is shifting towards this area due to its vast range of real-life applications that assist daily living. Therefore, it is necessary to validate its performance on standard benchmark datasets and against state-of-the-art systems before deploying it in real-life applications. The primary objective of this Systematic Literature Review (SLR) is to collect existing research on video-based human activity recognition and to summarize and analyze state-of-the-art deep learning architectures with respect to their methodologies, challenges, and issues. Five major scientific databases (ACM, IEEE, ScienceDirect, SpringerLink, and Taylor & Francis) were searched to support this systematic study, and 70 research articles on human activity recognition were summarized after critical review. Human activity recognition in videos is a challenging problem due to its diverse and complex nature. For accurate video classification, extraction of both spatial and temporal features from video sequences is essential. Therefore, this SLR focuses on reviewing recent advancements in stratified, self-deriving feature-based deep learning architectures. Furthermore, it explores the various deep learning techniques available for HAR, the challenges researchers face in building robust models, and the state-of-the-art datasets used for evaluation. This SLR intends to provide a baseline for video-based human activity recognition research while highlighting several challenges affecting recognition accuracy in video sequences when using deep neural architectures.
Human Activity Recognition (HAR) is an active research area due to its applications in pervasive computing, human-computer interaction, artificial intelligence, health care, and social sciences. At the same time, dynamic environments and anthropometric differences between individuals make actions harder to recognize. This study focuses on human activity recognition in video sequences acquired with an RGB camera because of its vast range of real-world applications. It uses a two-stream ConvNet to extract spatial and temporal information and proposes a fine-tuned deep neural network. Moreover, the transfer learning paradigm is adopted to extract varied and fixed frames while reusing object-identification knowledge. Six state-of-the-art pre-trained models are exploited to find the best model for spatial feature extraction. For the temporal stream, this study uses dense optical flow, following the two-stream ConvNet design, together with a Bidirectional Long Short-Term Memory (BiLSTM) network to capture long-term dependencies. Two state-of-the-art datasets, UCF101 and HMDB51, are used for evaluation. In addition, seven state-of-the-art optimizers are used to fine-tune the proposed network parameters. Furthermore, this study utilizes an ensemble mechanism to aggregate spatial-temporal features using a four-stream Convolutional Neural Network (CNN), where two streams use RGB data and the other two use optical-flow images. Finally, the proposed ensemble approach using max hard voting outperforms state-of-the-art methods with 96.30% and 90.07% accuracy on the UCF101 and HMDB51 datasets, respectively.
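The abstract above describes a two-stream design: a spatial stream built on pre-trained CNN backbones for RGB frames, a temporal stream that feeds optical-flow information to a BiLSTM, and a final ensemble that combines stream predictions by max hard voting. The following is a minimal sketch of that idea in PyTorch; the class names, the choice of ResNet-50 as the backbone, the feature dimensions, and the assumption that optical-flow features are already extracted are all illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch only: a two-stream HAR model with a pre-trained spatial
# backbone, an optical-flow BiLSTM temporal stream, and hard-voting fusion.
# Backbone choice, dimensions, and names are assumptions, not the paper's code.
import torch
import torch.nn as nn
from torchvision import models


class SpatialStream(nn.Module):
    """Per-frame RGB features from a pre-trained backbone (transfer learning)."""

    def __init__(self, num_classes: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()           # reuse ImageNet features as-is
        self.backbone = backbone
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, frames):                # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        return self.classifier(feats.mean(dim=1))   # average over sampled frames


class TemporalStream(nn.Module):
    """Optical-flow features passed through a BiLSTM for long-term dependencies."""

    def __init__(self, num_classes: int, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, flow_feats):            # flow_feats: (B, T, feat_dim), pre-extracted
        out, _ = self.lstm(flow_feats)
        return self.classifier(out[:, -1])    # last time step summarizes the clip


def hard_vote(logits_per_stream):
    """Max (hard) voting: each stream votes for its arg-max class."""
    votes = torch.stack([l.argmax(dim=1) for l in logits_per_stream])  # (S, B)
    return torch.mode(votes, dim=0).values                             # (B,)
```

In this sketch each stream produces class logits independently and `hard_vote` returns the class most streams agree on; a four-stream ensemble like the one described above could be approximated by instantiating two spatial and two temporal streams and passing all four logit tensors to `hard_vote`.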