Sign language fills the communication gap for people with hearing and speech impairments. It spans two visual modalities: manual gestures consisting of hand movements, and non-manual gestures incorporating body movements such as head pose, facial expressions, eye gaze, and shoulder shrugging. Previous work has detected the two gesture types separately; recognizing them in isolation may yield better per-modality accuracy, but much of the communicated information is lost. A proper sign language recognition mechanism needs to detect manual and non-manual gestures together to convey the full, detailed message to others. Our proposed system, the Sign Language Action Transformer Network (SLATN), localizes hand, body, and facial gestures in video sequences. We employ a Transformer-style architecture as a "base network" to extract features from the spatiotemporal domain. The model automatically learns to track individual persons and their action context across multiple frames. A "head network" then attends to hand movements and facial expressions simultaneously, which is often crucial to understanding sign language, using its attention mechanism to create tight bounding boxes around classified gestures. The model is compared with traditional activity-recognition methods: it not only runs faster but also achieves better accuracy. It reaches an overall testing accuracy of 82.66% at a competitive computational rate of 94.13 Giga Floating-Point Operations per Second (G-FLOPS). A further contribution is a newly created Pakistan Sign Language dataset of Manual and Non-Manual (PkSLMNM) gestures.
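The abstract above describes a transformer "base network" for spatiotemporal features and an attention-based "head network" for classification and tight bounding boxes. The following is a minimal sketch of that idea in PyTorch; the layer sizes, class names, and the single learned pooling query are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch of the SLATN idea: transformer base over per-frame
# features, attention head producing a gesture class and a bounding box.
import torch
import torch.nn as nn

class SLATNSketch(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_classes=40):
        super().__init__()
        # "Base network": transformer encoder over per-frame features,
        # standing in for the spatiotemporal feature extractor.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.base = nn.TransformerEncoder(layer, num_layers=4)
        # "Head network": attention pooling over frames, then classification
        # and bounding-box regression branches.
        self.attn_pool = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.cls_head = nn.Linear(feat_dim, num_classes)  # gesture class
        self.box_head = nn.Linear(feat_dim, 4)            # (x, y, w, h) box

    def forward(self, frame_feats):              # frame_feats: (batch, frames, feat_dim)
        ctx = self.base(frame_feats)             # spatiotemporal context per frame
        q = self.query.expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn_pool(q, ctx, ctx)  # attend to the relevant frames
        pooled = pooled.squeeze(1)
        return self.cls_head(pooled), self.box_head(pooled).sigmoid()

# Example with random per-frame features (2 clips, 16 frames each):
# logits, boxes = SLATNSketch()(torch.randn(2, 16, 256))
```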
Sign language recognition is an important cross-modal means of filling the communication gap between deaf and hearing people. Automatic Sign Language Recognition (ASLR) translates sign language gestures into text and spoken words. Most researchers focus on either manual or non-manual gestures separately; concurrent recognition of manual and non-manual gestures is rarely addressed, even though facial expressions and other body movements can improve the accuracy rate and capture a sign's exact meaning. The current paper proposes a Multimodal Sign Language Recognition (MM-SLR) framework that recognizes non-manual features based on facial expressions together with manual gestures representing hand movements in the spatiotemporal domain. The proposed architecture has three modules. First, a modified YOLOv5 architecture extracts faces and hands from videos as two regions of interest. Second, a refined C3D architecture extracts features from the hand region and the face region, and the features of the two modalities are concatenated. Finally, an LSTM network produces spatiotemporal descriptors and an attention-based sequential module performs gesture classification. To validate the proposed framework, we used three publicly available datasets: RWTH-PHOENIX-Weather-2014T, SILFA, and PkSLMNM. Experimental results show that the MM-SLR framework outperforms existing methods on all three datasets.
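The MM-SLR abstract describes a three-module pipeline: a detector for hand and face regions, C3D feature extraction per modality followed by concatenation, and an LSTM with attention for classification. The sketch below illustrates the fusion and classification stages under stated assumptions; the detector and the refined C3D backbone are replaced by stubs, and all dimensions are hypothetical.

```python
# Hypothetical sketch of the MM-SLR fusion pipeline; YOLOv5 detection is not shown,
# and C3DStub stands in for the refined C3D feature extractor.
import torch
import torch.nn as nn

class C3DStub(nn.Module):
    """Stand-in for the C3D feature extractor over one cropped region."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))     # keep the temporal axis
        self.proj = nn.Linear(16, out_dim)

    def forward(self, clip):                    # clip: (batch, 3, frames, H, W)
        x = self.pool(torch.relu(self.conv(clip)))          # (batch, 16, frames, 1, 1)
        x = x.flatten(3).squeeze(-1).permute(0, 2, 1)        # (batch, frames, 16)
        return self.proj(x)                                  # (batch, frames, out_dim)

class MMSLRSketch(nn.Module):
    def __init__(self, feat_dim=128, num_classes=100):
        super().__init__()
        self.hand_c3d = C3DStub(feat_dim)
        self.face_c3d = C3DStub(feat_dim)
        # Concatenated modalities -> LSTM descriptors -> attention pooling.
        self.lstm = nn.LSTM(2 * feat_dim, feat_dim, batch_first=True)
        self.attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, hand_clip, face_clip):
        # hand_clip / face_clip would come from the detector's two ROIs.
        feats = torch.cat([self.hand_c3d(hand_clip), self.face_c3d(face_clip)], dim=-1)
        seq, _ = self.lstm(feats)                            # (batch, frames, feat_dim)
        weights = torch.softmax(self.attn(seq), dim=1)       # attention over time
        pooled = (weights * seq).sum(dim=1)
        return self.classifier(pooled)

# Example: logits = MMSLRSketch()(torch.randn(2, 3, 16, 112, 112),
#                                 torch.randn(2, 3, 16, 112, 112))
```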
The latest video coding standard, Versatile Video Coding (VVC)/H.266, was developed by the Joint Video Experts Team (JVET). Its coding structure is a multi-type tree (MTT); under the MTT umbrella there are two tree types, the ternary tree (TT) and the binary tree (BT). Because the encoder performs a brute-force search over rate-distortion cost, the quad-tree plus multi-type tree (QTMT) coding unit (CU) partitioning contributes over 98% of the encoding time. The structure is efficient for coding but greatly increases computational complexity. The current paper proposes a deep learning technique that predicts the QTMT-based CU split instead of running the brute-force QTMT search, substantially reducing the encoding time of VVC/H.266 intra mode. In the first phase, we build an extensive database of CU splitting patterns from a variety of streaming videos, enabling a significant, data-driven reduction of VVC/H.266 complexity. In the second phase, in accordance with the dynamic QTMT structure at different levels, we propose a multi-level exit CNN (MLE-CNN) with a redundancy-removal mechanism at each level to determine the CU partition. In the third phase, we train the MLE-CNN with an adaptive loss function that accounts for both the unknown number of partition modes and the focus on rate-distortion (RD) cost minimization. Finally, a variable-threshold decision scheme is established to achieve the targeted trade-off between low complexity and RD performance. Experimental results show that the proposed method reduces the VVC/H.266 encoding time by 47.91% to 69.11% with a negligible Bjøntegaard delta bit rate (BDBR) increase of 1.023% to 2.919%, outperforming existing state-of-the-art approaches.
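The abstract above centers on a multi-level early-exit CNN with a confidence threshold that decides when deeper partition checks can be skipped. The following is a loose, hypothetical sketch of that mechanism; the layer sizes, exit points, partition-mode count, and threshold are illustrative assumptions and not the paper's MLE-CNN.

```python
# Hypothetical sketch of a multi-level early-exit CNN over a CU's luma samples.
# Each exit predicts a partition decision for one level of the QTMT structure;
# a confidence threshold decides whether to stop early and skip deeper checks.
import torch
import torch.nn as nn

class MLECNNSketch(nn.Module):
    def __init__(self, num_modes=6):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.exit1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_modes))
        self.exit2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_modes))

    def forward(self, cu, threshold=0.9):        # cu: (batch, 1, 64, 64) luma block
        x = self.stage1(cu)
        p1 = torch.softmax(self.exit1(x), dim=-1)
        if p1.max(dim=-1).values.min() > threshold:   # confident at level 1: exit early
            return p1
        x = self.stage2(x)                             # otherwise descend a level
        return torch.softmax(self.exit2(x), dim=-1)

# Example: probs = MLECNNSketch()(torch.randn(4, 1, 64, 64), threshold=0.9)
```

Raising the threshold trades speed for accuracy: fewer CUs exit early, so more RD checks run, which mirrors the variable-threshold complexity/RD trade-off described in the abstract.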
Interpreting sign language automatically is a challenging task, as it requires high-level vision features to accurately understand the signer's intended meaning. In the current study, we automatically distinguish hand signs and classify seven basic gestures representing symbolic emotions or expressions: happy, sad, neutral, disgust, scared, anger, and surprise. Convolutional Neural Networks are a well-known method for vision-based deep learning classification; here we propose transfer learning with the well-known VGG16 architecture, using pre-trained weights to speed up convergence and improve accuracy. The proposed architecture achieves a high accuracy of 99.98% on a small, low-quality dataset of 455 images collected from 65 individuals across the seven hand gesture classes. We further compare the performance of the VGG16 architecture with two different optimizers, SGD and Adam, and with additional architectures: AlexNet, LeNet-5, and ResNet50.
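The setup described above, VGG16 with pre-trained weights, a replaced classifier head for seven classes, and an SGD-versus-Adam comparison, can be sketched as follows. This is a minimal illustration assuming a recent torchvision for the pretrained-weights API; the learning rates and the choice to freeze the convolutional base are assumptions, not values from the paper.

```python
# Hypothetical transfer-learning sketch: VGG16 with ImageNet weights, new 7-class head.
import torch
import torch.nn as nn
from torchvision import models

def build_vgg16(num_classes=7):
    model = models.vgg16(weights="IMAGENET1K_V1")   # load pre-trained weights
    for p in model.features.parameters():           # freeze the convolutional base
        p.requires_grad = False
    # Replace the final classifier layer for the seven gesture classes.
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
    return model

model = build_vgg16()
# The two optimizers compared in the abstract; only the classifier head is trained here.
sgd = torch.optim.SGD(model.classifier.parameters(), lr=1e-3, momentum=0.9)
adam = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
```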