Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have proven to be an important research area in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information at varying scales, and existing recurrence-based pipelines fail to capture long-term motion dynamics in videos with diverse motion scales and complex actions performed by multiple actors. Attending to contextual and salient features is more important than mapping a video frame into a static video representation. This work presents a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we introduce a two-stream framework, with one modification: we replace the optical flow stream with a saliency-aware stream to avoid its computational cost. First, we generate a saliency-aware video stream by applying a saliency detection method. Second, a two-stream 3D convolutional (C3D) network is applied to the two streams, i.e., the RGB stream and the saliency-aware video stream, to extract both spatial and semantic temporal features. Next, a deep bidirectional LSTM network learns sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify human activity and behavior. The proposed system can learn long-term temporal dependencies and predict complex human actions. Experimental results demonstrate significant improvements in action recognition accuracy on several benchmark datasets.
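The back end of the pipeline above — a bidirectional LSTM over per-clip features, followed by time-series pooling and a softmax — can be sketched as follows. This is a minimal illustration with random weights, not the authors' architecture: the C3D feature extractor is stubbed out as a random feature sequence, and the layer sizes are arbitrary.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: input, forget, output gates and candidate cell,
    computed from input x and previous hidden state h."""
    z = W @ x + U @ h + b                      # stacked gate pre-activations
    H = h.shape[0]
    i, f, o = (1 / (1 + np.exp(-z[k*H:(k+1)*H])) for k in range(3))
    g = np.tanh(z[3*H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def bidirectional_lstm(seq, W, U, b):
    """Run an LSTM forward and backward over a feature sequence and
    concatenate the two hidden states at each time step."""
    H = U.shape[1]
    out_f, out_b = [], []
    h, c = np.zeros(H), np.zeros(H)
    for x in seq:                              # forward pass
        h, c = lstm_step(x, h, c, W, U, b)
        out_f.append(h)
    h, c = np.zeros(H), np.zeros(H)
    for x in reversed(seq):                    # backward pass
        h, c = lstm_step(x, h, c, W, U, b)
        out_b.append(h)
    out_b.reverse()
    return [np.concatenate([f_, b_]) for f_, b_ in zip(out_f, out_b)]

def classify(seq_features, W_cls):
    """Time-series pooling (mean over time) followed by softmax."""
    pooled = np.mean(seq_features, axis=0)
    logits = W_cls @ pooled
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, H, T, C = 8, 4, 5, 3                        # feature dim, hidden, steps, classes
W = rng.normal(size=(4*H, D))
U = rng.normal(size=(4*H, H))
b = np.zeros(4*H)
seq = [rng.normal(size=D) for _ in range(T)]   # stand-in for C3D clip features
probs = classify(bidirectional_lstm(seq, W, U, b), rng.normal(size=(C, 2*H)))
print(probs)
```

Because the backward pass sees the frames in reverse order, each time step's concatenated state carries context from both past and future frames, which is what allows the model to capture long-term temporal dependencies.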
With an astounding five million fatal cases every year, lung cancer is among the leading causes of mortality worldwide for both men and women. Computed tomography (CT) scans provide information that can aid the diagnosis of lung disease. The main goals of this study are to diagnose lung cancer and its severity and to identify malignant lung nodules in a given input lung image. This paper applies dedicated deep learning techniques to locate malignant lung nodules precisely. Using a DenseNet model, mixed ground-glass nodules (mGGNs) are analyzed in low-dose, low-resolution CT images with a slice thickness of 5 mm in order to classify and identify histological subtypes of lung cancer. Low-resolution CT scans are used to pathologically classify invasive adenocarcinoma (IAC) and minimally invasive adenocarcinoma (MIA). 105 low-resolution CT images with 5 mm thick slices from 105 patients at Lishui Central Hospital were selected. To detect and distinguish IAC and MIA, extended and enhanced two- and three-dimensional DenseNet models are used. The two-dimensional DenseNet model performed considerably better than the three-dimensional one, with a classification accuracy of 76.67%, sensitivity of 63.3%, specificity of 100%, and area under the receiver operating characteristic curve of 0.88. Identifying the histological subtypes of lung cancer patients should help doctors make a more precise diagnosis, even when image quality is not outstanding.
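The evaluation metrics reported above (accuracy, sensitivity, specificity) all follow from a binary confusion matrix, with IAC taken as the positive class. The sketch below shows how they are computed; the counts are hypothetical and chosen only so that sensitivity matches the reported 63.3% (19/30) and specificity is 100%, not to reproduce the study's actual data.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration (IAC = positive class):
acc, sens, spec = binary_metrics(tp=19, fn=11, tn=50, fp=0)
print(f"accuracy={acc:.4f} sensitivity={sens:.4f} specificity={spec:.4f}")
```

A specificity of 100% means the model produced no false positives on the negative (MIA) class, which is why sensitivity, not specificity, is the limiting metric in the reported results.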
Communication with a hearing-impaired individual is a major challenge for a hearing person. Hearing-impaired people use hand-gesture language (sign language) to communicate with each other, which is difficult for an untrained hearing person to understand. This communication gap creates serious problems for hearing-impaired individuals while shopping, during hospitalization, and at school and home. In emergencies in particular, it is very difficult to understand a hearing-impaired person who uses sign language. Over the last few years, researchers and developers around the world have presented various ideas and systems to address this problem, but no available solution resolves the issue by enabling two-way communication between hearing-impaired and hearing persons. This paper presents a detailed description of a two-way communication system based on Pakistan Sign Language (PSL). This duplex system converts simple English text into hand gestures and vice versa; conversion from hand gestures is available not only as text but also as voice, providing more convenience to the hearing person. The main objective is to serve a large population and make hearing-impaired persons a vital part of society. A hearing person enters text (a sentence) into the application; after spelling and grammar checking, the text is divided into tokens and sub-tokens. A token is a gesture for each word of the text, while sub-tokens are the gestures for each character of a word. The combination of tokens produces the gesture sequence for the text. Conversely, when gestures are input into the application, image processing techniques recognize the hand gestures and convert them into the corresponding text or voice.
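The token/sub-token split described above can be sketched as a simple fallback scheme: words with a dedicated gesture in the lexicon become a single token, and out-of-lexicon words fall back to per-character sub-tokens (fingerspelling). The lexicon and words below are hypothetical examples, not the system's actual PSL gesture set.

```python
def tokenize(sentence, gesture_lexicon):
    """Map a sentence to gesture tokens: one token per known word,
    per-character sub-tokens for words without a dedicated gesture."""
    gestures = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        if word in gesture_lexicon:
            gestures.append(("token", word))                    # whole-word gesture
        else:
            gestures.extend(("sub-token", ch) for ch in word)   # fingerspell it
    return gestures

# Hypothetical lexicon of words that have dedicated PSL gestures.
lexicon = {"hello", "help", "where"}
result = tokenize("Hello, where is Ali?", lexicon)
print(result)
```

Playing the resulting list in order yields the gesture sequence for the sentence; the reverse direction (gesture recognition to text or voice) is the image-processing side of the duplex system and is not sketched here.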
This paper discusses object shape identification using the local binary pattern (LBP) technique. Since LBP is computationally simple, it has been used successfully for recognition of various objects. LBP, which has the potential to be used in various identification-related fields, was applied to a number of differently shaped objects. The process converts the given image into 3x3 binary matrices, and several rounds of computation yield the final decision parameter, known as the merit function. This parameter is then used to uniquely identify the shape of different objects.
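The 3x3 binary matrix at the core of LBP is obtained by thresholding each pixel's eight neighbours against the centre pixel and reading the result as an 8-bit code. The sketch below shows the standard LBP code for a single 3x3 patch; the paper's merit function built on top of these codes is specific to that work and is not reproduced here.

```python
import numpy as np

def lbp_code(patch):
    """LBP code of a 3x3 patch: threshold the 8 neighbours against the
    centre pixel and read the resulting bits as one binary number."""
    center = patch[1, 1]
    # Clockwise neighbour order starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, c] >= center else 0 for r, c in offsets]
    return sum(b << (7 - k) for k, b in enumerate(bits))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))  # bits 10001111 -> 143
```

Sliding this window over the whole image and histogramming the codes gives the texture descriptor that downstream computations (such as a merit function) can operate on.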
Detecting rare and complex events in large video datasets or in unconstrained user-uploaded internet videos is a challenging task. Irregular camera movement, viewpoint changes, illumination variations, and significant background changes make it extremely difficult to capture the underlying motion in videos. In addition, extracting features via different modalities (separate streams) can add computational cost and introduce confusing, irrelevant spatial and semantic features. To address this problem, we present a single-stream (RGB-only) approach based on the fusion of spatial and semantic features extracted by a modified 3D residual convolutional network. We combine the spatial and semantic features under the assumption that the difference between the two types of features can reveal the accurate and relevant ones. Moreover, a temporal encoding builds relationships between consecutive video frames to discover discriminative long-term motion patterns. We conduct extensive experiments on prominent publicly available datasets. The results demonstrate the strength of our proposed model and improved accuracy compared with existing state-of-the-art methods.
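The two ideas above — using the spatial/semantic feature difference as a relevance cue, and encoding relationships between consecutive frames — can be illustrated as follows. This is a loose sketch under stated assumptions: the difference is used as a simple per-dimension weight, and the temporal encoding is reduced to first-order frame differences; the paper's actual fusion and encoding are not specified at this level of detail.

```python
import numpy as np

def fuse(spatial, semantic):
    """Difference-based fusion (illustrative): the gap between spatial
    and semantic features is treated as a relevance cue and used to
    weight the spatial features per dimension."""
    diff = np.abs(spatial - semantic)
    w = diff / (diff.sum(axis=-1, keepdims=True) + 1e-8)
    return w * spatial

def temporal_encode(frames):
    """Minimal temporal encoding: first-order differences between
    consecutive frame features, linking each frame to the next."""
    return np.diff(frames, axis=0)

rng = np.random.default_rng(1)
spatial  = rng.normal(size=(6, 16))   # 6 frames, 16-D spatial features
semantic = rng.normal(size=(6, 16))   # matching semantic features
fused    = fuse(spatial, semantic)
encoded  = temporal_encode(fused)
print(fused.shape, encoded.shape)
```

Chaining the per-frame differences across a clip is what lets such an encoding accumulate evidence of long-term motion patterns rather than treating each frame in isolation.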