“…With the great success of deep learning-based methods in speech recognition [10], visual question answering [11] and NLP, scholars have also made some progress in the application of deep learning to continuous SLR [12][13][14]. Many deep learning-based methods have been applied to visual feature extraction and sequence model learning for SLR.…”
Section: Related Work
“…Finally, the sequence feature S is obtained using a weighted residual connection and layer normalisation. As shown in Equations (9) and (10):…”
Section: Multi-scale Mixing To Enhance Attention
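The snippet above describes obtaining the sequence feature S via a weighted residual connection followed by layer normalisation, but Equations (9) and (10) themselves are not reproduced in this excerpt. A minimal sketch of that common pattern, assuming a form S = LayerNorm(α·x + (1−α)·Sublayer(x)) with a hypothetical weight α (the paper's exact weighting may differ):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each time step's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def weighted_residual(x, sublayer_out, alpha=0.5):
    """Weighted residual connection followed by layer normalisation:
    S = LayerNorm(alpha * x + (1 - alpha) * Sublayer(x))."""
    return layer_norm(alpha * x + (1.0 - alpha) * sublayer_out)

# Toy usage: T time steps, d-dimensional features.
T, d = 4, 8
x = np.random.randn(T, d)
attn_out = np.random.randn(T, d)   # stand-in for the attention sub-layer output
S = weighted_residual(x, attn_out)
```

After normalisation each time step's feature vector has (approximately) zero mean and unit variance, which is what makes the residual sum well-scaled for the following layers.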
To address the problems of feature extractors lacking strongly supervised training and of insufficient temporal information in single-sequence model learning, a hierarchical sequence memory network with a multi-level iterative optimisation strategy is proposed for continuous sign language recognition. The method uses a spatial-temporal fusion convolution network (STFC-Net) to extract spatial-temporal information from RGB and optical-flow video frames, yielding multi-modal visual features of a sign language video. To strengthen the temporal relationships of the visual feature maps, a hierarchical memory sequence network then captures local utterance features and global context dependencies across the time dimension to obtain sequence features. Finally, a decoder produces the output sentence sequence. To further improve the feature extractor, the authors adopt a multi-level iterative optimisation strategy to fine-tune STFC-Net and the utterance feature extractor. Experimental results on the RWTH-Phoenix-Weather 2014 multi-signer dataset and a Chinese sign language dataset show the effectiveness and superiority of the method.
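The abstract describes combining RGB and optical-flow streams into multi-modal visual features. A minimal sketch of such two-stream fusion, assuming simple per-frame feature extraction and concatenation (the actual STFC-Net fusion is not specified in this excerpt, and the linear extractor below is a stand-in):

```python
import numpy as np

def extract_features(frames, W):
    """Stand-in per-frame feature extractor (a linear map here;
    STFC-Net would use spatial-temporal convolutions)."""
    return frames @ W

T, d_in, d_feat = 10, 32, 16
rgb = np.random.randn(T, d_in)         # flattened RGB frame descriptors
flow = np.random.randn(T, d_in)        # flattened optical-flow descriptors
W_rgb = np.random.randn(d_in, d_feat)
W_flow = np.random.randn(d_in, d_feat)

# Multi-modal visual features: concatenate the two streams per time step.
visual = np.concatenate([extract_features(rgb, W_rgb),
                         extract_features(flow, W_flow)], axis=-1)
```

The fused sequence of shape (T, 2·d_feat) is what a downstream sequence model (here, the hierarchical memory network) would consume.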
“…That is, SOP only focuses on the order of sentences and has no influence on the subject [18]. The ALBERT model input needs to add [CLS] at the beginning of the text, and the output vector corresponding to the input [CLS] contains the information encoding of the whole sentence, which can be used for text classification tasks [19].…”
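The snippet notes that the encoder's output at the [CLS] position encodes the whole sentence and can drive text classification. A minimal sketch of that step, assuming a precomputed [CLS] vector and a hypothetical linear classification head (768 matches ALBERT-base's hidden size; the random vector below merely stands in for the encoder output):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify_from_cls(cls_vec, W, b):
    """Linear classification head on the [CLS] sentence encoding."""
    return softmax(cls_vec @ W + b)

hidden, n_classes = 768, 3
cls_vec = np.random.randn(hidden)      # stand-in for the encoder's [CLS] output
W = np.random.randn(hidden, n_classes) * 0.01
b = np.zeros(n_classes)
probs = classify_from_cls(cls_vec, W, b)
```

In practice W and b are trained jointly with (or on top of) the fine-tuned encoder, and the argmax of `probs` gives the predicted class.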
To address the poor recognition of rare slot values in spoken language, which lowers the accuracy of spoken language understanding, a deep learning-based spoken language understanding method is designed. Local features of the semantic text are extracted and classified so that the classification results match the dialogue task, and an intention recognition algorithm is designed for the classification results; each datum has a corresponding intention label, completing the semantic slot-filling task. An attention mechanism is applied to the recognition of rare slot-value information: weights over the hidden states and the corresponding slot features are obtained, and the updated slot value is used to represent the tracking state. An auxiliary gate unit is constructed between the upper and lower slots of the historical dialogue, and word vectors are trained with deep learning to complete the spoken language understanding task. Simulation results show that the proposed method can carry out multiple rounds of man-machine spoken dialogue; compared with spoken language understanding methods based on recurrent networks, context information, and label decomposition, it achieves higher accuracy and F1 scores and has greater practical value.
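The abstract says attention weights over the hidden states yield slot features for rare slot values. A minimal sketch of that idea, assuming dot-product attention with a per-slot query vector (the query and all shapes here are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def slot_attention(hidden_states, slot_query):
    """Dot-product attention: weight each hidden state by its relevance
    to the slot query, then return the weighted sum as the slot feature."""
    scores = hidden_states @ slot_query        # (T,) relevance scores
    weights = softmax(scores)                  # attention weights over time
    return weights, weights @ hidden_states    # (T,), (d,)

T, d = 6, 16
hidden_states = np.random.randn(T, d)   # encoder hidden states per token
slot_query = np.random.randn(d)         # learned query for one slot type
weights, slot_feat = slot_attention(hidden_states, slot_query)
```

Because the weights sum to one, `slot_feat` is a convex combination of the hidden states, letting the model focus on the few tokens that mention a rare slot value.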