“…Ten papers use 3D CNNs for feature extraction [35,66,67,76,82,83,86,88,96,99]. These networks are able to extract spatio-temporal features, leveraging the temporal relations between neighboring frames in video data.…”
Section: Extraction Methods
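To illustrate why a kernel that spans the time axis captures the temporal relations between neighboring frames, here is a minimal, hypothetical sketch of a single-channel 3D convolution in NumPy. The cited papers use full deep 3D CNN architectures; the function name, the toy video, and the temporal-difference kernel below are illustrative stand-ins only.

```python
import numpy as np

def conv3d_single(video, kernel):
    """Naive valid 3D cross-correlation of a single-channel video
    (T, H, W) with a kernel (kt, kh, kw).

    Unlike a 2D convolution applied frame by frame, the kernel also
    spans the time axis, so each output value mixes information from
    neighboring frames -- the spatio-temporal feature extraction that
    3D CNNs perform (with learned kernels) at scale.
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

# A toy "video": 8 frames of 16x16 pixels.
video = np.random.rand(8, 16, 16)
# A hand-crafted temporal-difference kernel: responds to change
# between consecutive frames, and is zero on a static video.
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0 / 9
kernel[1] = 1.0 / 9
features = conv3d_single(video, kernel)
print(features.shape)  # (7, 14, 14)
```

On a completely static video this kernel produces an all-zero feature map, which is one way to see that the output genuinely encodes motion, not just appearance.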
“…A simple approach to feature extraction is to consider full video frames as inputs. Performing further pre-processing of the visual information to target hands, face, and pose information separately (referred to as a multi-cue approach) improves the performance of SLT models [36,59,65,75,80,86,96]. Zheng et al. [75] show through qualitative analysis that adding facial feature extraction improves translation accuracy in utterances where facial expressions are used.…”
Section: Multi-cue Approaches
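The multi-cue idea can be sketched as follows: crop the regions carrying each cue and extract a feature per cue before fusing them. Everything here is an invented stand-in, assuming detector-provided bounding boxes and using an intensity histogram in place of the learned CNN features a real SLT model would compute.

```python
import numpy as np

def crop(frame, box):
    """Crop a (H, W, C) frame to a (top, left, height, width) box."""
    top, left, h, w = box
    return frame[top:top+h, left:left+w]

def cue_features(patch, bins=8):
    """Stand-in feature extractor: an intensity histogram per patch.
    A real multi-cue SLT model would run a CNN on each cue instead."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0), density=True)
    return hist

frame = np.random.rand(256, 256, 3)
# Hypothetical detector output: boxes for the face and both hands.
boxes = {
    "face": (20, 100, 64, 64),
    "left_hand": (150, 40, 48, 48),
    "right_hand": (150, 170, 48, 48),
}
# Extract a feature vector per cue, then fuse by concatenation with
# the full-frame feature, so no single cue has to carry everything.
cues = [cue_features(crop(frame, box)) for box in boxes.values()]
fused = np.concatenate([cue_features(frame)] + cues)
print(fused.shape)  # (32,) = 8 bins x (full frame + 3 cues)
```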
“…Dey et al. [96] observe improvements in BLEU scores when adding lip reading as an input channel. By adding face crops as an additional channel, Miranda et al. [86] improve the performance of the TSPNet architecture [66].…”
Section: Multi-cue Approaches
“…The popularity of the RWTH-PHOENIX-Weather 2014T dataset facilitates the comparison of different SLT models on this dataset. We compare models based on their BLEU-4 score, as this is the only metric consistently reported in all of the papers using RWTH-PHOENIX-Weather 2014T (except [86]). An overview of Gloss2Text models is shown in Table 3.…”
Section: The RWTH-PHOENIX-Weather 2014T Benchmark
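For readers unfamiliar with the metric, a single-reference, sentence-level BLEU-4 can be sketched as below. Published scores are corpus-level and typically computed with standard tooling, so this toy version is for intuition only; the German example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, reference):
    """Sentence-level BLEU-4 with a single reference: the geometric mean
    of clipped 1- to 4-gram precisions, times a brevity penalty that
    punishes hypotheses shorter than the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing: any missing n-gram order zeroes the score
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "am morgen regnet es im norden"
print(bleu4(ref, ref))  # 1.0 for a perfect match
# A truncated hypothesis keeps perfect precisions but is penalized
# by the brevity penalty exp(1 - 6/4):
print(round(bleu4("am morgen regnet es", ref), 3))  # 0.607
```

Because of the 4-gram term and the lack of smoothing, sentence-level scores like these are brittle; corpus-level evaluation aggregates counts over the whole test set before taking the geometric mean.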
“…Zheng et al. [75] use an additional channel of facial information for Sign2Text and obtain an increase of 1.6 BLEU-4 compared to their baseline. Miranda et al. [86] augment TSPNet [66] with face crops, improving the performance of the network.…”
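The "additional channel" idea can be pictured as concatenating per-frame feature vectors from the two streams before they reach the translation encoder. The dimensions below are illustrative assumptions, not the values used in [75] or [86], and the actual fusion mechanisms in those architectures may differ.

```python
import numpy as np

# Hypothetical per-frame features from two streams of the same video:
# a full-frame stream and an added face-crop stream.
T = 10                               # number of frames
body_feats = np.random.rand(T, 512)  # e.g. from a 3D CNN on full frames
face_feats = np.random.rand(T, 128)  # e.g. from a CNN on face crops

# Channel-style fusion: concatenate per frame, so the downstream
# translation encoder sees both cues at every time step.
fused = np.concatenate([body_feats, face_feats], axis=1)
print(fused.shape)  # (10, 640)
```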
Automatic translation from signed to spoken languages is an interdisciplinary research domain at the intersection of computer vision, machine translation (MT), and linguistics. While the domain is growing in popularity, with the majority of scientific papers on sign language (SL) translation published in the past five years, research in this domain is performed mostly by computer scientists in isolation. This article presents an extensive and cross-domain overview of the work on SL translation. We first give a high-level introduction to SL linguistics and MT to illustrate the requirements of automatic SL translation. Then, we present a systematic literature review of the state of the art in the domain. Finally, we outline important challenges for future research. We find that significant advances have been made on the shoulders of spoken language MT research. However, current approaches often lack linguistic motivation or are not adapted to the different characteristics of SLs. We explore challenges related to the representation of SL data, the collection of datasets, and the evaluation of SL translation models. We advocate for interdisciplinary research and for grounding future research in linguistic analysis of SLs. Furthermore, the inclusion of deaf and hearing end users of SL translation applications in use case identification, data collection, and evaluation is of utmost importance in the creation of useful SL translation models.