A Comparative Study on End-to-end Speech to Text Translation

Bahar, Parnia; Bieschke, Tobias; Ney, Hermann

doi:10.48550/arxiv.1911.08870

Cited by 3 publications

(3 citation statements)

References 0 publications


“…There are two main research paradigms for ST, the end-to-end model, and the cascaded system (Sperber and Paulik, 2020; nie, 2019). End-to-end ST: Previous works (Bérard et al, 2016; Duong et al, 2016) have proved the potential for end-to-end ST, which has attracted intensive attentions (Vila et al, 2018; Salesky et al, 2018, 2019b; Di Gangi et al, 2019a; Bahar et al, 2019a; Di Gangi et al, 2019b; Inaguma et al, 2020). It's proved that pre-training (Weiss et al, 2017; Bérard et al, 2018; Bansal et al, 2018; Stoian et al, 2020) and multi-task learning (Vydana et al, 2020) can significantly improve the performance.…”

Section: Related Work (mentioning)

confidence: 99%

"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

Dong, Ye, Wang et al. 2020

Preprint


An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language. Inspired by neuroscience, humans have perception systems and cognitive systems to process different information, we propose TED, Transducer-Encoder-Decoder, a unified framework with triple supervision to decouple the end-to-end speech-to-text translation task. In addition to the target sentence translation loss, TED includes two auxiliary supervising signals to guide the acoustic transducer that extracts acoustic features from the input, and the semantic encoder to extract semantic features relevant to the source transcription text. Our method achieves state-of-the-art performance on both English-French and English-German speech translation benchmarks.


“…For speech translation, there are two main research paradigms, the end-to-end model and the cascaded system (Sperber and Paulik, 2020; nie, 2019). End-to-end ST: Previous works (Bérard et al, 2016; Duong et al, 2016) have given the first proof of the potential for end-to-end speech-to-text translation, which has attracted intensive attentions recently (Vila et al, 2018; Salesky et al, 2018, 2019b; Di Gangi et al, 2019a; Bahar et al, 2019a; Di Gangi et al, 2019b; Inaguma et al, 2020). Many works have proved that pre-training then transferring (Weiss et al, 2017; Bérard et al, 2018; Bansal et al, 2018; Stoian et al, 2020) and multi-task learning (Vydana et al, 2020) can significantly improve the performance of end-to-end models.…”

Section: Related Work (mentioning)

confidence: 99%

Consecutive Decoding for Speech-to-text Translation

Dong, Wang, Zhou et al. 2020

Preprint


End-to-end speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose SDST, an integral framework with Successive Decoding for end-to-end Speech-to-text Translation task. This method is verified in two mainstream datasets. Experiments show that our proposed SDST improves the previous state-of-the-art methods by big margins.


“…Automatic Sign Language Recognition (ASLR) is a challenging task and an active research field with the aim of reducing the dependency of sign language interpreters in the daily lives of the Deaf. Among the many similar problems attempted by deep learning researchers, sign language recognition bears a resemblance to video-based action recognition because of its shared medium of information (Varol et al, 2017), and to speech recognition and machine translation problems (Bahar et al, 2019;Bahdanau et al, 2017), due to its linguistic nature. However, there are certain aspects of ASLR that makes the task more challenging, one of which is the asynchronous multi-articulatory nature of the sign (Sutton-Spence and Woll, 1999).…”

Section: Introduction (mentioning)

confidence: 99%

BosphorusSign22k Sign Language Recognition Dataset

Özdemir, Kındıroğlu, Camgöz et al. 2020

Preprint


Sign Language Recognition is a challenging research domain. It has recently seen several advancements with the increased availability of data. In this paper, we introduce the BosphorusSign22k, a publicly available large scale sign language dataset aimed at computer vision, video recognition and deep learning research communities. The primary objective of this dataset is to serve as a new benchmark in Turkish Sign Language Recognition for its vast lexicon, the high number of repetitions by native signers, high recording quality, and the unique syntactic properties of the signs it encompasses. We also provide state-of-the-art human pose estimates to encourage other tasks such as Sign Language Production. We survey other publicly available datasets and expand on how BosphorusSign22k can contribute to future research that is being made possible through the widespread availability of similar Sign Language resources. We have conducted extensive experiments and present baseline results to underpin future research on our dataset.
