2022
DOI: 10.4000/ijcol.959

Direct Speech-to-Text Translation Models as Students of Text-to-Text Models

Abstract: Direct speech-to-text translation (ST) is an emerging approach that consists of performing the ST task with a single neural model. Although this paradigm comes with the promise to outperform the traditional pipeline systems, its rise is still limited by the paucity of speech-translation paired corpora compared to the large amount of speech-transcript and parallel bilingual corpora available to train previous solutions. As such, the research community focused on techniques to transfer knowledge from automatic s…

Cited by 3 publications
(3 citation statements)
References 63 publications
“…That 'architecture' comprises three components or sequential stages: speech recognition (SR), interlingual transfer, and speech synthesis. This three-stage conception of the process, sometimes referred to as the cascade(d) or pipeline model (e.g., Gaido et al. 2022), is in fact congruent with what Herbert (1952, 10) had envisioned for the interpreting process in his seminal handbook: "understanding" - "transference" - "speaking". In computer systems, however, the first stage is a conversion of the speech stream into written text, which then serves as input to the central machine translation component, the output of which in turn serves as input for the final stage of text-to-speech (TTS) synthesis.…”
Section: Research To Date (supporting)
confidence: 65%
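The three sequential stages of the cascade (pipeline) model described in the quote above can be sketched as follows. This is a minimal illustrative sketch, not any real system's API: all function names are hypothetical stubs, and the "models" are placeholder lookups standing in for real ASR, MT, and TTS components.

```python
# Sketch of the cascade (pipeline) architecture for speech translation.
# Each stage's output feeds the next, which is why recognition errors
# propagate into the translation and synthesis stages.

def speech_recognition(audio):
    """Stage 1: convert the speech stream into source-language text."""
    # Placeholder: a real system would run an ASR model on the waveform.
    return audio["transcript"]

def machine_translation(source_text, tgt_lang):
    """Stage 2: interlingual transfer of the written text."""
    # Placeholder lexicon standing in for an MT model (hypothetical).
    demo_lexicon = {"hello world": {"it": "ciao mondo"}}
    return demo_lexicon.get(source_text, {}).get(tgt_lang, source_text)

def text_to_speech(target_text):
    """Stage 3: synthesize speech from the target-language text."""
    # Placeholder: returns a label instead of an actual waveform.
    return f"<waveform for: {target_text}>"

def cascade_st(audio, tgt_lang="it"):
    """Run the full three-stage pipeline: ASR -> MT -> TTS."""
    text = speech_recognition(audio)
    translation = machine_translation(text, tgt_lang)
    return text_to_speech(translation)
```

A direct ST model, by contrast, would replace the first two stages (or all three) with a single neural network mapping audio to target text, avoiding the intermediate transcript.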
“…One approach to data augmentation is to apply knowledge distillation (KD), which was introduced to transfer knowledge from big to small models (Hinton et al., 2015). Among the possible methods, sequence-level KD (Kim and Rush, 2016) is one of the most popular ones in ST thanks to its application simplicity and the consistent improvements observed (Potapczyk and Przybysz, 2020; Xu et al., 2021; Gaido et al., 2022a). Sequence-level KD consists of replacing the target references of a given parallel training corpus with the predicted sequences generated by a teacher model (usually, an MT model), from which we want to distill the knowledge to a student model.…”
Section: Scaling Data (mentioning)
confidence: 99%
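The sequence-level KD recipe in the quote above is essentially a data transformation: keep the sources, discard the references, and substitute the teacher's predictions. A minimal sketch, assuming a parallel corpus of (source, reference) pairs and any callable standing in for the MT teacher (both hypothetical):

```python
def sequence_level_kd(corpus, teacher_translate):
    """Build a distilled corpus for sequence-level knowledge distillation.

    corpus: list of (source, reference) pairs from a parallel corpus.
    teacher_translate: callable mapping a source sentence to the teacher
        model's predicted translation (a stand-in for a real MT model).

    Returns (source, teacher_prediction) pairs; the reference targets are
    discarded and the student is trained on the teacher's outputs instead.
    """
    return [(src, teacher_translate(src)) for src, _ in corpus]

# Toy usage with an obviously artificial "teacher" (uppercasing):
distilled = sequence_level_kd([("hallo", "hello"), ("welt", "world")], str.upper)
```

Training the student on the teacher's outputs rather than the gold references makes the targets easier to fit (they reflect the teacher's own distribution), which is one common explanation for the consistent gains reported.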
“…Alongside the increased interest in the SimulST task, especially during the last year, we have witnessed an explosion in the use of large models (Latif et al., 2023), including speech foundation models (Radford et al., 2023; Pratap et al., 2023; Barrault et al., 2023a). These models are now commonly used alone or in combination with large language models (Gaido et al., 2024) for generic ST tasks. Among these, SeamlessM4T (Barrault et al., 2023a) has emerged as one of the most promising multimodal and multilingual models, covering more than 143 source languages and 200 target languages.…”
Section: Introduction (mentioning)
confidence: 99%