2022
DOI: 10.4000/ijcol.959

Direct Speech-to-Text Translation Models as Students of Text-to-Text Models

Abstract: Direct speech-to-text translation (ST) is an emerging approach that consists of performing the ST task with a single neural model. Although this paradigm comes with the promise to outperform the traditional pipeline systems, its rise is still limited by the paucity of speech-translation paired corpora compared to the large amount of speech-transcript and parallel bilingual corpora available to train previous solutions. As such, the research community focused on techniques to transfer knowledge from automatic s…

Cited by 3 publications
(3 citation statements)
References 63 publications
“…That 'architecture' comprises three components or sequential stages: speech recognition (SR), interlingual transfer, and speech synthesis. This three-stage conception of the process, sometimes referred to as the cascade(d) or pipeline model (e.g., Gaido et al. 2022), is in fact congruent with what Herbert (1952, 10) had envisioned for the interpreting process in his seminal handbook: "understanding" - "transference" - "speaking". In computer systems, however, the first stage is a conversion of the speech stream into written text, which then serves as input to the central machine translation component, the output of which in turn serves as input for the final stage of text-to-speech (TTS) synthesis.…”
Section: Research To Date (supporting)
confidence: 65%
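The three sequential stages of the cascade (pipeline) model described in the quote above can be sketched as follows. This is a minimal illustrative sketch, not any real system's API: all function names are hypothetical stubs, and the "models" are placeholder lookups standing in for real ASR, MT, and TTS components.

```python
# Sketch of the cascade (pipeline) architecture for speech translation.
# Each stage's output feeds the next, which is why recognition errors
# propagate into the translation and synthesis stages.

def speech_recognition(audio):
    """Stage 1: convert the speech stream into source-language text."""
    # Placeholder: a real system would run an ASR model on the waveform.
    return audio["transcript"]

def machine_translation(source_text, tgt_lang):
    """Stage 2: interlingual transfer of the written text."""
    # Placeholder lexicon standing in for an MT model (hypothetical).
    demo_lexicon = {"hello world": {"it": "ciao mondo"}}
    return demo_lexicon.get(source_text, {}).get(tgt_lang, source_text)

def text_to_speech(target_text):
    """Stage 3: synthesize speech from the target-language text."""
    # Placeholder: returns a label instead of an actual waveform.
    return f"<waveform for: {target_text}>"

def cascade_st(audio, tgt_lang="it"):
    """Run the full three-stage pipeline: ASR -> MT -> TTS."""
    text = speech_recognition(audio)
    translation = machine_translation(text, tgt_lang)
    return text_to_speech(translation)
```

A direct ST model, by contrast, would replace the first two stages (or all three) with a single neural network mapping audio to target text, avoiding the intermediate transcript.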
“…One approach to data augmentation is to apply knowledge distillation (KD), which was introduced to transfer knowledge from big to small models (Hinton et al., 2015). Among the possible methods, sequence-level KD (Kim and Rush, 2016) is one of the most popular ones in ST thanks to its application simplicity and the consistent improvements observed (Potapczyk and Przybysz, 2020; Xu et al., 2021; Gaido et al., 2022a). Sequence-level KD consists of replacing the target references of a given parallel training corpus with the predicted sequences generated by a teacher model (usually, an MT model), from which we want to distill the knowledge to a student model.…”
Section: Scaling Data (mentioning)
confidence: 99%
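The sequence-level KD recipe in the quote above is essentially a data transformation: keep the sources, discard the references, and substitute the teacher's predictions. A minimal sketch, assuming a parallel corpus of (source, reference) pairs and any callable standing in for the MT teacher (both hypothetical):

```python
def sequence_level_kd(corpus, teacher_translate):
    """Build a distilled corpus for sequence-level knowledge distillation.

    corpus: list of (source, reference) pairs from a parallel corpus.
    teacher_translate: callable mapping a source sentence to the teacher
        model's predicted translation (a stand-in for a real MT model).

    Returns (source, teacher_prediction) pairs; the reference targets are
    discarded and the student is trained on the teacher's outputs instead.
    """
    return [(src, teacher_translate(src)) for src, _ in corpus]

# Toy usage with an obviously artificial "teacher" (uppercasing):
distilled = sequence_level_kd([("hallo", "hello"), ("welt", "world")], str.upper)
```

Training the student on the teacher's outputs rather than the gold references makes the targets easier to fit (they reflect the teacher's own distribution), which is one common explanation for the consistent gains reported.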
“…Alongside the increased interest in the SimulST task, especially during the last year, we have witnessed an explosion in the use of large models (Latif et al., 2023), including speech foundation models (Radford et al., 2023; Pratap et al., 2023; Barrault et al., 2023a). These models are now commonly used alone or in combination with large language models (Gaido et al., 2024) for generic ST tasks. Among these, SeamlessM4T (Barrault et al., 2023a) has emerged as one of the most promising multimodal and multilingual models, covering more than 143 source languages and 200 target languages.…”
Section: Introduction (mentioning)
confidence: 99%