Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021) 2021
DOI: 10.18653/v1/2021.iwslt-1.20
ON-TRAC’ systems for the IWSLT 2021 low-resource speech translation and multilingual speech translation shared tasks

Abstract: This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2021: low-resource speech translation and multilingual speech translation. The ON-TRAC Consortium is composed of researchers from three French academic laboratories and an industrial partner: LIA (Avignon Université), LIG (Université Grenoble Alpes), LIUM (Le Mans Université), and researchers from Airbus. A pipeline approach was explored for the low-resource speech tran…

Cited by 2 publications (5 citation statements). References 17 publications (14 reference statements).
“…Optimal Transport (OT) (Peyré and Cuturi, 2019), traditionally used in NLP and MT (Chen et al., 2019; Alqahtani et al., 2021), has recently found its way into ST. Zhou et al. (2023) used OT to find the alignment between speech and text features in order to apply Mixup. Le et al. (2023) applied OT in a Siamese pretraining setting in combination with CTC, yielding improvements over standard ASR pretraining. Tsiamas et al. (2023) extended this pretraining to foundation models, while also freezing the text branch for better integration with the text decoder during ST fine-tuning.…”
Section: Optimal Transport
confidence: 99%
“…To align a speech representation h^s ∈ R^(n′×d) to the text representation h^x ∈ R^(m×d), we minimize their Wasserstein loss (Frogner et al., 2015) using Optimal Transport (OT) (Peyré and Cuturi, 2019), as in Le et al. (2023) and Tsiamas et al. (2023). We assume two uniform probability distributions φ^s and φ^x, with φ^s_i = 1/n′ and φ^x_j = 1/m, that define the mass of each position in the speech and text representations.…”
Section: Optimal Transport
confidence: 99%
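The alignment described in the statement above can be sketched in code. The following is a minimal, hypothetical illustration (not the implementation from any of the cited papers): it computes an entropy-regularized OT loss between a speech representation of n′ frames and a text representation of m tokens, using Sinkhorn iterations with the uniform masses 1/n′ and 1/m from the quoted passage. The function name, cost choice (squared Euclidean), and regularization strength `eps` are all assumptions for illustration.

```python
import numpy as np

def wasserstein_ot_loss(h_s, h_x, eps=0.1, n_iters=100):
    """Entropy-regularized OT loss between a speech representation
    h_s of shape (n', d) and a text representation h_x of shape (m, d),
    with uniform mass 1/n' and 1/m on each position (a sketch, not the
    cited papers' implementation)."""
    n, m = h_s.shape[0], h_x.shape[0]
    # Pairwise squared-Euclidean cost between every frame/token pair.
    cost = ((h_s[:, None, :] - h_x[None, :, :]) ** 2).sum(-1)
    # Uniform marginals, as in the quoted passage.
    phi_s = np.full(n, 1.0 / n)
    phi_x = np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)  # Gibbs kernel of the regularized problem
    u = np.ones(n)
    for _ in range(n_iters):  # Sinkhorn fixed-point scaling iterations
        v = phi_x / (K.T @ u)
        u = phi_s / (K @ v)
    transport = u[:, None] * K * v[None, :]  # approximate OT coupling
    # Transport cost under the coupling; this is the quantity minimized.
    return float((transport * cost).sum())
```

As a sanity check, identical speech and text representations yield a loss near zero (the coupling concentrates on matched positions), while mismatched representations yield a strictly larger loss, which is what makes it usable as a training signal for aligning the two modalities.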