Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
DOI: 10.18653/v1/2021.iwslt-1.8

Dealing with training and test segmentation mismatch: FBK@IWSLT2021

Abstract: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, which is a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter being generated with an MT system trai…
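
The synthetic data mentioned in the abstract correspond to sequence-level knowledge distillation (SeqKD): source transcripts are translated with an MT teacher and the resulting machine translations are paired with the original audio as training targets. A minimal sketch of this generation step, assuming a generic Hugging Face translation pipeline (the checkpoint name and example data are illustrative, not FBK's actual setup):

```python
# Sketch of SeqKD data generation: translate English transcripts with an MT
# teacher and use the outputs as synthetic German targets for the ST student.
# The checkpoint name below is an assumption, not the system used in the paper.
from transformers import pipeline

mt_teacher = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

def build_seqkd_targets(transcripts, max_length=256):
    """Return one synthetic German target per English transcript."""
    outputs = mt_teacher(transcripts, max_length=max_length)
    return [out["translation_text"] for out in outputs]

if __name__ == "__main__":
    transcripts = ["This paper describes our speech translation system."]
    for src, tgt in zip(transcripts, build_seqkd_targets(transcripts)):
        print(src, "->", tgt)
```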

Cited by 7 publications (11 citation statements) · References 33 publications

“…knowledge from the easier MT task, in which models obtain better performance, and hence improve the quality of the resulting ST student model. (Gaido et al. 2020a; Papi et al. 2021), instead, leverage KD from an MT model trained on a large amount of data to distill into the ST student model information that such a model could not directly access because of the different input modality. All these works employ the Word-KD method.…”
Section: Knowledge Distillation in ST
confidence: 99%
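
Word-KD trains the ST student to match the MT teacher's per-token output distribution. A minimal PyTorch sketch of such a loss, mixing a KL term against the teacher with the usual cross-entropy on the references (tensor shapes, the padding index, and the mixing weight are illustrative assumptions):

```python
# Word-level knowledge distillation (Word-KD) sketch: KL divergence between the
# ST student's and the MT teacher's per-token distributions, combined with
# cross-entropy on the reference tokens. Hyper-parameters are assumptions.
import torch
import torch.nn.functional as F

def word_kd_loss(student_logits, teacher_logits, targets,
                 temperature=1.0, kd_weight=0.5, pad_id=1):
    # student_logits, teacher_logits: (batch, tgt_len, vocab); targets: (batch, tgt_len)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)  # per token
    mask = targets.ne(pad_id).float()
    kd = (kd * mask).sum() / mask.sum()
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets, ignore_index=pad_id)
    return kd_weight * kd + (1.0 - kd_weight) * ce
```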
“…Context-aware ST models have been shown to be robust towards error-prone automatic segmentations of the test set at inference time (Zhang et al., 2021a). Our method bears similarities with Gaido et al. (2020b); Papi et al. (2021) in that it re-segments the train set to create synthetic data. However, unlike their approach, where they split at random words in the transcript, we use a specialized Audio Segmentation method (Tsiamas et al., 2022b) to directly split the audio into segments resembling proper sentences.…”
Section: Relevant Research
confidence: 99%
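
The random splitting referred to here can be pictured as cutting each training utterance at random word boundaries of its transcript and deriving the corresponding audio segments from word-level time alignments. A minimal sketch of that idea (the `(word, start, end)` representation and the segment-length bounds are hypothetical):

```python
# Sketch of random re-segmentation of the training set: split a transcript at
# random word boundaries and derive new segments from word-level timestamps.
# The input format and the segment-length bounds are illustrative assumptions.
import random

def random_resegment(words, min_words=3, max_words=20, seed=None):
    """words: list of (token, start_sec, end_sec) tuples for one utterance."""
    rng = random.Random(seed)
    segments, i = [], 0
    while i < len(words):
        n = rng.randint(min_words, max_words)
        chunk = words[i:i + n]
        segments.append({
            "text": " ".join(token for token, _, _ in chunk),
            "start": chunk[0][1],
            "end": chunk[-1][2],
        })
        i += n
    return segments
```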
“…As such, our training set comprised the synthetic data built using SeqKD and the native ST data, both filtered with the method described in Section 2.2. The two types of data were distinguished by means of a tag pre-pended to the target text (Gaido et al, 2020b;Papi et al, 2021a).…”
Section: Data
confidence: 99%
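
The tag mentioned here is a marker token prepended to the target text so the model can distinguish synthetic (SeqKD) examples from native ST examples during training. A minimal sketch (the tag strings are illustrative, not the ones used in the papers):

```python
# Sketch of target-side tagging to distinguish synthetic from native ST data:
# a marker token is prepended to the target text and learned like any other token.
SYNTHETIC_TAG = "<synthetic>"  # illustrative tag strings, not the papers' own
NATIVE_TAG = "<native>"

def tag_target(target_text, is_synthetic):
    return f"{SYNTHETIC_TAG if is_synthetic else NATIVE_TAG} {target_text}"

print(tag_target("Das ist ein Beispiel.", is_synthetic=True))
# -> "<synthetic> Das ist ein Beispiel."
```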
“…We add the CTC loss in the 8th encoder layer since (Papi et al., 2021a) has demonstrated that it compares favourably with adding the CTC on top of the encoder outputs or in other layers (Bahar et al., 2019).…”
confidence: 99%
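
Placing the CTC loss at an intermediate encoder layer means projecting that layer's hidden states to the source vocabulary and computing CTC against the transcripts, in addition to the main translation loss. A minimal PyTorch sketch with the auxiliary loss taken from the 8th of 12 layers (layer count, dimensions, and vocabulary size are illustrative assumptions):

```python
# Sketch of an auxiliary CTC loss computed on an intermediate encoder layer
# (the 8th of 12 here) rather than on top of the final encoder outputs.
# Dimensions, vocabulary size, and layer count are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithIntermediateCTC(nn.Module):
    def __init__(self, d_model=512, n_layers=12, ctc_layer=8, src_vocab=1000):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(n_layers)])
        self.ctc_layer = ctc_layer
        self.ctc_proj = nn.Linear(d_model, src_vocab)  # index 0 is the CTC blank

    def forward(self, x, transcripts, transcript_lens):
        # x: (batch, frames, d_model) speech features after subsampling
        ctc_loss = None
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == self.ctc_layer:
                log_probs = F.log_softmax(self.ctc_proj(x), dim=-1)  # (B, T, V)
                input_lens = torch.full((x.size(0),), x.size(1), dtype=torch.long)
                ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), transcripts,
                                      input_lens, transcript_lens, blank=0)
        return x, ctc_loss
```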