…Baseline Models. In Table 1, we compare our method with end-to-end baseline models whose audio inputs are 80-channel log Mel-filterbank features, including: Fairseq ST (Wang et al., 2020a), NeurST (Zhao et al., 2021a), ESPnet-ST (Inaguma et al., 2020), Dual-decoder Transformer (Le et al., 2020), SATE, Speechformer (Papi et al., 2021), the self-training and mutual-learning method (Zhao et al., 2021b), STAST, bi-KD (Inaguma et al., 2021), the MLT method (Tang et al., 2021b), Lightweight Adaptor (Le et al., 2021), JT-S-MT (Tang et al., 2021a), FAT-ST, TaskAware (Indurthi et al., 2021), and STPT (Tang et al., 2022). We also compare our method with baseline models that use pretrained Wav2vec2.0 as a module, including: …