Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.127
Speechformer: Reducing Information Loss in Direct Speech Translation

Abstract: Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields including speech translation. However, Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible…
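The "fixed sampling" compression the abstract criticizes is conventionally implemented as strided convolutions that shrink the audio sequence 4x before any attention layer, regardless of linguistic content. A minimal PyTorch sketch of that conventional front-end (shapes and sizes are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class FixedCompression(nn.Module):
    """Conventional speech-Transformer front-end: two strided 1D
    convolutions reduce the sequence length 4x before the encoder,
    with no regard to where word or phoneme boundaries fall."""

    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_mels) -> (batch, time // 4, d_model)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

feats = torch.randn(8, 1200, 80)        # ~12 s of 10 ms frames
print(FixedCompression()(feats).shape)  # torch.Size([8, 300, 512])
```

Speechformer's argument is that this reduction happens before the model can apply any linguistic criterion to decide which frames to merge.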

Cited by 14 publications (23 citation statements) · References 23 publications
“…Before training systems on huge corpora, we conduct preliminary experiments on the MuST-C benchmark to find a promising setting aimed at reducing the high computational costs of ST. First, we validate on different architectures the finding of previous works (Papi et al., 2021b) that ST models trained with an additional CTC loss do not need an initialization of the encoder with that of an ASR model. To this aim, we add a CTC loss whose targets are the lowercase transcripts without punctuation.…”
Section: Competitive ST Without Pre-training (mentioning)
confidence: 56%
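The auxiliary CTC objective described in this statement is conventionally combined with the translation cross-entropy as a weighted sum. A minimal PyTorch sketch of that combination (the `ctc_weight` value, padding index, and tensor names are illustrative assumptions, not the cited papers' code):

```python
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def st_loss(dec_logits, translation, ctc_logits, transcript,
            enc_lens, transcript_lens, ctc_weight=0.5):
    """Translation cross-entropy plus an auxiliary CTC term whose
    targets are the lowercased, punctuation-free transcripts."""
    # dec_logits: (batch, tgt_len, vocab); translation: (batch, tgt_len)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), translation,
                         ignore_index=0)  # 0 = pad id (assumed)
    # CTCLoss expects (time, batch, vocab) log-probabilities taken
    # from the encoder-side projection.
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
    aux = ctc(log_probs, transcript, enc_lens, transcript_lens)
    return ce + ctc_weight * aux
```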
“…As a first step, we compare different architectures proposed for ST: ST-adapted Transformer (Wang et al., 2020b), Conformer (Gulati et al., 2020), and Speechformer (Papi et al., 2021b). In addition, we also test a composite architecture made of a first stack of 8 Speechformer layers and a second stack of 4 Conformer layers.…”
Section: Model Selection (mentioning)
confidence: 99%
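The two-stack composite encoder mentioned in this statement can be sketched as a sequential composition. In the sketch below, `nn.TransformerEncoderLayer` stands in for the Speechformer layers (whose memory-efficient attention has no off-the-shelf implementation), while the upper stack uses torchaudio's Conformer; all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class CompositeEncoder(nn.Module):
    """Lower stack of 8 layers on the full-length sequence, followed
    by an upper stack of 4 Conformer layers, mirroring the 8+4 split
    described in the citation statement."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        lower_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=2048, batch_first=True)
        self.lower = nn.TransformerEncoder(lower_layer, num_layers=8)
        self.upper = Conformer(input_dim=d_model, num_heads=n_heads,
                               ffn_dim=2048, num_layers=4,
                               depthwise_conv_kernel_size=31)

    def forward(self, x: torch.Tensor, lengths: torch.Tensor):
        x = self.lower(x)              # (batch, time, d_model)
        return self.upper(x, lengths)  # torchaudio Conformer needs lengths

x = torch.randn(2, 100, 512)
out, out_lens = CompositeEncoder()(x, torch.tensor([100, 80]))
print(out.shape)  # torch.Size([2, 100, 512])
```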
“…In recent years, many architectures have been proposed to address the offline ST task (Inaguma et al., 2020; Le et al., 2020; Papi et al., 2021). Among them, the novel Conformer (Gulati et al., 2020) has recently shown impressive results both in speech recognition, for which it was first proposed, and in speech translation (Inaguma et al., 2021).…”
Section: Scaling Architecture (mentioning)
confidence: 99%
“…Baseline Models: In Table 1, we compare our method with end-to-end baseline models whose audio inputs are 80-channel log Mel-filterbank features, including: FairseqST (Wang et al., 2020a), NeurST (Zhao et al., 2021a), ESPnet-ST (Inaguma et al., 2020), Dual-decoder Transformer (Le et al., 2020), SATE, Speechformer (Papi et al., 2021), the self-training and mutual-learning method (Zhao et al., 2021b), STAST, bi-KD (Inaguma et al., 2021), the MLT method (Tang et al., 2021b), Lightweight Adaptor (Le et al., 2021), JT-S-MT (Tang et al., 2021a), FAT-ST, TaskAware (Indurthi et al., 2021), and STPT (Tang et al., 2022). We also compare our method to baseline models that have pretrained Wav2vec2.0 as a module, including:…”
Section: B. Experimental Details (mentioning)
confidence: 99%
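The 80-channel log Mel-filterbank features these baselines consume are a standard speech front-end. A short torchaudio sketch of extracting them (the file name and the 25 ms / 10 ms window settings are common assumptions, not taken from the cited systems):

```python
import torch
import torchaudio

wave, sr = torchaudio.load("utterance.wav")  # hypothetical input file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=80)(wave)
log_mel = torch.log(mel + 1e-6)  # (channels, 80, num_frames)
print(log_mel.shape)
```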