2020
DOI: 10.48550/arxiv.2007.06028
Preprint

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

Andy T. Liu,
Shang-Wen Li,
Hung-yi Lee

Abstract: We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn through the formulation of a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous approaches, we use a multi-target auxiliary task to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns through the reconstruction of acoustic frames from its a…
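The abstract describes a multi-target alteration-and-reconstruction objective. As a rough illustration (not the authors' implementation), the following PyTorch sketch alters log-mel input along time, frequency, and magnitude, then trains a Transformer encoder to reconstruct the clean frames with an L1 loss; all function names, layer sizes, and hyperparameters here are illustrative placeholders.

```python
# Hedged sketch of a TERA-style multi-target alteration + reconstruction step.
# alter_time / alter_channel / alter_magnitude are illustrative names, not the
# official implementation; sizes and ratios are placeholders.
import torch
import torch.nn as nn

def alter_time(x, mask_ratio=0.15, span=7):
    """Zero out random contiguous spans of frames (time alteration)."""
    x = x.clone()
    B, T, F = x.shape
    n_spans = max(1, int(T * mask_ratio / span))
    for b in range(B):
        for _ in range(n_spans):
            start = torch.randint(0, max(1, T - span), (1,)).item()
            x[b, start:start + span, :] = 0.0
    return x

def alter_channel(x, max_bins=8):
    """Zero out a random block of frequency bins (channel alteration)."""
    x = x.clone()
    B, T, F = x.shape
    width = torch.randint(0, max_bins + 1, (1,)).item()
    start = torch.randint(0, max(1, F - width), (1,)).item()
    x[:, :, start:start + width] = 0.0
    return x

def alter_magnitude(x, noise_std=0.2, prob=0.1):
    """Add Gaussian noise to a random subset of frames (magnitude alteration)."""
    noise = torch.randn_like(x) * noise_std
    noisy_frames = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) < prob).float()
    return x + noisy_frames * noise

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True), num_layers=3
)
head = nn.Linear(80, 80)          # reconstruct the 80-dim log-mel frames
l1 = nn.L1Loss()

clean = torch.randn(4, 200, 80)   # (batch, frames, mel bins) stand-in features
altered = alter_magnitude(alter_channel(alter_time(clean)))
recon = head(encoder(altered))
loss = l1(recon, clean)           # reconstruct the original frames from the altered input
loss.backward()
```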


Citation statements
Cited by 12 publications (26 citation statements)
References 39 publications (149 reference statements)
“…Deep Neural Networks have constantly pushed the state of the art in speech technologies, for example in automatic speech recognition (ASR) [2,3,4,5,6,7], pretrained speech transformers [8,9,10,11], and dialect, language, and speaker identification [12,13,14,15,16,17,18] models, along with other fields in Artificial Intelligence, including Natural Language Processing (NLP) [19] and Computer Vision (CV) [20]. While end-to-end deep architectures are simple, elegant, and provide a flexible training mechanism, they are inherently black boxes.…”
Section: Introduction (mentioning)
confidence: 99%
“…In this case, we used the official verification pairs to evaluate.…” [Footnotes in the cited text: 9. Last accessed: April 10, 2020. 10. Randomly selected ≈4 hours from each language.]
(mentioning)
confidence: 99%
“…Recently, self-supervised learning has shown great potential to empower a wide range of downstream tasks. For example, SimCLR [2] in the Computer Vision (CV) field provides performance comparable to supervised learning on image classification; word and sentence representations learned from BERT [3], GPT [4], and their followers [5,6,7] maintain state-of-the-art results on multiple downstream Natural Language Processing (NLP) tasks; speech representation extractors, such as wav2vec [8,9] and TERA [10], provide more informative features and show significant performance improvements in downstream applications like Automatic Speech Recognition (ASR).…”
Section: Introduction (mentioning)
confidence: 99%
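The statement above describes using pre-trained extractors such as wav2vec or TERA as feature providers for downstream tasks. A minimal sketch of that frozen-upstream / light-downstream recipe follows; the encoder, classifier, label set, and all shapes are stand-ins, not the actual checkpoints or APIs of those projects.

```python
# Hedged sketch of the common "frozen upstream, light downstream" recipe:
# a pre-trained speech encoder (TERA / wav2vec-style) is frozen and a small
# classifier is trained on its representations. Everything here is a placeholder,
# not a real checkpoint loader.
import torch
import torch.nn as nn

pretrained_encoder = nn.TransformerEncoder(           # stand-in for the upstream model
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True), num_layers=3
)
for p in pretrained_encoder.parameters():
    p.requires_grad = False                           # freeze the upstream

downstream = nn.Linear(80, 42)                        # e.g. 42 phoneme classes (placeholder)
optim = torch.optim.Adam(downstream.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

features = torch.randn(4, 200, 80)                    # (batch, frames, mel bins)
labels = torch.randint(0, 42, (4, 200))               # frame-level labels (placeholder)

with torch.no_grad():
    reps = pretrained_encoder(features)               # extract frozen representations
logits = downstream(reps)                             # (4, 200, 42)
loss = ce(logits.reshape(-1, 42), labels.reshape(-1))
loss.backward()
optim.step()                                          # only the downstream head is updated
```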
“…There are mainly three self-supervised learning paradigms in the speech domain: Autoregressive Predictive Coding (APC) [11,12], Contrastive Predictive Coding (CPC) [13,8,9], and Masked Predictive Coding (MPC) [14,10], all of which try to encode semantic information (e.g., phonetic information) from contextual speech and output learned features for downstream tasks. Similar to autoregressive language model training in the NLP domain, APC tries to predict future frames by encoding previous context in an autoregressive manner.…”
Section: Introduction (mentioning)
confidence: 99%
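As a concrete illustration of the APC paradigm just described (predicting future frames from past context), here is a minimal sketch assuming an LSTM predictor, 80-dim log-mel features, and a shift of n = 3 frames; these choices are illustrative, not the cited papers' exact configuration.

```python
# Hedged sketch of the APC objective: predict the frame n steps ahead from past
# context with a unidirectional model, trained with an L1 reconstruction loss.
import torch
import torch.nn as nn

n = 3                                                  # predict 3 frames into the future
rnn = nn.LSTM(input_size=80, hidden_size=512, num_layers=3, batch_first=True)
proj = nn.Linear(512, 80)
l1 = nn.L1Loss()

x = torch.randn(4, 200, 80)                            # (batch, frames, mel bins)
out, _ = rnn(x[:, :-n, :])                             # encode past context only
pred = proj(out)                                       # predicted future frames
loss = l1(pred, x[:, n:, :])                           # target is the input shifted by n frames
loss.backward()
```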