2020
DOI: 10.48550/arxiv.2011.00406
Preprint

Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies

Cited by 7 publications (9 citation statements)
References 0 publications
“…

| Method | Network | #Params | Stride | Input | Corpus | Pretraining | Official Github |
|---|---|---|---|---|---|---|---|
| FBANK | - | 0 | 10ms | waveform | - | - | - |
| PASE+ [16] | SincNet, 7-Conv, 1-QRNN | 7.83M | 10ms | waveform | LS 50 hr | multi-task | santi-pdp / pase |
| APC [7] | 3-GRU | 4.11M | 10ms | FBANK | LS 360 hr | F-G | iamyuanchung / APC |
| VQ-APC [32] | 3-GRU | 4.63M | 10ms | FBANK | LS 360 hr | F-G + VQ | iamyuanchung / VQ-APC |
| NPC [33] | 4-Conv, 4-Masked Conv | 19.38M | 10ms | FBANK | LS 360 hr | M-G + VQ | Alexander-H-Liu / NPC |
| Mockingjay [8] | 12-Trans | 85.12M | 10ms | FBANK | LS 360 hr | time M-G | s3prl / s3prl |
| TERA [9] | 3-Trans | 21.33M | 10ms | FBANK | LS 960 hr | time/freq M-G | s3prl / s3prl |
| modified CPC [34] | 5-Conv, 1-LSTM | 1.84M | 10ms | waveform | LL 60k hr | F-C | facebookresearch / CPC audio |
| wav2vec [12] | 19-Conv | 32.54M | 10ms | waveform | LS 960 hr | F-C | pytorch / fairseq |
| vq-wav2vec [13] | 20… (remaining rows truncated in the source) | | | | | | |

…2.0 did not officially release the fixed representation usage. We extract the last-layer representation for the Base model as was done in decoar 2.0 [10], which showed promising ASR results.…”
Section: Methods
confidence: 99%
“…Generative modeling has long been a prevailing approach to learning speech representations [7,8,10]. The instances of generative modeling investigated here include APC [7], VQ-APC [32], Mockingjay [8], TERA [9], and NPC [33]. APC adopts a language-model-like pretraining scheme on a sequence of acoustic features (FBANK), using a unidirectional RNN to generate future frames conditioned on past frames.…”
Section: Framework: Universal Representation
confidence: 99%
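The APC objective described in the statement above amounts to predicting a frame a few steps in the future from everything observed so far, with an L1 reconstruction loss. A minimal NumPy sketch of that shifted-prediction loss — the `apc_style_l1_loss` name and the linear `predictor` callback are hypothetical stand-ins for the paper's unidirectional GRU, not the authors' code:

```python
import numpy as np

def apc_style_l1_loss(features, predictor, shift=3):
    """Future-frame prediction loss in the spirit of APC.

    features:  (T, D) acoustic features (e.g. 80-dim FBANK frames)
    predictor: callable mapping the frames up to step t to one predicted frame
    shift:     how many steps ahead to predict (APC uses a small horizon)
    """
    T, _ = features.shape
    # Predict frame t+shift from the prefix features[:t+1], for every valid t.
    preds = np.stack([predictor(features[: t + 1]) for t in range(T - shift)])
    targets = features[shift:]             # ground-truth future frames
    return np.abs(preds - targets).mean()  # L1 reconstruction loss

# Toy "predictor" that just repeats the last observed frame,
# standing in for the learned RNN.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 80)).astype(np.float32)
loss = apc_style_l1_loss(x, predictor=lambda past: past[-1], shift=3)
```

Because the predictor only ever sees past frames, minimizing this loss forces the model's hidden state to encode information useful for anticipating the future, which is the representation later reused for downstream tasks.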
“…In addition to the knowledge-based speech feature set, we propose to evaluate our framework on SUPERB (Speech Processing Universal PERformance Benchmark) [24], which is designed to provide a standard and comprehensive testbed for pre-trained models on various downstream speech tasks. We compute the deep speech representations from the pretrained models that are available in SUPERB, including APC [25], Vq-APC [26], Tera [27], NPC [28], and DeCoAR 2.0 [29]. We further compute the global average of the last layer's hidden state as the final feature from the pre-trained model's output.…”
Section: Data Preprocessing
confidence: 99%
“…We further compute the global average of the last layer's hidden state as the final feature from the pre-trained model's output. Using the last hidden state is suggested in prior works for downstream tasks [25], [28], [29], [30]. In summary, the feature sizes are 988 in Emo-Base; 512 in APC, Vq-APC, and NPC; 768 in Tera and DeCoAR 2.0.…”
Section: Data Preprocessing
confidence: 99%
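The feature-extraction step these statements describe — a global average of the last layer's hidden states over time — is mean pooling, which collapses a variable-length sequence into one fixed-size utterance vector. A minimal sketch, where `pooled_feature` is a hypothetical helper name and 512 matches the APC/Vq-APC/NPC sizes quoted above:

```python
import numpy as np

def pooled_feature(last_hidden_states):
    """Global average over time of a pretrained model's last-layer output.

    last_hidden_states: (T, H) array of per-frame hidden states;
    returns a single (H,) utterance-level feature vector.
    """
    return last_hidden_states.mean(axis=0)

# e.g. a 50-frame utterance with hidden size 512 (APC-sized)
h = np.ones((50, 512), dtype=np.float32)
feat = pooled_feature(h)   # shape (512,), independent of utterance length
```

Averaging over the time axis is what makes the feature size depend only on the model's hidden dimension (512 or 768 in the statements above), not on utterance length.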
“…Finally, we calculate the global average of the last layer's hidden state as the final feature from the pre-trained model's output. Using the last hidden state is suggested in prior works for downstream tasks [21,26,23,27]. Our feature sizes are 988 in Emo-Base; 512 in APC; 768 in Tera, DistilHuBERT, and DeCoAR 2.0.…”
confidence: 99%