2020
DOI: 10.48550/arxiv.2011.00406
Preprint

Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies

Cited by 7 publications (9 citation statements)
References 0 publications
“…

| Method | Network | #Params | Stride | Input | Corpus | Pretraining | Official Github |
|---|---|---|---|---|---|---|---|
| FBANK | - | 0 | 10ms | waveform | - | - | - |
| PASE+ [16] | SincNet, 7-Conv, 1-QRNN | 7.83M | 10ms | waveform | LS 50 hr | multi-task | santi-pdp / pase |
| APC [7] | 3-GRU | 4.11M | 10ms | FBANK | LS 360 hr | F-G | iamyuanchung / APC |
| VQ-APC [32] | 3-GRU | 4.63M | 10ms | FBANK | LS 360 hr | F-G + VQ | iamyuanchung / VQ-APC |
| NPC [33] | 4-Conv, 4-Masked Conv | 19.38M | 10ms | FBANK | LS 360 hr | M-G + VQ | Alexander-H-Liu / NPC |
| Mockingjay [8] | 12-Trans | 85.12M | 10ms | FBANK | LS 360 hr | time M-G | s3prl / s3prl |
| TERA [9] | 3-Trans | 21.33M | 10ms | FBANK | LS 960 hr | time/freq M-G | s3prl / s3prl |
| modified CPC [34] | 5-Conv, 1-LSTM | 1.84M | 10ms | waveform | LL 60k hr | F-C | facebookresearch / CPC audio |
| wav2vec [12] | 19-Conv | 32.54M | 10ms | waveform | LS 960 hr | F-C | pytorch / fairseq |
| vq-wav2vec [13] | 20… (remaining rows truncated in the source) | | | | | | |

…2.0 did not officially release the fixed representation usage. We extract the last-layer representation for the Base model as was done in decoar 2.0 [10], which showed promising ASR results.…”
Section: Methods
confidence: 99%
“…Generative modeling has long been a prevailing approach to learning speech representations [7,8,10]. The instances of generative modeling investigated here include APC [7], VQ-APC [32], Mockingjay [8], TERA [9], and NPC [33]. APC adopts a language-model-like pretraining scheme on a sequence of acoustic features (FBANK), using a unidirectional RNN to generate future frames conditioned on past frames.…”
Section: Framework: Universal Representation
confidence: 99%
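The APC objective described in the statement above amounts to predicting a frame a few steps in the future from everything observed so far, with an L1 reconstruction loss. A minimal NumPy sketch of that shifted-prediction loss — the `apc_style_l1_loss` name and the linear `predictor` callback are hypothetical stand-ins for the paper's unidirectional GRU, not the authors' code:

```python
import numpy as np

def apc_style_l1_loss(features, predictor, shift=3):
    """Future-frame prediction loss in the spirit of APC.

    features:  (T, D) acoustic features (e.g. 80-dim FBANK frames)
    predictor: callable mapping the frames up to step t to one predicted frame
    shift:     how many steps ahead to predict (APC uses a small horizon)
    """
    T, _ = features.shape
    # Predict frame t+shift from the prefix features[:t+1], for every valid t.
    preds = np.stack([predictor(features[: t + 1]) for t in range(T - shift)])
    targets = features[shift:]             # ground-truth future frames
    return np.abs(preds - targets).mean()  # L1 reconstruction loss

# Toy "predictor" that just repeats the last observed frame,
# standing in for the learned RNN.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 80)).astype(np.float32)
loss = apc_style_l1_loss(x, predictor=lambda past: past[-1], shift=3)
```

Because the predictor only ever sees past frames, minimizing this loss forces the model's hidden state to encode information useful for anticipating the future, which is the representation later reused for downstream tasks.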
“…In addition to the knowledge-based speech feature set, we propose to evaluate our framework on SUPERB (Speech Processing Universal PERformance Benchmark) [24], which is designed to provide a standard and comprehensive testbed for pre-trained models on various downstream speech tasks. We compute the deep speech representations from the pretrained models that are available in SUPERB, including APC [25], Vq-APC [26], Tera [27], NPC [28], and DeCoAR 2.0 [29]. We further compute the global average of the last layer's hidden state as the final feature from the pre-trained model's output.…”
Section: Data Preprocessing
confidence: 99%
“…We further compute the global average of the last layer's hidden state as the final feature from the pre-trained model's output. Using the last hidden state is suggested in prior works for downstream tasks [25], [28], [29], [30]. In summary, the feature sizes are 988 in Emo-Base; 512 in APC, Vq-APC, and NPC; 768 in Tera and DeCoAR 2.0.…”
Section: Data Preprocessing
confidence: 99%
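The feature-extraction step these statements describe — a global average of the last layer's hidden states over time — is mean pooling, which collapses a variable-length sequence into one fixed-size utterance vector. A minimal sketch, where `pooled_feature` is a hypothetical helper name and 512 matches the APC/Vq-APC/NPC sizes quoted above:

```python
import numpy as np

def pooled_feature(last_hidden_states):
    """Global average over time of a pretrained model's last-layer output.

    last_hidden_states: (T, H) array of per-frame hidden states;
    returns a single (H,) utterance-level feature vector.
    """
    return last_hidden_states.mean(axis=0)

# e.g. a 50-frame utterance with hidden size 512 (APC-sized)
h = np.ones((50, 512), dtype=np.float32)
feat = pooled_feature(h)   # shape (512,), independent of utterance length
```

Averaging over the time axis is what makes the feature size depend only on the model's hidden dimension (512 or 768 in the statements above), not on utterance length.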
“…Finally, we calculate the global average of the last layer's hidden state as the final feature from the pre-trained model's output. Using the last hidden state is suggested in prior works for downstream tasks [21,26,23,27]. Our feature sizes are 988 in Emo-Base; 512 in APC; 768 in Tera, DistilHuBERT, and DeCoAR 2.0.…”
confidence: 99%