TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

Liu, Andy T.; Li, Shang-Wen; Lee, Hung-yi

doi:10.48550/arxiv.2007.06028

Cited by 12 publications

(26 citation statements)

References 39 publications

(149 reference statements)


“…Deep Neural Networks have constantly pushed the state-of-the-art in speech technologies, for example automatic speech recognition (ASR) [2,3,4,5,6,7], pretrained speech transformers [8,9,10,11], dialect, language and speaker identification [12,13,14,15,16,17,18] models; along with other fields in Artificial Intelligence, including Natural Language Processing (NLP) [19] and Computer Vision (CV) [20]. While end-to-end deep architectures are simple, elegant and provide a flexible training mechanism, they are inherently black-boxes.…”

Section: Introduction (mentioning)

confidence: 99%


What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis

Chowdhury, Durrani, Ahmed

2021

Preprint


End-to-end deep neural network architectures have pushed the state-of-the-art in speech technologies, as well as in other spheres of Artificial Intelligence, subsequently leading researchers to train more complex and deeper models. These improvements came at the cost of transparency. Deep neural networks are innately opaque and difficult to interpret, compared to the traditional handcrafted feature-based models. We no longer understand what features are learned within these deep models, where they are preserved, and how they inter-operate. Such an analysis is important for better understanding of the models, for debugging and to ensure fairness in ethical decision making. In this work, we analyze the representations learned within deep speech models trained for the tasks of speaker recognition, dialect identification and reconstruction of masked signals. Specifically, we carry out a layer- and neuron-level analysis of the utterance-level representations captured within pretrained speech models for speaker, language and channel properties. We study the following questions: (i) is the information captured in the learned representations? (ii) where is it preserved and how is it distributed? and (iii) can we identify a minimal subset of the network that possesses this information? To answer these questions, we use a probing framework commonly known as diagnostic classifiers [1]. Our results reveal interesting findings such as: (i) channel and gender information is distributed across the network; (ii) the information is redundantly distributed in neurons with respect to a task (up to 80% in some cases); (iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network, (iv) and are localised in the upper layers; (v) we can extract a minimal subset of neurons encoding the pre-defined property; (vi) salient neurons are sometimes shared between properties; (vii) our analysis highlights presence of


“…In this case, we used the official verification pairs to evaluate. 9 last accessed: April 10, 2020 10. Randomly selected ≈4 hours from each language.…”

mentioning

confidence: 99%

What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis

Chowdhury, Durrani, Ahmed

2021

Preprint


“…Recently, self-supervised learning has shown great potential to empower a wide range of downstream tasks. For example, SimCLR [2] in the Computer Vision (CV) field provides performance comparable to supervised learning on the image classification task; word and sentence representations learned from BERT [3], GPT [4] and their followers [5,6,7] maintain state-of-the-art results in multiple downstream Natural Language Processing (NLP) tasks; speech representation extractors, like wav2vec [8,9] and TERA [10], provide more informative features and show significant performance improvement in downstream applications like Automatic Speech Recognition (ASR).…”

Section: Introduction (mentioning)

confidence: 99%

“…There are mainly three self-supervised learning paradigms in speech domain: Autoregressive Predictive Coding (APC) [11,12], Contrastive Predictive Coding (CPC) [13,8,9] and Masked Predictive Coding (MPC) [14,10], all of which try to encode semantic information (e.g., phonetic information) from contextual speech and output learned features for downstream tasks. Similar to autoregressive language model training in NLP domain, APC tries to predict future frames by encoding previous context in an autoregressive manner.…”

Section: Introduction (mentioning)

confidence: 99%


Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency

Tian, Gu, Wang et al. 2021

Preprint


Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance. However, both the training and inference process of these models may encounter prohibitively high computational cost and large parameter budget. Although Parameter Sharing Strategy (PSS) proposed in ALBERT paves the way for parameter reduction, the computation required remains the same. Interestingly, we found in experiments that distributions of feature embeddings from different Transformer layers are similar when PSS is integrated: a property termed as Layer Consistency (LC) in this paper. Given this similarity of feature distributions, we assume that feature embeddings from different layers would have similar representing power. In this work, Layer Consistency enables us to adopt Transformer-based models in a more efficient manner: the number of Conformer layers in each training iteration could be uniformly sampled and Shallow Layer Inference (SLI) could be applied to reduce the number of layers in inference stage. In experiments, our models are trained with LibriSpeech dataset and then evaluated on both phone classification and Speech Recognition tasks. We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
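As a rough illustration of the idea in this abstract (one shared layer reused at every depth, a uniformly sampled number of layers per training iteration, and a shallower stack at inference), here is a minimal sketch. It is not the authors' implementation: `shared_layer`, `forward`, `sample_depth`, the scalar `weight`, the residual tanh map standing in for a Conformer layer, and the depths 12 and 4 are all hypothetical choices made for illustration.

```python
import math
import random

def shared_layer(x, weight):
    """Hypothetical stand-in for one layer; a residual tanh map, not a Conformer block."""
    return [xi + math.tanh(weight * xi) for xi in x]

def forward(x, weight, num_layers):
    """Apply the same layer (same weight) num_layers times, mimicking parameter sharing."""
    for _ in range(num_layers):
        x = shared_layer(x, weight)
    return x

def sample_depth(max_layers, rng):
    """Uniformly sample how many layers to run in one training iteration (1..max_layers)."""
    return rng.randint(1, max_layers)

rng = random.Random(0)      # fixed seed so the sketch is reproducible
max_layers = 12             # assumed full depth, chosen for illustration
features = [0.1, -0.2, 0.3]  # toy feature vector standing in for a speech frame
weight = 0.5                # the single shared parameter in this toy model

# Training-time behaviour: each iteration runs a randomly chosen number of layers.
depths = [sample_depth(max_layers, rng) for _ in range(5)]
train_outputs = [forward(features, weight, d) for d in depths]

# Inference-time behaviour: run only a few of the shared layers.
shallow_output = forward(features, weight, 4)
```

Because every depth reuses the same parameters, running fewer layers changes the amount of computation but not the parameter count, which is the distinction the abstract draws between parameter reduction and compute reduction.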


Assessing Schizophrenia Patients Through Linguistic and Acoustic Features Using Deep Learning Techniques

Huang, Lin, Liu et al. 2022

IEEE Trans. Neural Syst. Rehabil. Eng.


Thought, language, and communication disorders are among the salient characteristics of schizophrenia. Such impairments are often exhibited in patients' conversations. Research has shown that assessments of thought disorder are crucial for tracking clinical patients' conditions and for early detection of clinical high risk. Detecting such symptoms requires a trained clinician's expertise, which is prohibitive due to cost and the high patient-to-clinician ratio. In this paper, we propose a machine learning method using a Transformer-based model to help automate the assessment of the severity of thought disorder in schizophrenia. The proposed model uses both textual and acoustic features of speech between occupational therapists or psychiatric nurses and schizophrenia patients to predict the level of their thought disorder. Experimental results show that the proposed model can closely predict the results of assessments for schizophrenia patients based on the extracted semantic, syntactic and acoustic features. Thus, we believe our model can be a helpful tool for doctors when they are assessing schizophrenia patients.
