Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features

Kelleher, John D.; Carson-Berndsen, Julie

doi:10.18653/v1/2022.sigmorphon-1.9

Cited by 4 publications (7 citation statements)

References 11 publications

“…The results in the previous sections show that MLP-based classifiers perform adequately on the latent representations for all layers (Table 1, in line with [2,3]), while the internal phone-phone structure shows a varying pattern across layers (Figs. 1, 2).…”

Section: Local and Global Structure (supporting)

confidence: 57%

“…Importantly, this does not mean that those representations contain weaker phonetic information. After all, [2,3,4] have shown that classifiers that use these representations for a range of downstream tasks outperform the results of similar classifiers that operate on conventional spectral representations. The figure shows that the acoustic organisation in the higher transformer layers differs from the lowest transformer layer, in line with the findings in the next sections.…”

Section: Searching For Acoustic-phonetic Structure (mentioning)

confidence: 99%

“…In our second method we trained MLP-based phone classifiers similar to the approach reported in other probing papers (e.g. [3]). First, we used both a pre-trained Wav2vec2 model without any fine-tuning, and a fine-tuned model trained on the core part of the read aloud books component (component o) in the Spoken Dutch Corpus [14].…”

Section: Multi-layer Perceptron Phone Classifiers (mentioning)

confidence: 99%

“…The recent emergence of so-called end-to-end systems (E2E), such as Wav2Vec 2.0 [1] (henceforth Wav2vec2), has revolutionized automatic speech recognition (ASR) in many ways. At the same time many researchers whose interest is not primarily to obtain the lowest possible transcription error rate are asking whether the representations on some or all layers of E2E models contain information that can be harnessed for other downstream tasks [2,3,4]. These approaches are collectively known as probing.…”

Section: Introduction (mentioning)

confidence: 99%

“…Many probing approaches take the representations on some layer of a deep neural network (DNN) as the data with which some classifier is trained for some specific task, such as phonetic feature extraction or phone classification. Previous research such as [2,3,4,9] mainly focused on identifying the layer whose representations yielded the best classification performance. These studies show that phone classification is possible with convincing performance.…”

Section: Introduction (mentioning)

confidence: 99%

Phonemic competition in end-to-end ASR models

Bosch, Bentum, Boves. Interspeech 2023.

Advanced end-to-end ASR systems encode speech signals by means of a multi-layer network architecture. In Wav2vec2.0, for example, a CNN is used as feature encoder on top of which transformer layers are used to map the high-dimensional CNN representations to the elements of some lexicon. Compared to the previous generation of 'modular' ASR systems it is much more difficult to interpret the processing and representations in an end-to-end system from a phonetic point of view. We built a Wav2vec2.0-based end-to-end system for producing broad phonetic transcriptions of Dutch. In this paper we investigate to what extent the CNN features and the representations on several transformer layers of a pre-trained and fine-tuned model reflect widely-shared phonetic knowledge. For that purpose we analyze distances between phones and the phonetic features of the most-activated phones in the output of an MLP classifier operating on the representations in several layers.
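One way to picture the "distances between phones" analysis mentioned in the abstract is sketched below. This is only an illustrative sketch, not the authors' procedure: it assumes that each frame of a layer's output has already been paired with a phone label, averages the frames per phone, and compares the resulting centroids with cosine distance. The function names, the choice of cosine distance, and the two-dimensional toy vectors are all assumptions made for the example.

```python
import math
from collections import defaultdict


def phone_centroids(frames):
    """Average the vectors of each phone: (label, vector) pairs -> {label: mean vector}."""
    sums = {}
    counts = defaultdict(int)
    for phone, vec in frames:
        if phone not in sums:
            sums[phone] = [0.0] * len(vec)
        sums[phone] = [s + v for s, v in zip(sums[phone], vec)]
        counts[phone] += 1
    return {p: [s / counts[p] for s in total] for p, total in sums.items()}


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def phone_distance_matrix(frames):
    """Pairwise cosine distances between phone centroids, keyed by (phone, phone)."""
    cents = phone_centroids(frames)
    phones = sorted(cents)
    return {(p, q): cosine_distance(cents[p], cents[q]) for p in phones for q in phones}


# Hypothetical toy data standing in for one layer's frame-level representations.
toy_frames = [
    ("a", [1.0, 0.0]), ("a", [0.8, 0.2]),
    ("i", [0.9, 0.3]),
    ("s", [0.0, 1.0]), ("s", [0.1, 0.9]),
]
distances = phone_distance_matrix(toy_frames)
```

In this toy setup the two vowels end up closer to each other than either is to the fricative. Repeating the computation per layer would give one distance table per layer, which could then be compared across layers.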

Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Zahran, Fahmy, Wassif, et al. IEEE Access, 2023.

Automatic pronunciation assessment models are regularly used in language learning applications. Common methodologies for pronunciation assessment use feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or deep learning speech recognition models to perform speech assessment. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been utilized to extract contextual speech representations, showing improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained using a two-step training process. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations for pronounced phonemes. In the second step, transfer learning is used to obtain a pronunciation scoring model that uses a Siamese neural network to compare the pronounced phoneme representations to embeddings of the canonical phonemes and produce the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, which is similar to the state-of-the-art GOPT-PAII model while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first utilization of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
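The Pearson correlation coefficient (PCC) reported in the abstract measures how closely predicted scores track reference scores. A minimal sketch of the standard formula follows; the function name and the toy score lists are hypothetical and are not taken from the E2E-R code or its data.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, each left unnormalised because the 1/n factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical phoneme-level scores: model output versus human annotation.
model_scores = [1.8, 0.4, 1.1, 2.0, 0.9]
human_scores = [2.0, 0.0, 1.0, 2.0, 1.0]
pcc = pearson_correlation(model_scores, human_scores)
```

A value near 1 means the model ranks and spaces the pronunciations much as the annotators do, a value near 0 means there is no linear relationship between the two sets of scores.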