ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682903
Acoustically Grounded Word Embeddings for Improved Acoustics-to-word Speech Recognition

Abstract: Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are simpler to train, and more efficient to decode with, than sub-word systems. However, A2W systems can have difficulties at training time when data is limited, and at decoding time when recognizing words outside the training vocabulary. To address these shortcomings, we investigate the use of recently proposed acoustic and acoustically grounded word embedding techniques in A2W systems. The idea is based on treating the final p…
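The abstract is truncated, but the stated idea of treating the final projection of an A2W model in terms of word embeddings can be illustrated under one assumption: the output layer scores acoustic hidden states against a matrix of pretrained acoustically grounded word embeddings (AGWEs) by dot product, rather than learning a free per-word softmax row. This is a hedged sketch, not the authors' implementation; `agwe_output_logits` is a hypothetical name.

```python
import numpy as np

def agwe_output_logits(h, agwe_matrix):
    """Score acoustic states against a fixed AGWE matrix.

    h:           (T, d) acoustic hidden states from the encoder.
    agwe_matrix: (V, d) pretrained acoustically grounded word
                 embeddings, one row per vocabulary word.

    Tying the output layer to the embeddings (rather than learning
    a free (V, d) projection) means a word unseen in training can,
    in principle, be scored by plugging in its embedding row.
    """
    return h @ agwe_matrix.T  # (T, V) word logits
```

One consequence of this design is that the output layer's parameter count no longer grows freely with vocabulary size, which is the motivation given for limited-data and out-of-vocabulary settings.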

Cited by 31 publications (37 citation statements)
References 26 publications
“…2, we use PWCCA to measure similarity between the W2V2 layer representations and various continuous-valued quantities of interest, either (i) from a different layer of the same model (CCA-intra), (ii) from a fine-tuned version of the model (CCA-inter), or (iii) from an external representation. For the third type of analysis we use mel filter bank features (CCA-mel), acoustically grounded word embeddings [31] (cca-agwe) 1 and GloVe word embeddings [32] (cca-glove) as ways to assess the local acoustic, word-level acoustic-phonetic, and word meaning information encoded in the W2V2 representations respectively.…”
Section: Analysis Methodsmentioning
confidence: 99%
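The CCA-based comparison quoted above can be made concrete with plain (unweighted) mean CCA similarity between two representation matrices; PWCCA additionally weights the canonical directions by how much of the representation projects onto them. A minimal numpy sketch, not the cited authors' implementation:

```python
import numpy as np

def cca_similarity(X, Y, eps=1e-10):
    """Mean canonical correlation between two views of n examples.

    X: (n, d1) representations (e.g. W2V2 layer activations).
    Y: (n, d2) representations (e.g. AGWE or GloVe vectors).
    Returns the mean of the canonical correlations in [0, 1].
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases of the column spaces via thin SVD,
    # dropping near-zero-variance directions.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux = Ux[:, Sx > eps * Sx.max()]
    Uy = Uy[:, Sy > eps * Sy.max()]
    # Canonical correlations are the singular values of Ux^T Uy.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(rho.mean())
```

If Y is an invertible linear transform of X the two column spaces coincide and the similarity is 1, which is why CCA-style measures are used to compare layers up to linear maps.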
“…Following previous work [24,25], we use a classification objective as our neural baseline (Fig. 1-a).…”
Section: Phone N-gram Detection Objectivementioning
confidence: 99%
“…Segments can then be efficiently compared by calculating the distance in the embedding space. Given the advantages AWEs have over alignment methods, several AWE models have been proposed [12][13][14][15][16][17][18][19][20][21][22][23]. Many of these are for the supervised setting, using labelled data to train a discriminative model.…”
Section: Introductionmentioning
confidence: 99%
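The quoted advantage of AWEs over alignment methods can be made concrete: dynamic time warping compares two variable-length segments in O(T1·T2) frame operations, while fixed-dimensional embeddings are compared in O(d). A hedged sketch of both costs (the embedder itself is assumed to come from one of the cited AWE models and is not shown):

```python
import numpy as np

def dtw_cost(a, b):
    """O(T1*T2) alignment cost between feature sequences (T, d)."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[T1, T2])

def awe_distance(e1, e2):
    """O(d) cosine distance between fixed-dimensional embeddings."""
    return 1.0 - float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

The practical payoff is at search time: embedding every segment once lets pairwise comparison reduce to a vector distance, so approximate nearest-neighbor indexes apply, which DTW does not directly support.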