2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru51503.2021.9688093
Layer-Wise Analysis of a Self-Supervised Speech Representation Model

Abstract: Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive empirical successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of recent models. Specifically, we measure acoustic, phonetic, and word-level properties encoded in individual la…

Cited by 116 publications (82 citation statements)
References 46 publications
“…In Table 2, we first note that the highest overall performance for the Area and Boundary metrics is achieved by the VG-HuBERT3 model, while the highest performance for the Word metrics is achieved by the VG-HuBERT4 model. For both VG-HuBERT and VG-W2V2, we find that reinitializing the last few layers before training on the visual grounding task is highly beneficial, which is in line with the results found by [32]. A more extensive comparison of layer reinitialization is shown in Table 3.…”
Section: Methods (supporting)
confidence: 86%
“…Although increasing the number of clusters does lead to a larger number of detectors (Figure 3), we see significant diminishing returns and thus fix K = 4096 in all of our experiments, unless stated otherwise. In Figure 4, we see that for all models, word detection performance is best in the middle and upper half of the model, which is also consistent with the analysis of [32]. Finally, we compare the word detection performance of VG-HuBERT when using oracle word boundaries to determine segments (rather than the model's thresholded self-attention), which we counterintuitively find hurts the model's word detection ability. Combined with the observation that the attention segments tend to concentrate at the nucleus of words, we hypothesize that the model's contextualization is pushing word identity information towards the temporal center of each word.…”
Section: Methods (supporting)
confidence: 76%
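The cluster-based word detectors discussed in the citation above group segment-level embeddings into K clusters, so that clusters which fire consistently on one word act as detectors. A minimal Lloyd's k-means sketch in plain NumPy (a hypothetical helper for illustration, not the cited authors' implementation; K would be far larger in practice, e.g. the 4096 used above):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means: returns (centroids, labels) for rows of X."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned embeddings
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

On well-separated embeddings (e.g. two tight Gaussian blobs), the two recovered clusters align with the blobs, which is the behavior a word detector relies on when a cluster fires predominantly for a single word type.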
“…The effect is very pronounced for speech and NLP, while for vision there is still a slight advantage to predicting more than a single layer (Pasad et al., 2021).…”
Section: Ablations (mentioning)
confidence: 99%
“…Many self-supervised speech models have been proposed in previous studies. It has been observed that different layers of SSL-based models capture different information, such as speaker identity, content, and semantics [35][36][37][38]. In particular, the middle and higher layers of SSL-based models tend to capture richer linguistic information.…”
Section: HuBERT-based Soft Content Encoder (mentioning)
confidence: 99%
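Layer-wise comparisons of the kind cited throughout this page are commonly made with canonical correlation analysis (CCA) between a layer's frame representations and some reference features. A minimal linear-CCA similarity sketch in plain NumPy (a hypothetical helper; the paper itself uses a projection-weighted CCA variant, which this does not reproduce):

```python
import numpy as np

def cca_similarity(X, Y, eps=1e-8):
    """Mean canonical correlation between two views (rows = samples)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    def isqrt(C):
        # Inverse matrix square root of a symmetric PSD covariance
        w, V = np.linalg.eigh(C)
        w = np.clip(w, eps, None)
        return V @ np.diag(w ** -0.5) @ V.T

    n = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    # Canonical correlations are the singular values of the whitened
    # cross-covariance matrix
    T = isqrt(Cxx) @ Cxy @ isqrt(Cyy)
    rho = np.linalg.svd(T, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())
```

A useful property for layer analysis is that this score is invariant to invertible linear transformations of either view, so it compares representational content rather than a particular basis: `cca_similarity(X, X @ R)` stays near 1 for any invertible `R`, while two independent random views score much lower.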