Interspeech 2022
DOI: 10.21437/interspeech.2022-10652
Word Discovery in Visually Grounded, Self-Supervised Speech Models

Abstract: In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, …
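The abstract's minimum-cut idea can be made concrete with a small sketch. The code below is a minimal, hypothetical rendering of min-cut segmentation over a frame-level self-similarity graph, not the paper's exact algorithm: it assumes a matrix `feats` of frame embeddings (e.g. taken from an intermediate layer of the speech model) and a pre-specified `num_segments`, and uses a normalized-cut dynamic program to place contiguous segment boundaries.

```python
import numpy as np

def normalized_min_cut_segmentation(feats, num_segments):
    """Segment a (T, D) sequence of frame features into contiguous segments
    by minimizing a normalized-cut objective over the frame similarity graph.
    Illustrative sketch only; `feats` and `num_segments` are assumptions."""
    T = feats.shape[0]
    # Cosine similarity between all frame pairs, clipped to be non-negative.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(normed @ normed.T, 0.0, None)

    # 2-D prefix sums so block sums sim[a:b, c:d] are O(1) queries.
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = np.cumsum(np.cumsum(sim, axis=0), axis=1)

    def block(a, b, c, d):  # sum of sim[a:b, c:d]
        return P[b, d] - P[a, d] - P[b, c] + P[a, c]

    def ncut_term(a, b):
        # cut(A, V \ A) / assoc(A, V) for the segment A = frames [a, b).
        assoc_all = block(a, b, 0, T)
        within = block(a, b, a, b)
        return (assoc_all - within) / (assoc_all + 1e-8)

    # Dynamic program over (segments used, end frame).
    INF = float("inf")
    cost = np.full((num_segments + 1, T + 1), INF)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for end in range(k, T + 1):
            for start in range(k - 1, end):
                c = cost[k - 1, start] + ncut_term(start, end)
                if c < cost[k, end]:
                    cost[k, end] = c
                    back[k, end] = start

    # Backtrack to recover the frame index where each segment starts.
    boundaries, end = [], T
    for k in range(num_segments, 0, -1):
        boundaries.append(back[k, end])
        end = back[k, end]
    return sorted(boundaries)[1:]  # drop the trivial boundary at frame 0
```

The returned boundaries are frame indices; mapping them to time (roughly 20 ms per frame for HuBERT-style encoders) gives candidate syllable boundaries. A real system would also have to choose the number of segments, which is fixed here purely for illustration.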

Cited by 26 publications (31 citation statements)
References 39 publications
“…Aside from the MFCC features, which are expected to be a distant last, all other features yield comparable WER results except layer averaging. As found in other studies [33], the topmost layers of HuBERT are not the best feature representations. Averaging layers 6, 7, and 8 led to slightly better results.…”
Section: Frame Level Units For Encoder-only Pretraining (supporting)
confidence: 62%
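For context, the layer averaging mentioned in this snippet can be reproduced in a few lines. This is a minimal sketch using the Hugging Face `transformers` HuBERT implementation; the checkpoint name and the convention that `hidden_states[1..12]` are the transformer layer outputs are assumptions of this example, not details from the cited paper.

```python
import torch
from transformers import HubertModel

# Load a base HuBERT checkpoint (assumed here; any compatible checkpoint works).
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

waveform = torch.randn(1, 16000)  # 1 second of 16 kHz audio as a placeholder input
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# hidden_states[0] is the CNN feature projection; [1..12] are the transformer layers.
# Stack layers 6, 7, and 8 and average them into (B, T, D) frame-level features.
feats = torch.stack(out.hidden_states[6:9], dim=0).mean(dim=0)
print(feats.shape)
```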
“…Note that the images are from the MS-COCO dataset [14]. […] discover (localize, segment, and identify) spoken words based on visually grounded models [20]. Unfortunately, these studies mainly focused only on monolingual settings.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our self-supervised VGS models follow the structure of Peng et al. [20]. The model has a dual-encoder architecture, including (1) an audio encoder based on a self-supervised speech model such as HuBERT [19] or Wav2Vec2.0 (W2V2) [3] and (2) an image encoder based on a self-supervised vision transformer such as DINO-ViT [27].…”
Section: Self-supervised Visually Grounded Speech Model (mentioning)
confidence: 99%
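The dual-encoder structure this snippet describes can be sketched as follows. This is a hedged illustration, not the exact configuration of Peng et al. [20]: the checkpoint names (`facebook/hubert-base-ls960`, `facebook/dino-vits16`), the pooling choices, and the symmetric InfoNCE loss are assumptions made for the example.

```python
import torch
import torch.nn as nn
from transformers import HubertModel, ViTModel

class DualEncoderVGS(nn.Module):
    """Sketch of a dual-encoder visually grounded speech model: a self-supervised
    audio encoder and a self-supervised image encoder projected into a shared
    embedding space and trained contrastively on paired audio and images."""

    def __init__(self, embed_dim=512):
        super().__init__()
        self.audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.image_encoder = ViTModel.from_pretrained("facebook/dino-vits16")
        self.audio_proj = nn.Linear(self.audio_encoder.config.hidden_size, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def forward(self, waveforms, pixel_values):
        # Mean-pool audio frames; use the [CLS] token for the image.
        a = self.audio_encoder(waveforms).last_hidden_state.mean(dim=1)
        v = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        a = nn.functional.normalize(self.audio_proj(a), dim=-1)
        v = nn.functional.normalize(self.image_proj(v), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE over matched audio-image pairs within a batch."""
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (nn.functional.cross_entropy(logits, targets)
                  + nn.functional.cross_entropy(logits.t(), targets))
```

Matched audio-image pairs in a batch are pulled together in the shared space while mismatched pairs are pushed apart, providing the visual grounding signal that the abstract credits for the emergence of syllabic structure in the audio encoder.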
“…Relation to prior work. There are several previous studies that investigate SSL speech model compression [28, 20, 29, 30] through sparsity, knowledge distillation, attention re-use, or their combinations. Our proposed study differs from them in several aspects.…”
Section: Related Work (mentioning)
confidence: 99%