Interspeech 2022
DOI: 10.21437/interspeech.2022-10652
Word Discovery in Visually Grounded, Self-Supervised Speech Models

Abstract: In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, …
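The abstract's minimum-cut idea can be made concrete with a small sketch. The code below is a minimal, hypothetical rendering of min-cut segmentation over a frame-level self-similarity graph, not the paper's exact algorithm: it assumes a matrix `feats` of frame embeddings (e.g. taken from an intermediate layer of the speech model) and a pre-specified `num_segments`, and uses a normalized-cut dynamic program to place contiguous segment boundaries.

```python
import numpy as np

def normalized_min_cut_segmentation(feats, num_segments):
    """Segment a (T, D) sequence of frame features into contiguous segments
    by minimizing a normalized-cut objective over the frame similarity graph.
    Illustrative sketch only; `feats` and `num_segments` are assumptions."""
    T = feats.shape[0]
    # Cosine similarity between all frame pairs, clipped to be non-negative.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(normed @ normed.T, 0.0, None)

    # 2-D prefix sums so block sums sim[a:b, c:d] are O(1) queries.
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = np.cumsum(np.cumsum(sim, axis=0), axis=1)

    def block(a, b, c, d):  # sum of sim[a:b, c:d]
        return P[b, d] - P[a, d] - P[b, c] + P[a, c]

    def ncut_term(a, b):
        # cut(A, V \ A) / assoc(A, V) for the segment A = frames [a, b).
        assoc_all = block(a, b, 0, T)
        within = block(a, b, a, b)
        return (assoc_all - within) / (assoc_all + 1e-8)

    # Dynamic program over (segments used, end frame).
    INF = float("inf")
    cost = np.full((num_segments + 1, T + 1), INF)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for end in range(k, T + 1):
            for start in range(k - 1, end):
                c = cost[k - 1, start] + ncut_term(start, end)
                if c < cost[k, end]:
                    cost[k, end] = c
                    back[k, end] = start

    # Backtrack to recover the frame index where each segment starts.
    boundaries, end = [], T
    for k in range(num_segments, 0, -1):
        boundaries.append(back[k, end])
        end = back[k, end]
    return sorted(boundaries)[1:]  # drop the trivial boundary at frame 0
```

The returned boundaries are frame indices; mapping them to time (roughly 20 ms per frame for HuBERT-style encoders) gives candidate syllable boundaries. A real system would also have to choose the number of segments, which is fixed here purely for illustration.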

Cited by 26 publications (31 citation statements)
References 39 publications
“…Aside from the MFCC features, which are expected to be a distant last, all other features yield comparable WER results except layer averaging. As found in other studies [33], the topmost layers of HuBERT are not the best feature representations. Averaging layers 6, 7, and 8 led to slightly better results.…”
Section: Frame Level Units For Encoder-only Pretraining (supporting)
confidence: 62%
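For context, the layer averaging mentioned in this snippet can be reproduced in a few lines. This is a minimal sketch using the Hugging Face `transformers` HuBERT implementation; the checkpoint name and the convention that `hidden_states[1..12]` are the transformer layer outputs are assumptions of this example, not details from the cited paper.

```python
import torch
from transformers import HubertModel

# Load a base HuBERT checkpoint (assumed here; any compatible checkpoint works).
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

waveform = torch.randn(1, 16000)  # 1 second of 16 kHz audio as a placeholder input
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# hidden_states[0] is the CNN feature projection; [1..12] are the transformer layers.
# Stack layers 6, 7, and 8 and average them into (B, T, D) frame-level features.
feats = torch.stack(out.hidden_states[6:9], dim=0).mean(dim=0)
print(feats.shape)
```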
“…Note that the images are from the MS-COCO dataset [14]. […] discover (localize, segment, and identify) spoken words based on visually grounded models [20]. Unfortunately, these studies mainly focused only on monolingual settings.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our self-supervised VGS models follow the structure of Peng et al. [20]. The model has a dual-encoder architecture, including (1) an audio encoder based on a self-supervised speech model such as HuBERT [19] or Wav2Vec2.0 (W2V2) [3] and (2) an image encoder based on a self-supervised vision transformer such as DINO-ViT [27].…”
Section: Self-supervised Visually Grounded Speech Model (mentioning)
confidence: 99%
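The dual-encoder structure this snippet describes can be sketched as follows. This is a hedged illustration, not the exact configuration of Peng et al. [20]: the checkpoint names (`facebook/hubert-base-ls960`, `facebook/dino-vits16`), the pooling choices, and the symmetric InfoNCE loss are assumptions made for the example.

```python
import torch
import torch.nn as nn
from transformers import HubertModel, ViTModel

class DualEncoderVGS(nn.Module):
    """Sketch of a dual-encoder visually grounded speech model: a self-supervised
    audio encoder and a self-supervised image encoder projected into a shared
    embedding space and trained contrastively on paired audio and images."""

    def __init__(self, embed_dim=512):
        super().__init__()
        self.audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.image_encoder = ViTModel.from_pretrained("facebook/dino-vits16")
        self.audio_proj = nn.Linear(self.audio_encoder.config.hidden_size, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def forward(self, waveforms, pixel_values):
        # Mean-pool audio frames; use the [CLS] token for the image.
        a = self.audio_encoder(waveforms).last_hidden_state.mean(dim=1)
        v = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        a = nn.functional.normalize(self.audio_proj(a), dim=-1)
        v = nn.functional.normalize(self.image_proj(v), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE over matched audio-image pairs within a batch."""
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (nn.functional.cross_entropy(logits, targets)
                  + nn.functional.cross_entropy(logits.t(), targets))
```

Matched audio-image pairs in a batch are pulled together in the shared space while mismatched pairs are pushed apart, providing the visual grounding signal that the abstract credits for the emergence of syllabic structure in the audio encoder.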
“…Relation to prior work. There are several previous studies that investigate SSL speech model compression [28, 20, 29, 30] through sparsity, knowledge distillation, attention re-use, or their combinations. Our proposed study differs from them in several aspects.…”
Section: Related Work (mentioning)
confidence: 99%