Proceedings of the 24th Conference on Computational Natural Language Learning 2020
DOI: 10.18653/v1/2020.conll-1.22

Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

Abstract: The language acquisition literature shows that children do not build their lexicon by segmenting the spoken input into phonemes and then building words up from them, but rather adopt a top-down approach: they start by segmenting word-like units, which they then break down into smaller units. This suggests that the ideal way of learning a language is to start from full semantic units. In this paper, we investigate whether this is also the case for a neural model of Visually Grounded Speech trained on a speech-image re…
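For readers unfamiliar with how such Visually Grounded Speech models are trained, the sketch below shows a generic symmetric InfoNCE-style contrastive objective of the kind commonly used for speech-image retrieval. It is a minimal illustration under assumed names and shapes, not the paper's actual implementation.

```python
# Minimal sketch of a contrastive speech-image retrieval objective.
# All names are illustrative; the paper's architecture and loss may differ.
import torch
import torch.nn.functional as F

def retrieval_loss(speech_emb: torch.Tensor, image_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (speech, image) embeddings."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; score retrieval in both directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```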

Cited by 9 publications (10 citation statements)
References 22 publications
“…Starting from the work of Synnaeve, Versteegh, and Dupoux (2014) and Harwath and Glass (2015), researchers have studied the ability of models to learn to recognize the structure of spoken language, such as words and sub-word units, by training the models to associate speech waveforms with contextually relevant visual inputs. These works have looked at a variety of tasks, such as speech-image retrieval (Harwath, Torralba, and Glass 2016; Chrupała 2019; Ilharco, Zhang, and Baldridge 2019; Mortazavi 2020; Sanabria, Waters, and Baldridge 2021), automatic speech recognition (Sun, Harwath, and Glass 2016; Palaskar, Sanabria, and Metze 2018; Hsu, Harwath, and Glass 2019), word detection and localization (Kamper et al. 2017; Harwath and Glass 2017; Merkx, Frank, and Ernestus 2019; Wang and Hasegawa-Johnson 2020; Olaleye and Kamper 2021), hierarchical linguistic unit analysis (Chrupała, Gelderloos, and Alishahi 2017; Harwath, Hsu, and Glass 2020), cross-modality alignment (Wang et al. 2021; Khorrami and Räsänen 2021), speech segmentation, speech generation (Hsu et al. 2021b), and learning multilingual speech representations (Harwath, Chuang, and Glass 2018; Kamper and Roth 2018; Havard, Chevrot, and Besacier 2020; Ohishi et al. 2020). In this paper, we study the recently proposed FaST-VGS (Peng and Harwath 2021) speech-image retrieval model and propose a novel extension of the model that incorporates a wav2vec 2.0-style masked language modeling objective in a multi-task learning framework.…”
Section: Related Work
confidence: 99%
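As a rough illustration of the multi-task setup this statement describes, the sketch below combines the generic retrieval loss from the earlier sketch with a masked-prediction term computed only at masked frames. The `model` interface, its output signature, and the `alpha` weight are all assumptions for illustration, not the authors' published code.

```python
# Hypothetical multi-task training step: contrastive retrieval objective plus
# a wav2vec 2.0-style masked-prediction objective. Assumes `retrieval_loss`
# from the earlier sketch; everything else here is an illustrative assumption.
import torch
import torch.nn.functional as F

def multitask_step(model, speech, images, alpha: float = 0.5) -> torch.Tensor:
    """One training step returning the weighted sum of both losses."""
    # Assumed model outputs: pooled embeddings for each modality, plus
    # per-frame logits/targets and a boolean mask over masked positions.
    speech_emb, image_emb, frame_logits, frame_targets, mask = model(speech, images)
    l_retrieval = retrieval_loss(speech_emb, image_emb)
    # Masked prediction is evaluated only at the masked frames.
    l_masked = F.cross_entropy(frame_logits[mask], frame_targets[mask])
    return l_retrieval + alpha * l_masked
```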
“…As a result, they found that peaks in the change-rate of activation magnitudes of the early CNN layers were highly correlated with transitions between phone segments. In contrast to studying whether the models learn to segment, Havard et al. (2020) studied how the performance of VGS models improves if linguistic unit segmentation is provided as side information to the model during training. They found that the explicit introduction of segmentation cues led to substantial performance gains in the audiovisual retrieval task compared to regular VGS training.…”
Section: Evidence for Language Representations in VGS Models
confidence: 99%
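The sketch below illustrates the kind of analysis this statement describes: compute per-frame activation magnitudes, take their frame-to-frame change-rate, and pick its peaks for comparison against phone boundaries. Function and variable names are illustrative, not taken from the cited work.

```python
# Illustrative change-rate peak analysis over CNN activations.
# `activations` is assumed to be a (T, D) array of per-frame features.
import numpy as np
from scipy.signal import find_peaks

def change_rate_peaks(activations: np.ndarray, min_distance: int = 3) -> np.ndarray:
    """Return frame indices where the activation change-rate peaks."""
    magnitude = np.linalg.norm(activations, axis=1)  # per-frame L2 magnitude
    change_rate = np.abs(np.diff(magnitude))         # frame-to-frame change
    peaks, _ = find_peaks(change_rate, distance=min_distance)
    return peaks + 1  # np.diff shifts indices back by one frame
```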
“…Hierarchical modeling has also been applied to show the effect of introducing phone, syllable, or word boundaries in spoken captions (Havard et al., 2020) and, with compact bilinear pooling, in visual question answering (Fukui et al., 2016). Some work presents a Bayesian probabilistic formulation to learn referential grounding in dialog (Liu et al., 2014), user preferences (Cadilhac et al., 2013), and color descriptions (McMahan and Stone, 2015; Andreas and Klein, 2014).…”
Section: Stratification
confidence: 99%
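As a minimal illustration of using prior unit boundaries as side information, the sketch below mean-pools frame-level features within given phone, syllable, or word segments before any further processing. The function name, shapes, and boundary encoding are assumptions, not code from the cited papers.

```python
# Illustrative boundary-informed pooling: collapse (T, D) frame features into
# one vector per linguistic unit, given segment end indices as side information.
import torch

def pool_segments(frames: torch.Tensor, boundaries) -> torch.Tensor:
    """Mean-pool frame features into one vector per segment.

    `boundaries` lists segment end indices, e.g. [5, 12, 20] for three units.
    """
    pooled, start = [], 0
    for end in boundaries:
        pooled.append(frames[start:end].mean(dim=0))
        start = end
    return torch.stack(pooled)  # (num_segments, D)
```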