2021 · Preprint · DOI: 10.31234/osf.io/37zna
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation

Abstract: Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While the gradual development of such capabilities is unquestionable, the exact nature of these skills and of the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent, referentially ambiguous visual input. These mod…

Cited by 11 publications (20 citation statements) · References: 85 publications
“…This suggests that utterances are implicitly segmented into phonemes within this architecture. These findings were partially corroborated by Khorrami and Räsänen (2021), with the proviso of rather lower scores and the fact that implicit phoneme segmentation is also present to a large extent in activations from untrained models, and thus is not fully due to learning, but simply to network dynamics.…”
Section: Phonological Form (mentioning)
confidence: 73%
“…An alternative approach which is easier to apply to human speech was proposed by Khorrami and Räsänen (2021) and involves using automatically computed pairwise sentence similarities derived from a text-based model as the proxy for human similarity judgments of semantic relatedness. The automatic semantic relatedness score (SRS) is based on word-word similarity scores as given by Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) embedding vectors and defined as follows:…”
Section: Evaluation Based On Word2vec (mentioning)
confidence: 99%
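The exact SRS formula is truncated in the excerpt above ("defined as follows:…"), so the sketch below is only an illustrative assumption: it scores two tokenized sentences by symmetrically averaging each word's best cosine match in the other sentence, using Word2Vec-style embedding vectors supplied as a plain dictionary. All function and variable names here are hypothetical, not taken from the cited paper.

```python
# Illustrative sketch of a sentence-level semantic relatedness score built from
# word-word cosine similarities of Word2Vec-style embedding vectors.
# NOTE: the exact SRS definition is truncated in the excerpt above; this soft
# word-alignment scheme (average of best-matching word similarities) is only
# one plausible instantiation, not the authors' formula.

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def relatedness(sent_a: list[str], sent_b: list[str],
                embeddings: dict[str, np.ndarray]) -> float:
    """Score two tokenized sentences by averaging, over the words of one
    sentence, the maximum cosine similarity to any word of the other sentence,
    and symmetrizing the result."""
    vecs_a = [embeddings[w] for w in sent_a if w in embeddings]
    vecs_b = [embeddings[w] for w in sent_b if w in embeddings]
    if not vecs_a or not vecs_b:
        return 0.0  # no in-vocabulary words to compare
    a_to_b = np.mean([max(cosine(u, v) for v in vecs_b) for u in vecs_a])
    b_to_a = np.mean([max(cosine(u, v) for v in vecs_a) for u in vecs_b])
    return float((a_to_b + b_to_a) / 2.0)  # symmetric score in [-1, 1]


# Toy usage with made-up 3-d "embeddings":
emb = {"dog": np.array([1.0, 0.2, 0.0]),
       "puppy": np.array([0.9, 0.3, 0.1]),
       "car": np.array([0.0, 0.1, 1.0])}
print(relatedness(["a", "dog", "runs"], ["the", "puppy", "runs"], emb))
```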
“…Our goal is to go beyond these analyses to test specific semantic phenomena, as we did here with the Abstract Scenes dataset. Another step towards more naturalistic input is the use of speech input instead of text (Khorrami and Räsänen, 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…As commonly applied in other multimodal XSL work (Khorrami and Räsänen, 2021), we assume that the visual system of the learner has already been developed to some degree and thus use a CNN pre-trained on ImageNet (Russakovsky et al., 2015) (but discard the final classification layer) to encode the images. Specifically, we use a ResNet-50 (He et al., 2016) to encode the images and train a linear embedding layer that maps the output of the pre-final layer of the CNN into the joint embedding space.…”
Section: Model (mentioning)
confidence: 99%
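The image-encoder design quoted above (ImageNet pre-trained ResNet-50 with the classifier head discarded, followed by a trainable linear layer into the joint embedding space) can be sketched roughly as follows in PyTorch. The embedding dimension, the frozen backbone, and the L2 normalization are assumptions for illustration, not details taken from the cited work; the string-based weights argument assumes torchvision ≥ 0.13.

```python
# Minimal sketch of a pre-trained CNN image encoder with a trainable linear
# projection into a joint embedding space. Embedding size and freezing policy
# are assumptions, not values from the cited paper.

import torch
import torch.nn as nn
import torchvision


class ImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512, freeze_backbone: bool = True):
        super().__init__()
        # ImageNet pre-trained backbone; drop the final classification layer.
        self.backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Identity()           # keep 2048-d pooled features
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        # Only this projection into the joint embedding space is trained.
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)               # (batch, 2048)
        z = self.proj(feats)                        # (batch, embed_dim)
        return nn.functional.normalize(z, dim=-1)   # unit-norm embeddings


# Toy usage: a batch of 2 RGB images at 224x224.
enc = ImageEncoder(embed_dim=512)
print(enc(torch.randn(2, 3, 224, 224)).shape)       # torch.Size([2, 512])
```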
“…As commonly applied in other multimodal XSL work (Chrupała et al., 2015; Khorrami and Räsänen, 2021). While Vinyals et al. (2015) fed the image features into the LSTM only at the first timestep, here we feed them at every timestep, as this was shown to substantially improve performance on our evaluation. An explanation could be that, when feeding the image features only at the first timestep, the model gradually forgets about the input and relies more on the language modeling task of next-word prediction, which does not aid the learning of visually grounded semantics.…”
mentioning
confidence: 99%
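A rough sketch of the design choice described in this footnote, i.e. concatenating the projected image features to the word embedding at every LSTM timestep rather than feeding them only once, might look as follows. All dimensions and class names are assumptions for illustration; this is not the cited implementation.

```python
# Illustrative sketch (not the cited implementation) of conditioning an LSTM
# language model on image features at every timestep: the projected image
# vector is concatenated to the word embedding of each input token, so the
# visual input cannot be "forgotten" across the sequence.

import torch
import torch.nn as nn


class GroundedLSTM(nn.Module):
    def __init__(self, vocab_size: int, img_dim: int = 2048,
                 word_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.img_proj = nn.Linear(img_dim, word_dim)
        # Input at each step = [word embedding ; projected image features].
        self.lstm = nn.LSTM(2 * word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # next-word prediction

    def forward(self, tokens: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) word ids; img_feats: (batch, img_dim)
        words = self.embed(tokens)                               # (B, T, word_dim)
        img = self.img_proj(img_feats).unsqueeze(1)              # (B, 1, word_dim)
        img = img.expand(-1, words.size(1), -1)                  # repeat per timestep
        hidden, _ = self.lstm(torch.cat([words, img], dim=-1))   # (B, T, hidden_dim)
        return self.out(hidden)                                  # (B, T, vocab_size)


# Toy usage: batch of 2 captions of length 5 with a 1000-word vocabulary.
model = GroundedLSTM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 5)), torch.randn(2, 2048))
print(logits.shape)   # torch.Size([2, 5, 1000])
```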