2021 · Preprint · DOI: 10.31234/osf.io/37zna
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation

Abstract: Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While the gradual development of such capabilities is unquestionable, the exact nature of these skills and of the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent, referentially ambiguous visual input. These mod…

Cited by 11 publications (20 citation statements) · References: 85 publications
“…This suggests that utterances are implicitly segmented into phonemes within this architecture. These findings were partially corroborated by Khorrami and Räsänen (2021), with the proviso of rather lower scores and the fact that implicit phoneme segmentation is also present to a large extent in activations from untrained models, and thus is not fully due to learning, but simply to network dynamics.…”
Section: Phonological Form (mentioning)
confidence: 73%
“…An alternative approach which is easier to apply to human speech was proposed by Khorrami and Räsänen (2021) and involves using automatically computed pairwise sentence similarities derived from a text-based model as the proxy for human similarity judgments of semantic relatedness. The automatic semantic relatedness score (SRS) is based on word-word similarity scores as given by Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) embedding vectors and defined as follows:…”
Section: Evaluation Based On Word2vec (mentioning)
confidence: 99%
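The exact SRS formula is truncated in the excerpt above ("defined as follows:…"), so the sketch below is only an illustrative assumption: it scores two tokenized sentences by symmetrically averaging each word's best cosine match in the other sentence, using Word2Vec-style embedding vectors supplied as a plain dictionary. All function and variable names here are hypothetical, not taken from the cited paper.

```python
# Illustrative sketch of a sentence-level semantic relatedness score built from
# word-word cosine similarities of Word2Vec-style embedding vectors.
# NOTE: the exact SRS definition is truncated in the excerpt above; this soft
# word-alignment scheme (average of best-matching word similarities) is only
# one plausible instantiation, not the authors' formula.

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def relatedness(sent_a: list[str], sent_b: list[str],
                embeddings: dict[str, np.ndarray]) -> float:
    """Score two tokenized sentences by averaging, over the words of one
    sentence, the maximum cosine similarity to any word of the other sentence,
    and symmetrizing the result."""
    vecs_a = [embeddings[w] for w in sent_a if w in embeddings]
    vecs_b = [embeddings[w] for w in sent_b if w in embeddings]
    if not vecs_a or not vecs_b:
        return 0.0  # no in-vocabulary words to compare
    a_to_b = np.mean([max(cosine(u, v) for v in vecs_b) for u in vecs_a])
    b_to_a = np.mean([max(cosine(u, v) for v in vecs_a) for u in vecs_b])
    return float((a_to_b + b_to_a) / 2.0)  # symmetric score in [-1, 1]


# Toy usage with made-up 3-d "embeddings":
emb = {"dog": np.array([1.0, 0.2, 0.0]),
       "puppy": np.array([0.9, 0.3, 0.1]),
       "car": np.array([0.0, 0.1, 1.0])}
print(relatedness(["a", "dog", "runs"], ["the", "puppy", "runs"], emb))
```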
“…Our goal is to go beyond these analyses to test specific semantic phenomena, as we did here with the Abstract Scenes dataset. Another step towards more naturalistic input is the use of speech input instead of text (Khorrami and Räsänen, 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…As commonly applied in other multimodal XSL work (Khorrami and Räsänen, 2021), we assume that the visual system of the learner has already been developed to some degree and thus use a CNN pre-trained on ImageNet (Russakovsky et al., 2015) (but discard the final classification layer) to encode the images. Specifically, we use a ResNet-50 (He et al., 2016) to encode the images and train a linear embedding layer that maps the output of the pre-final layer of the CNN into the joint embedding space.…”
Section: Model (mentioning)
confidence: 99%
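The image-encoder design quoted above (ImageNet pre-trained ResNet-50 with the classifier head discarded, followed by a trainable linear layer into the joint embedding space) can be sketched roughly as follows in PyTorch. The embedding dimension, the frozen backbone, and the L2 normalization are assumptions for illustration, not details taken from the cited work; the string-based weights argument assumes torchvision ≥ 0.13.

```python
# Minimal sketch of a pre-trained CNN image encoder with a trainable linear
# projection into a joint embedding space. Embedding size and freezing policy
# are assumptions, not values from the cited paper.

import torch
import torch.nn as nn
import torchvision


class ImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512, freeze_backbone: bool = True):
        super().__init__()
        # ImageNet pre-trained backbone; drop the final classification layer.
        self.backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Identity()           # keep 2048-d pooled features
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        # Only this projection into the joint embedding space is trained.
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)               # (batch, 2048)
        z = self.proj(feats)                        # (batch, embed_dim)
        return nn.functional.normalize(z, dim=-1)   # unit-norm embeddings


# Toy usage: a batch of 2 RGB images at 224x224.
enc = ImageEncoder(embed_dim=512)
print(enc(torch.randn(2, 3, 224, 224)).shape)       # torch.Size([2, 512])
```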
“…As commonly applied in other multimodal XSL work (Chrupała et al., 2015; Khorrami and Räsänen, 2021). While Vinyals et al. (2015) fed the image features into the LSTM only at the first timestep, here we feed them at every timestep, as this was shown to substantially improve performance on our evaluation. An explanation could be that, when feeding the image features only at the first timestep, the model gradually forgets about the input and relies more on the language modeling task of next-word prediction, which does not aid the learning of visually grounded semantics.…”
mentioning
confidence: 99%
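A rough sketch of the design choice described in this footnote, i.e. concatenating the projected image features to the word embedding at every LSTM timestep rather than feeding them only once, might look as follows. All dimensions and class names are assumptions for illustration; this is not the cited implementation.

```python
# Illustrative sketch (not the cited implementation) of conditioning an LSTM
# language model on image features at every timestep: the projected image
# vector is concatenated to the word embedding of each input token, so the
# visual input cannot be "forgotten" across the sequence.

import torch
import torch.nn as nn


class GroundedLSTM(nn.Module):
    def __init__(self, vocab_size: int, img_dim: int = 2048,
                 word_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.img_proj = nn.Linear(img_dim, word_dim)
        # Input at each step = [word embedding ; projected image features].
        self.lstm = nn.LSTM(2 * word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # next-word prediction

    def forward(self, tokens: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) word ids; img_feats: (batch, img_dim)
        words = self.embed(tokens)                               # (B, T, word_dim)
        img = self.img_proj(img_feats).unsqueeze(1)              # (B, 1, word_dim)
        img = img.expand(-1, words.size(1), -1)                  # repeat per timestep
        hidden, _ = self.lstm(torch.cat([words, img], dim=-1))   # (B, T, hidden_dim)
        return self.out(hidden)                                  # (B, T, vocab_size)


# Toy usage: batch of 2 captions of length 5 with a 1000-word vocabulary.
model = GroundedLSTM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 5)), torch.randn(2, 2048))
print(logits.shape)   # torch.Size([2, 5, 1000])
```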