Interspeech 2021
DOI: 10.21437/interspeech.2021-1312

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Cited by 52 publications (16 citation statements)
References: 0 publications
“…Boggust et al. (2019) sample audio-visual fragments from cooking videos; however, their grounded model treats video frames as still images and discards their temporal order. Rouditchenko et al. (2020) integrate the temporal information when encoding videos from the Howto100m dataset (Miech et al., 2019), and perform better than previous work in language and video clip retrieval. Models trained on such instructional video datasets often do not generalize well to other domains.…”
Section: Spoken Language Grounded in Video
Citation type: mentioning (confidence: 86%)
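
The contrast drawn in this statement, treating frames as unordered still images versus encoding their temporal order, can be illustrated with a minimal sketch (hypothetical PyTorch code; not the architecture of either cited model):

import torch
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Toy clip encoder: order-agnostic mean pooling of frame features
    versus an order-aware recurrent aggregation (a generic sketch)."""

    def __init__(self, frame_dim=512, embed_dim=256, temporal=True):
        super().__init__()
        self.temporal = temporal
        self.gru = nn.GRU(frame_dim, frame_dim, batch_first=True)
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frame_feats):            # (batch, n_frames, frame_dim)
        if self.temporal:
            _, h = self.gru(frame_feats)       # final hidden state depends on frame order
            pooled = h[-1]                     # (batch, frame_dim)
        else:
            pooled = frame_feats.mean(dim=1)   # frames treated as unordered stills
        return self.proj(pooled)               # (batch, embed_dim)

For example, ClipEncoder(temporal=False)(torch.randn(4, 16, 512)) pools 16 frames with no regard to their order, while the default order-aware variant runs them through the recurrent layer first.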
“…Attempts to model or simulate the acquisition of spoken language via grounding in the visual modality date to the beginning of this century (Roy and Pentland, 2002) but have gained momentum recently with the revival of neural networks (e.g. Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Harwath et al., 2018; Merkx et al., 2019; Havard et al., 2019a; Rouditchenko et al., 2020; Khorrami and Räsänen, 2021; Peng and Harwath, 2021). Current approaches work well enough from an applied point of view, but most are not generalizable to real-life situations that humans or adaptive artificial agents experience.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…They incorporated visual, action, text and object features for cross-modal representation learning. Recently, AVLnet [176] and MMV [2] considered three modalities (visual, audio and language) for self-supervised representation learning. This research direction is also increasingly getting more attention due to the success of contrastive learning on many vision and language tasks and the abundance of unlabeled multimodal video data on platforms such as YouTube, Instagram or Flickr.…”
Section: Multi-modality
Citation type: mentioning (confidence: 99%)
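
As a rough illustration of the contrastive objective such tri-modal models build on, the sketch below computes a symmetric InfoNCE-style loss between paired embeddings from two modalities. The actual losses used in AVLnet and MMV differ in detail (e.g., how negatives are sampled and how the three modalities are paired), so this is only an assumed generic formulation.

import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE between paired embeddings x and y, each of shape
    (batch, dim); matching pairs lie on the diagonal of the similarity matrix."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One common recipe for three modalities is to sum the pairwise terms:
# loss = info_nce(audio, video) + info_nce(audio, text) + info_nce(video, text)

Which pairs are included, and whether a shared or pairwise embedding space is used, is a design choice that varies across models.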
“…In parallel to ZeroSpeech, research on so-called visually grounded speech (VGS) models has given rise to an array of metrics to understand what these models are learning. In short, modern VGS models (e.g., Harwath et al., 2019; Harwath et al., 2018) are neural networks that learn statistical correspondences between visual images (or videos; Rouditchenko et al., 2021) and concurrent speech related to the contents of the visual input. Since these models demonstrate an emerging understanding of the semantics between auditory speech and the visual world without ever being explicitly taught about the structure of either modality, researchers have become interested in whether the internal representations of these models also show signs of emergent linguistic organization.…”
Section: Model Evaluation on Multiple Criteria
Citation type: mentioning (confidence: 99%)
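
To make the learned "statistical correspondences" concrete, VGS models are commonly evaluated by cross-modal retrieval: ranking candidate images or clips by similarity to a spoken query in the shared embedding space. The snippet below is a hypothetical retrieval sketch, not any specific model's evaluation code.

import torch
import torch.nn.functional as F

def retrieve(speech_emb, visual_embs, k=5):
    """Rank candidate visual embeddings (n, dim) against a single spoken-query
    embedding (dim,) by cosine similarity and return the top-k indices."""
    sims = F.cosine_similarity(speech_emb.unsqueeze(0), visual_embs, dim=-1)  # (n,)
    return sims.topk(k).indices

# Recall@k is then the fraction of queries whose ground-truth image or clip
# appears among the top-k retrieved candidates.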