Interspeech 2021
DOI: 10.21437/interspeech.2021-1312

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Cited by 52 publications (16 citation statements)
References: 0 publications
“…Boggust et al. (2019) sample audio-visual fragments from cooking videos; however, their grounded model treats video frames as still images and discards their temporal order. Rouditchenko et al. (2020) integrate the temporal information when encoding videos from the Howto100m dataset (Miech et al., 2019), and perform better than previous work in language and video clip retrieval. Models trained on such instructional video datasets often do not generalize well to other domains.…”
Section: Spoken Language Grounded in Video
Citation type: mentioning (confidence: 86%)
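
The contrast drawn in this statement, treating frames as unordered still images versus encoding their temporal order, can be illustrated with a minimal sketch (hypothetical PyTorch code; not the architecture of either cited model):

import torch
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Toy clip encoder: order-agnostic mean pooling of frame features
    versus an order-aware recurrent aggregation (a generic sketch)."""

    def __init__(self, frame_dim=512, embed_dim=256, temporal=True):
        super().__init__()
        self.temporal = temporal
        self.gru = nn.GRU(frame_dim, frame_dim, batch_first=True)
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frame_feats):            # (batch, n_frames, frame_dim)
        if self.temporal:
            _, h = self.gru(frame_feats)       # final hidden state depends on frame order
            pooled = h[-1]                     # (batch, frame_dim)
        else:
            pooled = frame_feats.mean(dim=1)   # frames treated as unordered stills
        return self.proj(pooled)               # (batch, embed_dim)

For example, ClipEncoder(temporal=False)(torch.randn(4, 16, 512)) pools 16 frames with no regard to their order, while the default order-aware variant runs them through the recurrent layer first.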
“…Attempts to model or simulate the acquisition of spoken language via grounding in the visual modality date to the beginning of this century (Roy and Pentland, 2002) but have gained momentum recently with the revival of neural networks (e.g. Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Harwath et al., 2018; Merkx et al., 2019; Havard et al., 2019a; Rouditchenko et al., 2020; Khorrami and Räsänen, 2021; Peng and Harwath, 2021). Current approaches work well enough from an applied point of view, but most are not generalizable to real-life situations that humans or adaptive artificial agents experience.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…They incorporated visual, action, text and object features for cross-modal representation learning. Recently, AVLnet [176] and MMV [2] considered three modalities (visual, audio and language) for self-supervised representation learning. This research direction is also increasingly getting more attention due to the success of contrastive learning on many vision and language tasks and the abundance of unlabeled multimodal video data on platforms such as YouTube, Instagram or Flickr.…”
Section: Multi-modality
Citation type: mentioning (confidence: 99%)
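
As a rough illustration of the contrastive objective such tri-modal models build on, the sketch below computes a symmetric InfoNCE-style loss between paired embeddings from two modalities. The actual losses used in AVLnet and MMV differ in detail (e.g., how negatives are sampled and how the three modalities are paired), so this is only an assumed generic formulation.

import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE between paired embeddings x and y, each of shape
    (batch, dim); matching pairs lie on the diagonal of the similarity matrix."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One common recipe for three modalities is to sum the pairwise terms:
# loss = info_nce(audio, video) + info_nce(audio, text) + info_nce(video, text)

Which pairs are included, and whether a shared or pairwise embedding space is used, is a design choice that varies across models.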
“…In parallel to ZeroSpeech, research on so-called visually grounded speech (VGS) models has given rise to an array of metrics to understand what these models are learning. In short, modern VGS models (e.g., Harwath et al., 2019; Harwath et al., 2018) are neural networks that learn statistical correspondences between visual images (or videos; Rouditchenko et al., 2021) and concurrent speech related to the contents of the visual input. Since these models demonstrate an emerging understanding of the semantics between auditory speech and the visual world without ever being explicitly taught about the structure of either modality, researchers have become interested in whether the internal representations of these models also show signs of emergent linguistic organization.…”
Section: Model Evaluation on Multiple Criteria
Citation type: mentioning (confidence: 99%)
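
To make the learned "statistical correspondences" concrete, VGS models are commonly evaluated by cross-modal retrieval: ranking candidate images or clips by similarity to a spoken query in the shared embedding space. The snippet below is a hypothetical retrieval sketch, not any specific model's evaluation code.

import torch
import torch.nn.functional as F

def retrieve(speech_emb, visual_embs, k=5):
    """Rank candidate visual embeddings (n, dim) against a single spoken-query
    embedding (dim,) by cosine similarity and return the top-k indices."""
    sims = F.cosine_similarity(speech_emb.unsqueeze(0), visual_embs, dim=-1)  # (n,)
    return sims.topk(k).indices

# Recall@k is then the fraction of queries whose ground-truth image or clip
# appears among the top-k retrieved candidates.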