2020
DOI: 10.48550/arxiv.2006.09199
Preprint

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Abstract: Current methods for learning visually grounded language from videos often rely on time-consuming and expensive data collection, such as human annotated textual summaries or machine generated automatic speech recognition transcripts. In this work, we introduce Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. We circumvent the need for annotation and instead learn audiovisual language representations directly from r…
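The abstract describes learning a shared audio-visual embedding space from paired clips in a self-supervised way. As a rough illustration only, the sketch below shows a generic symmetric contrastive objective over paired audio and video clip embeddings; the function name, temperature value, and loss form are assumptions for illustration and not necessarily the objective used in the paper.

```python
# Hypothetical sketch: a symmetric InfoNCE-style contrastive loss over paired
# audio and video clip embeddings. Illustrative only; not the AVLnet objective.
import torch
import torch.nn.functional as F

def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) embeddings of paired clips."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Matching pairs sit on the diagonal; score retrieval in both directions.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)
```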

Cited by 31 publications (44 citation statements)
References 53 publications
“…We use the following instructional video datasets: HowTo100M [13] (1.2M videos), YouCook2 [14] and YouCook-Japanese. For YouCook2, we use 9,586 train clips and 3,350 validation clips as in [12]. We evaluate performance on audio to video clip retrieval and video clip to audio retrieval using the standard recall metrics R@1, R@5, R@10.…”
Section: Methods
Mentioning, confidence: 99%
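The recall metrics named in this citation statement (R@1, R@5, R@10) can be computed from a cross-modal similarity matrix. The sketch below is a generic implementation under the assumption that the i-th audio clip is paired with the i-th video clip; it is not code from the cited work.

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i, j] = score between query i (e.g. audio) and candidate j (video clip).

    Assumes query i's correct match is candidate i (paired clips).
    Returns {k: recall@k} for retrieval in the query -> candidate direction.
    """
    # Rank of the correct candidate for each query (0 = retrieved first).
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(similarity.shape[0])])
    return {k: float(np.mean(ranks < k)) for k in ks}

# Video-clip-to-audio retrieval uses the transposed matrix: recall_at_k(similarity.T)
```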
“…Boggust et al [31] applied an image-caption [4] model to videos, using a single image frame from entire video clips to perform video to audio retrieval. Rouditchenko et al [12] proposed the AVLnet model that pools visual information over entire clips using 2D and 3D visual CNNs. We use AVLnet trained on English HowTo100M videos and apply it to cooking videos in Japanese and images and spoken captions in Japanese and Hindi.…”
Section: Related Work
Mentioning, confidence: 99%
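The pooling described above, where visual information from an entire clip is aggregated from 2D and 3D visual CNN features, can be pictured roughly as follows. This is a hypothetical sketch: the feature dimensions, max pooling, and single projection layer are assumptions for illustration, not the released AVLnet code.

```python
import torch
import torch.nn as nn

class ClipVisualEncoder(nn.Module):
    """Illustrative only: pool per-frame 2D features and per-segment 3D features
    over a clip, concatenate them, and project into a shared embedding space."""

    def __init__(self, dim_2d=2048, dim_3d=2048, embed_dim=4096):
        super().__init__()
        self.proj = nn.Linear(dim_2d + dim_3d, embed_dim)

    def forward(self, feats_2d, feats_3d):
        # feats_2d: (batch, n_frames, dim_2d)    e.g. per-frame 2D-CNN features
        # feats_3d: (batch, n_segments, dim_3d)  e.g. per-segment 3D-CNN features
        pooled_2d = feats_2d.max(dim=1).values   # temporal pooling over the clip
        pooled_3d = feats_3d.max(dim=1).values
        clip_feat = torch.cat([pooled_2d, pooled_3d], dim=-1)
        return self.proj(clip_feat)              # clip-level visual embedding
```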
“…Models like CLIP learn image classification by matching images with their captions, contrastively [130,91,62]. Recent work has explored this paradigm for matching video frames with their transcripts [121], with their audio signal [96,114], or both [3,2]; these works likewise perform well on single-modality tasks like audio classification and activity recognition. These independent encoders can be combined through late fusion [96], yet late fusion is strictly less expressive than our proposed joint encoding (early fusion) approach.…”
Section: Related Work
Mentioning, confidence: 99%
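The late-versus-early fusion contrast drawn in this citation statement can be made concrete with a toy sketch: late fusion scores independently encoded modalities (here a dot product of separate audio and video embeddings), while early fusion feeds both modalities into one joint encoder before scoring. The module names, linear encoders, and shapes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LateFusionScorer(nn.Module):
    """Independent encoders; the modalities interact only via a final dot product."""
    def __init__(self, audio_dim, video_dim, embed_dim=512):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, embed_dim)
        self.video_enc = nn.Linear(video_dim, embed_dim)

    def forward(self, audio, video):
        return (self.audio_enc(audio) * self.video_enc(video)).sum(-1)

class EarlyFusionScorer(nn.Module):
    """Joint encoder: the modalities interact inside the network (early fusion)."""
    def __init__(self, audio_dim, video_dim, hidden=512):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, audio, video):
        return self.joint(torch.cat([audio, video], dim=-1)).squeeze(-1)
```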
“…Though some of these methods adopt expert features including object [79], sound [19,58] and speech [19] information. Compared with the most related work, Frozen [6], our RegionLearner brings significant improvements on text-tovideo retrieval.…”
Section: Pt Dataset
Mentioning, confidence: 99%