2020
DOI: 10.48550/arxiv.2006.09199
Preprint

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Abstract: Current methods for learning visually grounded language from videos often rely on time-consuming and expensive data collection, such as human annotated textual summaries or machine generated automatic speech recognition transcripts. In this work, we introduce Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. We circumvent the need for annotation and instead learn audiovisual language representations directly from r…
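The abstract describes learning a shared audio-visual embedding space from paired clips in a self-supervised way. As a rough illustration only, the sketch below shows a generic symmetric contrastive objective over paired audio and video clip embeddings; the function name, temperature value, and loss form are assumptions for illustration and not necessarily the objective used in the paper.

```python
# Hypothetical sketch: a symmetric InfoNCE-style contrastive loss over paired
# audio and video clip embeddings. Illustrative only; not the AVLnet objective.
import torch
import torch.nn.functional as F

def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) embeddings of paired clips."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Matching pairs sit on the diagonal; score retrieval in both directions.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)
```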

Cited by 31 publications (44 citation statements)
References 53 publications
“…We use the following instructional video datasets: HowTo100M [13] (1.2M videos), YouCook2 [14] and YouCook-Japanese. For YouCook2, we use 9,586 train clips and 3,350 validation clips as in [12]. We evaluate performance on audio to video clip retrieval and video clip to audio retrieval using the standard recall metrics R@1, R@5, R@10.…”
Section: Methods
Mentioning, confidence: 99%
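The recall metrics named in this citation statement (R@1, R@5, R@10) can be computed from a cross-modal similarity matrix. The sketch below is a generic implementation under the assumption that the i-th audio clip is paired with the i-th video clip; it is not code from the cited work.

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity[i, j] = score between query i (e.g. audio) and candidate j (video clip).

    Assumes query i's correct match is candidate i (paired clips).
    Returns {k: recall@k} for retrieval in the query -> candidate direction.
    """
    # Rank of the correct candidate for each query (0 = retrieved first).
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(similarity.shape[0])])
    return {k: float(np.mean(ranks < k)) for k in ks}

# Video-clip-to-audio retrieval uses the transposed matrix: recall_at_k(similarity.T)
```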
“…Boggust et al [31] applied an image-caption [4] model to videos, using a single image frame from entire video clips to perform video to audio retrieval. Rouditchenko et al [12] proposed the AVLnet model that pools visual information over entire clips using 2D and 3D visual CNNs. We use AVLnet trained on English HowTo100M videos and apply it to cooking videos in Japanese and images and spoken captions in Japanese and Hindi.…”
Section: Related Work
Mentioning, confidence: 99%
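The pooling described above, where visual information from an entire clip is aggregated from 2D and 3D visual CNN features, can be pictured roughly as follows. This is a hypothetical sketch: the feature dimensions, max pooling, and single projection layer are assumptions for illustration, not the released AVLnet code.

```python
import torch
import torch.nn as nn

class ClipVisualEncoder(nn.Module):
    """Illustrative only: pool per-frame 2D features and per-segment 3D features
    over a clip, concatenate them, and project into a shared embedding space."""

    def __init__(self, dim_2d=2048, dim_3d=2048, embed_dim=4096):
        super().__init__()
        self.proj = nn.Linear(dim_2d + dim_3d, embed_dim)

    def forward(self, feats_2d, feats_3d):
        # feats_2d: (batch, n_frames, dim_2d)    e.g. per-frame 2D-CNN features
        # feats_3d: (batch, n_segments, dim_3d)  e.g. per-segment 3D-CNN features
        pooled_2d = feats_2d.max(dim=1).values   # temporal pooling over the clip
        pooled_3d = feats_3d.max(dim=1).values
        clip_feat = torch.cat([pooled_2d, pooled_3d], dim=-1)
        return self.proj(clip_feat)              # clip-level visual embedding
```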
“…Models like CLIP learn image classification by matching images with their captions, contrastively [130,91,62]. Recent work has explored this paradigm for matching video frames with their transcripts [121], with their audio signal [96,114], or both [3,2]; these works likewise perform well on single-modality tasks like audio classification and activity recognition. These independent encoders can be combined through late fusion [96], yet late fusion is strictly less expressive than our proposed joint encoding (early fusion) approach.…”
Section: Related Work
Mentioning, confidence: 99%
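The late-versus-early fusion contrast drawn in this citation statement can be made concrete with a toy sketch: late fusion scores independently encoded modalities (here a dot product of separate audio and video embeddings), while early fusion feeds both modalities into one joint encoder before scoring. The module names, linear encoders, and shapes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LateFusionScorer(nn.Module):
    """Independent encoders; the modalities interact only via a final dot product."""
    def __init__(self, audio_dim, video_dim, embed_dim=512):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, embed_dim)
        self.video_enc = nn.Linear(video_dim, embed_dim)

    def forward(self, audio, video):
        return (self.audio_enc(audio) * self.video_enc(video)).sum(-1)

class EarlyFusionScorer(nn.Module):
    """Joint encoder: the modalities interact inside the network (early fusion)."""
    def __init__(self, audio_dim, video_dim, hidden=512):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, audio, video):
        return self.joint(torch.cat([audio, video], dim=-1)).squeeze(-1)
```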
“…Though some of these methods adopt expert features including object [79], sound [19,58] and speech [19] information. Compared with the most related work, Frozen [6], our RegionLearner brings significant improvements on text-tovideo retrieval.…”
Section: Pt Dataset
Mentioning, confidence: 99%