Antoine Miech scite author profile

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models are publicly available [1].

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

Miech et al. 2020

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

et al. 2021

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, Donahue, Luc et al. 2022

Preprint

Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.

Leveraging the Present to Anticipate the Future in Videos

Miech, Laptev, Šivic et al. 2019

Anticipating actions before they are executed is crucial for a wide range of practical applications including autonomous driving and robotics. While most prior work in this area requires partial observation of executed actions, in this paper we focus on anticipating actions seconds before they start. Our proposed approach is the fusion of a purely anticipatory model with a complementary model constrained to reason about the present. In particular, the latter predicts present action and scene attributes, and reasons about how they evolve over time. By doing so, we aim at modeling action anticipation at a more conceptual level than directly predicting future actions. Our model outperforms previously reported methods on the EPIC-KITCHENS and Breakfast datasets.

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Miech, Alayrac, Laptev et al. 2021

Learning from Video and Text via Large-Scale Discriminative Clustering

Miech, Alayrac, Bojanowski et al. 2017

Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. Such applications include person and action recognition, text-to-video alignment, object co-segmentation and colocalization in videos and images. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm based on the Block-Coordinate Frank-Wolfe algorithm. We apply the proposed method to the problem of weakly-supervised learning of actions and actors from movies together with corresponding movie scripts. The scaling up of the learning problem to 66 feature-length movies enables us to significantly improve weakly-supervised action recognition.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Miech, Zhukov, Alayrac et al. 2019

Preprint
