2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00272

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 milli…
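
As a rough illustration of what learning a text-video embedding involves, the sketch below projects video and text features into a shared space and trains with a max-margin ranking loss over clip/narration pairs. The feature dimensions, layer sizes, margin value, and loss formulation are illustrative assumptions, not the authors' exact model.

```python
# Minimal sketch (not the authors' code): a joint text-video embedding trained
# with a max-margin ranking loss on clip/narration pairs. Dimensions and the
# margin are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)   # projects pooled video features
        self.text_proj = nn.Linear(text_dim, embed_dim)     # projects pooled word vectors

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Max-margin ranking loss using all other pairs in the batch as negatives."""
    sim = v @ t.T                                   # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                   # similarities of matching pairs
    cost_t = (margin + sim - pos).clamp(min=0)      # video anchor, text negatives
    cost_v = (margin + sim - pos.T).clamp(min=0)    # text anchor, video negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

# Usage: a batch of pooled video features and pooled word embeddings
model = JointEmbedding()
v, t = model(torch.randn(32, 4096), torch.randn(32, 300))
loss = ranking_loss(v, t)
loss.backward()
```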

Cited by 693 publications (864 citation statements)
References 56 publications
“…(2) Scale: Compared with the recent datasets for image classification (e.g., ImageNet [18] with 1 million images) and action detection (e.g., ActivityNet v1.3 [30] with 20k videos), most existing instructional video datasets are relatively small in scale. Though the HowTo100M dataset provided a great amount of data, its automatically generated annotation might be inaccurate, as the authors mentioned in [46]. The challenge of building such a large-scale dataset mainly stems from the difficulty of organizing an enormous amount of video and the heavy workload of annotation.…”
Section: Datasets Related To Instructional Video Analysis (mentioning)
confidence: 99%
“…1) Setup for MSR-VTT: We follow the official data split, which divides MSR-VTT into three disjoint subsets used for training, validation and test, respectively. Note that in [34] and its follow-ups [16]-[18], a smaller test set of 1,000 videos randomly sampled from the full test set is used, which we refer to as test-1k.…”
Section: A. Experimental Setup (mentioning)
confidence: 99%
“…• Miech et al [18]: Use a 1D-CNN as its sentence encoder. • Dual Encoding [14]: Hierarchical encoding that combines BoW, bi-GRU and 1D-CNN.…”
Section: Experiments, Combined Loss Versus Single Loss (mentioning)
confidence: 99%
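
For context on the sentence encoders mentioned in the statement above, here is a minimal sketch of a 1D-CNN sentence encoder. The vocabulary size, kernel widths, and output dimensions are illustrative assumptions rather than the cited models' actual configurations.

```python
# Minimal sketch of a 1D-CNN sentence encoder of the kind referenced above;
# vocabulary size, kernel widths and channel counts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv1DSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=300, channels=256, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(word_dim, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)          # (B, word_dim, T)
        # Convolve with several kernel widths, max-pool over time, then concatenate
        pooled = [F.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)                   # (B, channels * len(kernel_sizes))

encoder = Conv1DSentenceEncoder()
sentence_vec = encoder(torch.randint(0, 10000, (4, 12)))  # 4 sentences of 12 tokens each
```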
“…It learns a single output embedding which is the weighted similarity between the different implicit visual-text embeddings. Recently, Miech et al [23] propose the HowTo100M dataset: a large dataset collected automatically using generated captions from YouTube 'how to' videos. They find that fine-tuning on these weakly-paired video clips allows for state-of-the-art performance on a number of different datasets.…”
Section: Related Work (mentioning)
confidence: 99%
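
The "weighted similarity between the different implicit visual-text embeddings" described in the last statement can be pictured with the small sketch below; the expert inputs, gating mechanism, and dimensions are assumptions for illustration, not the cited model's exact architecture.

```python
# Illustrative sketch (assumed design): a text-conditioned weighted sum of
# per-expert similarities, combining several visual-text embeddings into one score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedExpertSimilarity(nn.Module):
    def __init__(self, text_dim=300, expert_dims=(2048, 1024), embed_dim=256):
        super().__init__()
        self.text_projs = nn.ModuleList([nn.Linear(text_dim, embed_dim) for _ in expert_dims])
        self.video_projs = nn.ModuleList([nn.Linear(d, embed_dim) for d in expert_dims])
        self.gate = nn.Linear(text_dim, len(expert_dims))  # text decides how to weight experts

    def forward(self, text_feat, expert_feats):
        # text_feat: (B, text_dim); expert_feats: list of (B, expert_dim) tensors
        weights = F.softmax(self.gate(text_feat), dim=-1)             # (B, n_experts)
        sims = []
        for text_proj, video_proj, feat in zip(self.text_projs, self.video_projs, expert_feats):
            t = F.normalize(text_proj(text_feat), dim=-1)
            v = F.normalize(video_proj(feat), dim=-1)
            sims.append((t * v).sum(-1))                              # per-expert cosine similarity
        sims = torch.stack(sims, dim=-1)                              # (B, n_experts)
        return (weights * sims).sum(-1)                               # single weighted score

model = WeightedExpertSimilarity()
score = model(torch.randn(8, 300), [torch.randn(8, 2048), torch.randn(8, 1024)])
```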