2019
DOI: 10.1007/978-3-030-11018-5_21

Learnable Pooling Methods for Video Classification

Abstract: We introduce modifications to state-of-the-art approaches to aggregating local video descriptors by using attention mechanisms and function approximations. Rather than using ensembles of existing architectures, we provide an insight on creating new architectures. We demonstrate our solutions in the "The 2nd YouTube-8M Video Understanding Challenge", by using frame-level video and audio descriptors. We obtain testing accuracy similar to the state of the art, while meeting budget constraints, and touch upon stra…

Cited by 88 publications (173 citation statements) · References 55 publications (153 reference statements)
“…In practice, d_v = 4,096, d_c = 4,096 and d = 4,096, resulting in a model composed of 67M parameters. Note that the first term on the right-hand side in Equations (2) and (3) is a linear fully-connected layer and the second term corresponds to a context gating function [31] with an output ranging between 0 and 1, whose role is to modulate the output of the linear layer. As a result, this embedding function can model nonlinear multiplicative interactions between the dimensions of the input feature vector, which has proven effective in other text-video embedding applications [32].…”
Section: Text-video Joint Embedding Model
confidence: 99%
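
The gated embedding described in that excerpt can be sketched compactly. The module below is a minimal, hedged reconstruction: a linear fully-connected layer whose output is modulated element-wise by a sigmoid context gate [31]. The class name and exact weight shapes are assumptions; the layer sizes follow the d_v = d_c = d = 4,096 quoted above.

import torch
import torch.nn as nn

class GatedEmbeddingUnit(nn.Module):
    # Hedged sketch of the embedding described in the excerpt: a linear
    # fully-connected layer modulated element-wise by a context-gating term
    # in (0, 1) [31]. Names and shapes are assumptions, not the cited code.
    def __init__(self, in_dim=4096, out_dim=4096):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)     # first (linear) term
        self.gate = nn.Linear(out_dim, out_dim)  # context-gating term

    def forward(self, x):
        h = self.fc(x)                   # linear fully-connected layer
        g = torch.sigmoid(self.gate(h))  # gate values in (0, 1)
        return h * g                     # multiplicative modulation of the linear output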
“…This method is evaluated on higher-level activities, showing that such a visual embedding aligns well with the learned space of Word2Vec to perform zero-shot recognition of these coarser-grained classes. Miech et al. [21] found that using NetVLAD [3] results in an increase in accuracy over GRUs or LSTMs for aggregation of both visual and text features. A follow-up on this work [22] learns a mixture-of-experts embedding from multiple modalities such as appearance, motion, audio or face features.…”
Section: Related Work
confidence: 99%
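
For context, NetVLAD [3] aggregates a variable number of frame-level descriptors into one fixed-size vector by soft-assigning each descriptor to learnable cluster centres and accumulating descriptor-to-centre residuals. The sketch below is an illustrative, assumed implementation of that idea; the cluster count, dimensions, and initialisation are placeholders rather than the settings used in the cited works.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    # Minimal NetVLAD-style aggregation: soft-assign frame descriptors to
    # learnable cluster centres and accumulate the residuals per cluster.
    def __init__(self, num_clusters=64, dim=1024):
        super().__init__()
        self.assignment = nn.Linear(dim, num_clusters)            # soft-assignment weights
        self.centroids = nn.Parameter(0.01 * torch.randn(num_clusters, dim))

    def forward(self, x):
        # x: (batch, num_frames, dim) frame-level descriptors
        a = F.softmax(self.assignment(x), dim=-1)                 # (B, N, K)
        residuals = x.unsqueeze(2) - self.centroids               # (B, N, K, D)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)           # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=-1)                     # intra-normalisation
        return F.normalize(vlad.flatten(1), p=2, dim=-1)          # (B, K * D) video descriptor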
“…The CCG is based on the scene trait that, when a specific object is found in an image, the scene is very likely to belong to a particular class associated with that object. The CCG is inspired by context gating [31] and the CCM [9]. The concept of CCG is depicted in Fig.…”
Section: Fusion Of Object Feature and Scene Feature
confidence: 99%
“…where ⊙ denotes element-wise multiplication; W and b are the trainable parameters; x_object→scene is a pseudo scene feature obtained by converting the object feature into the scene feature through CCM, and σ(x) = 1 / (1 + exp(−x)) is a sigmoid function. The structure of CCG is motivated by context gating [31]. The context gating transforms the input feature into a new feature using a self-gating mechanism, and it demonstrated significant improvements in video understanding tasks.…”
Section: Fusion Of Object Feature and Scene Feature
confidence: 99%
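
The excerpt does not reproduce the full CCG equation, so the following is only a plausible reading of its description: the pseudo scene feature (the object feature converted through CCM) passes through a learned sigmoid gate that modulates the scene feature element-wise. The class name and the exact way the two features are combined are assumptions.

import torch
import torch.nn as nn

class CCG(nn.Module):
    # Assumed CCG-style gate built from the description above; the excerpt
    # omits the full equation, so this combination is illustrative only.
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim)  # trainable W and b from the excerpt

    def forward(self, scene_feat, pseudo_scene_feat):
        # sigma(W x_{object->scene} + b): gate values in (0, 1)
        gate = torch.sigmoid(self.W(pseudo_scene_feat))
        # element-wise (⊙) modulation of the scene feature
        return scene_feat * gate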