2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.662

Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description

Cited by 65 publications (34 citation statements) | References 24 publications
“…Video captioning is a widely studied problem in computer vision [22,23,24,25]. Most approaches use a CNN pre-trained on image classification or action recognition to generate features [25,24,23]. These methods, like the video understanding methods described above, utilize a frame-based feature aggregation (e.g.…”
Section: Related Work
confidence: 99%
“…The substantial difference between our model and the others assessed confirms that EtENet-IRv2 succeeds in achieving excellent results without requiring an overly complex structure, e.g., the addition of new layers as in RecNet (row 11, Table 2), or the adoption of new learning mechanisms such as reinforcement learning as in PickNet (row 3, Table 3). Moreover, this shows that it is possible to obtain excellent results even when using roughly half the frames used in other competing approaches [36,33,38,30]. Our framework sets a new standard in terms of top performances in video captioning and, we believe, can much contribute to further progress in the field.…”
Section: Discussion
confidence: 79%
“…Additionally, this is done without resorting to fancy 3D CNN architectures, thus leaving huge scope for further improvements. Moreover, unlike [38,30,15,36] which all use more than 25 frames per video clip, our model only uses 16 frames, a significant contribution in terms of memory and computational cost.…”
Section: Discussion
confidence: 99%
“…MSR-VTT [57] is a recently released dataset. We compare performance of our approach on this dataset with the latest published models such as Alto [42], RUC-UVA [15], TDDF [61], PickNet [13], M3-VC [54] and RecNet local [52]. The results are summarized in Table 4.…”
Section: Results on MSR-VTT Dataset
confidence: 99%