2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.777
Tracking by Natural Language Specification

Cited by 106 publications (172 citation statements). References 30 publications.
“…Moreover, to transfer textual information to the visual domain, we rely on dynamic convolutional filters as earlier used in [27,15]. Unlike static convolutional filters that are used in conventional neural networks, dynamic filters are generated depending on the input, in our case on the encoded sentence representation.…”
Section: Language Encoding as Dynamic Filters (mentioning)
confidence: 99%
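The dynamic-filter idea quoted above can be illustrated with a short sketch. The module name, layer sizes, tanh squashing, and grouped-convolution trick below are assumptions for illustration only, not the cited papers' exact architecture: a small network maps the encoded sentence to convolution weights, and those weights are then applied to the visual feature map, so the filter changes with every query sentence.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterLayer(nn.Module):
    """Illustrative sketch: generate a convolution filter from a sentence embedding
    and apply it to visual features (shapes and layer sizes are assumptions)."""

    def __init__(self, text_dim=1000, visual_channels=512, kernel_size=1):
        super().__init__()
        self.c = visual_channels
        self.k = kernel_size
        # Maps the encoded sentence to one set of convolution weights.
        self.filter_gen = nn.Linear(text_dim, visual_channels * kernel_size * kernel_size)

    def forward(self, visual_feats, sentence_emb):
        # visual_feats: (B, C, H, W); sentence_emb: (B, text_dim)
        b, c, h, w = visual_feats.shape
        weights = torch.tanh(self.filter_gen(sentence_emb))      # (B, C*k*k)
        weights = weights.view(b, c, self.k, self.k)             # one filter per sample
        # Grouped convolution applies each sample's own filter to its own feature map.
        response = F.conv2d(visual_feats.reshape(1, b * c, h, w),
                            weights, groups=b, padding=self.k // 2)
        return response.view(b, 1, h, w)                         # per-sentence response map

# Example usage with random tensors:
layer = DynamicFilterLayer()
feats = torch.randn(2, 512, 16, 16)
sent = torch.randn(2, 1000)
print(layer(feats, sent).shape)  # torch.Size([2, 1, 16, 16])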
“…Indeed, in [3], object detection with NLU evolved into instance segmentation using referring expressions. We review the state-of-the-art on the task of segmentation based on natural language expressions [3][4][5], highlighting the main contributions in the fusion of multimodal information, and then compare them against our approach.…”
Section: Related Work (mentioning)
confidence: 99%
“…Tracking by Natural Language Specification [5]. In this paper, the main task is object tracking in video sequences.…”
Section: Recurrent Multimodal Interaction (mentioning)
confidence: 99%
“…In this paper, we propose a new representation of videos that, as in the first examples, encodes the data in a general and content-agnostic manner, resulting in a long-term, robust motion representation applicable not only to action recognition, but to other video analysis tasks as well [39], [64]. This new representation distills the motion information contained in all the frames of a video into a single image, which we call the dynamic image.…”
Section: Introduction (mentioning)
confidence: 99%
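The dynamic image mentioned in this excerpt is commonly approximated by a fixed weighted sum over the video frames. The sketch below uses the standard approximate rank-pooling weights (2t - T - 1) from the dynamic-image literature; the exact construction in the cited work may differ, and the rescaling to 0-255 is only for visualization.

import numpy as np

def dynamic_image(frames):
    """frames: float array of shape (T, H, W, C).
    Returns one image summarizing the clip's motion (approximate rank pooling)."""
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    alphas = 2.0 * t - T - 1.0                      # later frames get larger weights
    di = np.tensordot(alphas, frames, axes=(0, 0))  # weighted sum over time -> (H, W, C)
    # Rescale to a displayable 0-255 range (visualization only).
    di = (di - di.min()) / (di.max() - di.min() + 1e-8) * 255.0
    return di.astype(np.uint8)

# Example usage with a random clip of 16 frames:
clip = np.random.rand(16, 64, 64, 3).astype(np.float32)
print(dynamic_image(clip).shape)  # (64, 64, 3)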