We propose an end-to-end approach for phrase grounding in images. Unlike prior methods that typically attempt to ground each phrase independently by building an imagetext embedding, our architecture formulates grounding of multiple phrases as a sequential and contextual process. Specifically, we encode region proposals and all phrases into two stacks of LSTM cells, along with so-far grounded phrase-region pairs. These LSTM stacks collectively capture context for grounding of the next phrase. The resulting architecture, which we call SeqGROUND, supports many-to-many matching by allowing an image region to be matched to multiple phrases and vice versa. We show competitive performance on the Flickr30K benchmark dataset and, through ablation studies, validate the efficacy of sequential grounding as well as individual design choices in our model architecture.
The alignment of heterogeneous sequential data (video to text) is an important and challenging problem. Standard techniques for this task, including Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from inherent drawbacks. Mainly, the Markov assumption implies that, given the immediate past, future alignment decisions are independent of further history. The separation between similarity computation and alignment decision also prevents end-to-end training. In this paper, we propose an end-to-end neural architecture where alignment actions are implemented as moving data between stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture supports a large variety of alignment tasks, including one-to-one, one-to-many, skipping unmatched elements, and (with extensions) non-monotonic alignment. Extensive experiments on semi-synthetic and real datasets show that our algorithm outperforms state-of-the-art baselines. * The technique was conceived when all authors worked for Disney Research.arXiv:1803.00057v2 [cs.CV] 9 Apr 2018Elrond addresses the council. Frodo steps forward and moves towards a stone plinth.He places the ring on the plinth and returns to his seat.Boromir turns sharply. Frodo looks at someone questioningly. null null
In a photo, motion blur can be used as an artistic style to convey motion and to direct attention. In panning or tracking shots, a moving object of interest is followed by the camera during a relatively long exposure. The goal is to get a blurred background while keeping the object sharp. Unfortunately, it can be difficult to impossible to precisely follow the object. Often, many attempts or specialized physical setups are needed. This paper presents a novel approach to create such images. For capturing, the user is only required to take a casually recorded hand‐held video that roughly follows the object. Our algorithm then produces a single image which simulates a stabilized long time exposure. This is achieved by first warping all frames such that the object of interest is aligned to a reference frame. Then, optical flow based frame interpolation is used to reduce ghosting artifacts from temporal undersampling. Finally, the frames are averaged to create the result. As our method avoids segmentation and requires little to no user interaction, even challenging sequences can be processed successfully. In addition, artistic control is available in a number of ways. The effect can also be applied to create videos with an exaggerated motion blur. Results are compared with previous methods and ground truth simulations. The effectiveness of our method is demonstrated by applying it to hundreds of datasets. The most interesting results are shown in the paper and in the supplemental material.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.