Video Object Segmentation with Language Referring Expressions

Khoreva, Anna; Rohrbach, Anna; Schiele, Bernt

doi:10.1007/978-3-030-20870-7_8

Cited by 79 publications

(110 citation statements)

References 52 publications

Supporting

Mentioning

106

Contrasting


“…Referring expression recognition. Grounding a short phrase or a sentence into a visual modality such as video (Khoreva et al, 2018; Anayurt et al, 2019) or imagery (Kong et al, 2014; Plummer et al, 2015, 2018; Yu et al, 2018a) is a well studied problem in intelligent user interfaces (Chai et al, 2004), human-robot interaction (Fang et al, 2012; Chai et al, 2014; Williams et al, 2016), and situated dialogue (Kennington and Schlangen, 2017). Kazemzadeh et al (2014), Hu et al (2017a), and Mao et al (2016) introduce two benchmark datasets for the real-world 2D images.…”

Section: Related Work (mentioning)

confidence: 99%

Refer360°: A Referring Expression Recognition Dataset in 360° Images

Cirik

Berg-Kirkpatrick

Morency

2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


We propose a novel large-scale referring expression recognition dataset, Refer360°, consisting of 17,137 instruction sequences and ground-truth actions for completing these instructions in 360° scenes. Refer360° differs from existing related datasets in three ways. First, we propose a more realistic scenario where instructors and the followers have partial, yet dynamic, views of the scene: followers continuously modify their field-of-view (FoV) while interpreting instructions that specify a final target location. Second, instructions to find the target location consist of multiple steps for followers who will start at random FoVs. As a result, intermediate instructions are strongly grounded in object references and followers must identify intermediate FoVs to find the final target location correctly. Third, the target locations are neither restricted to predefined objects nor chosen by annotators; instead, they are distributed randomly across scenes. This "point anywhere" approach leads to more linguistically complex instructions, as shown in our analyses. Our examination of the dataset shows that Refer360° manifests linguistically rich phenomena in a language grounding task that poses novel challenges for computational modeling of language, vision, and navigation.


“…As a result, this method adapted well to the changes of the foreground objects in the video. Khoreva et al presented a method using language referring expressions to identify a target object for video object segmentation [15]. Given referring expression, they first localized the target object via the grounding model and enforced temporal consistency of bounding boxes across frames.…”

Section: Related Work (mentioning)

confidence: 99%

Video Object Segmentation with Weakly Temporal Information

Zhang

Yao

Jiang

et al. 2019

KSII TIIS


Video object segmentation is a significant task in computer vision, but its performance is not very satisfactory. A method of video object segmentation using weakly temporal information is presented in this paper. Motivated by the phenomenon in reality that the motion of the object is a continuous and smooth process and the appearance of the object does not change much between adjacent frames in the video sequences, we use a feed-forward architecture with motion estimation to predict the mask of the current frame. We extend an additional mask channel for the previous frame segmentation result. The mask of the previous frame is treated as the input of the expanded channel after processing, and then we extract the temporal feature of the object and fuse it with other feature maps to generate the final mask. In addition, we introduce multi-mask guidance to improve the stability of the model. Moreover, we enhance segmentation performance by further training with the masks already obtained. Experiments show that our method achieves competitive results on DAVIS-2016 on single object segmentation compared to some state-of-the-art algorithms.


“…Figure 7 shows an example of the video referring expression comprehension. Another approach by Khoreva et al (2018) explored Language Referring Expressions to…”

Section: Video Referring Expression Comprehension and Generation - Intro (mentioning)

confidence: 99%

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Mogadala

Kalimuthu

Klakow

2021

JAIR


Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulation, methods, existing datasets, evaluation measures, and compare the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research with an anticipation that this survey stimulates innovative thoughts and ideas to address the existing challenges and build new applications.

