2019
DOI: 10.1007/978-3-030-20870-7_8

Video Object Segmentation with Language Referring Expressions

Abstract: Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift as well as …

Cited by 79 publications (110 citation statements)
References 52 publications
“…Referring expression recognition. Grounding a short phrase or a sentence into a visual modality such as video (Khoreva et al, 2018; Anayurt et al, 2019) or imagery (Kong et al, 2014; Plummer et al, 2015, 2018; Yu et al, 2018a) is a well studied problem in intelligent user interfaces (Chai et al, 2004), human-robot interaction (Fang et al, 2012; Chai et al, 2014; Williams et al, 2016), and situated dialogue (Kennington and Schlangen, 2017). Kazemzadeh et al (2014), Hu et al (2017a), and Mao et al (2016) introduce two benchmark datasets for the real-world 2D images.…”
Section: Related Work (mentioning)
confidence: 99%
“…As a result, this method adapted well to the changes of the foreground objects in the video. Khoreva et al presented a method using language referring expressions to identify a target object for video object segmentation [15]. Given referring expression, they first localized the target object via the grounding model and enforced temporal consistency of bounding boxes across frames.…”
Section: Related Work (mentioning)
confidence: 99%
“…Figure 7 shows an example of the video referring expression comprehension. Another approach by Khoreva et al (2018) explored Language Referring Expressions to…”
Section: Video Referring Expression Comprehension and Generation - Intro (mentioning)
confidence: 99%