2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00140
Panoptic Narrative Grounding

Cited by 11 publications (18 citation statements)
References 34 publications
“…Tasks Related to VNG. Panoptic Narrative Grounding (PNG) [15] creates a panoptic segmentation that grounds the nouns of an input caption describing an image. In contrast, our proposed VNG operates on videos and focuses on concrete objects only.…”
Section: Related Work
confidence: 99%
“…Specifically, the PNG task seeks to segment objects and regions in an image corresponding to nouns in its long text description. Numerous studies have been conducted on this task [10,13,53]. González et al [13] first introduced this new task, establishing a benchmark that includes new standard data and evaluation methods, and proposed a robust baseline method as the foundation for future work.…”
Section: Related Work 2.1 Panoptic Narrative Grounding
confidence: 99%
“…Following the labeling budget calculation in [24], on average, it takes approximately 79.1 seconds to segment a single mask. With each PNG example containing an average of 5.1 nouns requiring segmentation annotations [13], this time expenditure increases to 403.4 seconds. This considerable constraint hampers dataset expansion and further limits model performance.…”
Section: Introduction
confidence: 99%
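The labeling-budget figure in the excerpt above is simple arithmetic: roughly 79.1 seconds to segment one mask, times an average of 5.1 annotated nouns per PNG example. A minimal sketch verifying the quoted total (the variable names are illustrative, and the input values are taken directly from the excerpt):

```python
# Values quoted in the citation excerpt above (from [24] and [13]).
seconds_per_mask = 79.1      # average time to segment a single mask
nouns_per_example = 5.1      # average nouns needing masks per PNG example

# Per-example annotation cost, as stated in the excerpt.
seconds_per_example = seconds_per_mask * nouns_per_example
print(round(seconds_per_example, 1))  # 403.4
```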
“…Along the same line, various datasets can help to facilitate knowledge embedding associated with natural language ones such as CLIP [30], VisualComet [28] and VCR [41]. On the other hand, there are many text-based datasets that can be enriched with visual data such as [36], [9] and [21]. To this end, the next challenge for our framework is how to leverage such rich correlated information among datasets and learning tasks to automate the training algorithms to make it faster, more efficient and more robust in building AI component powered by L KG .…”
Section: A Case Study of Vision Knowledge Graph
confidence: 99%