2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00140
Panoptic Narrative Grounding

Cited by 11 publications (18 citation statements)
References 34 publications
“…Tasks Related to VNG. Panoptic Narrative Grounding (PNG) [15] creates a panoptic segmentation that grounds the nouns of an input caption describing an image. In contrast, our proposed VNG operates on videos and focuses on concrete objects only.…”
Section: Related Work
confidence: 99%
“…Specifically, the PNG task seeks to segment objects and regions in an image corresponding to nouns in its long text description. Numerous studies have been conducted on this task [10,13,53]. González et al [13] first introduced this new task, establishing a benchmark that includes new standard data and evaluation methods, and proposed a robust baseline method as the foundation for future work.…”
Section: Related Work 2.1 Panoptic Narrative Grounding
confidence: 99%
“…Following the labeling budget calculation in [24], on average, it takes approximately 79.1 seconds to segment a single mask. With each PNG example containing an average of 5.1 nouns requiring segmentation annotations [13], this time expenditure increases to 403.4 seconds. This considerable constraint hampers dataset expansion and further limits model performance.…”
Section: Introduction
confidence: 99%
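The labeling-budget figure in the excerpt above is simple arithmetic: roughly 79.1 seconds to segment one mask, times an average of 5.1 annotated nouns per PNG example. A minimal sketch verifying the quoted total (the variable names are illustrative, and the input values are taken directly from the excerpt):

```python
# Values quoted in the citation excerpt above (from [24] and [13]).
seconds_per_mask = 79.1      # average time to segment a single mask
nouns_per_example = 5.1      # average nouns needing masks per PNG example

# Per-example annotation cost, as stated in the excerpt.
seconds_per_example = seconds_per_mask * nouns_per_example
print(round(seconds_per_example, 1))  # 403.4
```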
“…Along the same line, various datasets can help to facilitate knowledge embedding associated with natural language ones such as CLIP [30], VisualComet [28] and VCR [41]. On the other hand, there are many text-based datasets that can be enriched with visual data such as [36], [9] and [21]. To this end, the next challenge for our framework is how to leverage such rich correlated information among datasets and learning tasks to automate the training algorithms to make it faster, more efficient and more robust in building AI component powered by L KG .…”
Section: A Case Study of Vision Knowledge Graph
confidence: 99%