“…The Panoptic Narrative Grounding (PNG) task is rapidly gaining prominence as a critical area of research in the multimodal domain [11,36,37,52,58,59]. This task aims to generate a pixel-level mask for each noun present in a given long sentence, providing a more fine-grained understanding compared to other cross-modal tasks, such as image captioning [6,35,42,51,62], visual question answering [23,47,57,73], and referring expression comprehension/segmentation [5,19,28,29,30,33].…”