Scene Consistency Representation Learning for Video Scene Segmentation

Wu, Haoqian; Chen, Keyu; Luo, Yukun; Qiao, Ruizhi; Ren, Bo; Liu, Haozhe; Xie, Weicheng; Shen, Linlin

doi:10.1109/cvpr52688.2022.01363

Cited by 17 publications

(36 citation statements)

References 28 publications


“…We constructed a machine learning pipeline (Figure 2) to analyze the video at both the frame and object level, and identify keyframes. Noticeably, although models like [24,71] can detect scene transitions in an end-to-end way, their priority is based on the visual similarity of different scenes. On the contrary, our aim is to offer users a richer exploration of varied objects, ensuring that keyframes are densely populated throughout the video for comprehensive exploration.…”

Section: Keyframe Detection and Description Generation Pipeline (mentioning)

confidence: 99%

SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers

Ning, Wimer, Jiang et al. 2024

Proceedings of the CHI Conference on Human Factors in Computing Systems


Blind or Low-Vision (BLV) users often rely on audio descriptions (AD) to access video content. However, conventional static ADs can leave out detailed information in videos, impose a high mental load, neglect the diverse needs and preferences of BLV users, and lack immersion. To tackle these challenges, we introduce Spica, an AI-powered system that enables BLV users to interactively explore video content. Informed by prior empirical studies on BLV video consumption, Spica offers interactive mechanisms for supporting temporal navigation of frame captions and spatial exploration of objects within key frames. Leveraging an audio-visual machine learning pipeline, Spica augments existing ADs by adding interactivity, spatial sound effects, and individual object descriptions without requiring additional human annotation. Through a user study with 14 BLV participants, we evaluated the usability and usefulness of Spica and explored user behaviors, preferences, and mental models when interacting with augmented ADs.

CCS Concepts: • Human-centered computing → Auditory feedback; Accessibility technologies; Accessibility systems and tools; • Computing methodologies → Scene understanding.



“…Subsequently, they maximize the similarity between the query and the positive key while minimizing the query’s similarity with a set of randomly selected shots. For the positive key selection, Wu et al. suggested the scene consistency selection approach [34], which enables the selection to accomplish a more challenging goal. They create a soft positive sample using query-specific individual information and an online clustering of samples in a batch to produce a positive sample.…”

Section: Related Work (mentioning)

confidence: 99%
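
The objective described in this citation statement (pull the query toward a positive key, push it away from randomly selected shots) can be illustrated with a generic InfoNCE-style loss. The sketch below is only an illustration under stated assumptions, not the formulation of Wu et al. or of TELNet: the function names, the temperature value, and in particular the `soft_positive` mixing rule (a weighted average of the query and a cluster centroid) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when the query is close to the positive key
    and far from the negative keys (here standing in for random shots)."""
    pos_logit = cosine(query, positive) / temperature
    logits = [pos_logit] + [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - pos_logit


def soft_positive(query, centroid, weight=0.5):
    """Hypothetical soft positive: blend query-specific information with the
    centroid of the cluster the query was assigned to (assumed mixing rule)."""
    return [weight * q + (1 - weight) * c for q, c in zip(query, centroid)]
```

With a two-dimensional toy embedding, `info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])` is close to zero, while swapping the positive and the negative gives a much larger loss.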

Video Scene Detection Using Transformer Encoding Linker Network (TELNet)

Tseng, Yeh et al. 2023

Sensors


This paper introduces a transformer encoding linker network (TELNet) for automatically identifying scene boundaries in videos without prior knowledge of their structure. Videos consist of sequences of semantically related shots or chapters, and recognizing scene boundaries is crucial for various video processing tasks, including video summarization. TELNet utilizes a rolling window to scan through video shots, encoding their features extracted from a fine-tuned 3D CNN model (transformer encoder). By establishing links between video shots based on these encoded features (linker), TELNet efficiently identifies scene boundaries where consecutive shots lack links. TELNet was trained on multiple video scene detection datasets and demonstrated results comparable to other state-of-the-art models in standard settings. Notably, in cross-dataset evaluations, TELNet demonstrated significantly improved results (F-score). Furthermore, TELNet’s computational complexity grows linearly with the number of shots, making it highly efficient in processing long videos.
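
The abstract's idea of placing scene boundaries where consecutive shots lack links can be sketched as follows. This is a minimal illustration, not TELNet's actual linker: it assumes the links have already been predicted and are given as pairs of shot indices, and it assumes that a gap between shots k and k+1 counts as "linked" whenever some link spans it. The function name `scene_boundaries` is hypothetical.

```python
def scene_boundaries(num_shots, links):
    """Return the gaps k (between shot k and shot k+1) that no link spans.

    `links` is an iterable of (shot_a, shot_b) index pairs, assumed to come
    from some upstream linking model.
    """
    covered = [False] * max(num_shots - 1, 0)
    for a, b in links:
        lo, hi = min(a, b), max(a, b)
        # A link between shots lo and hi bridges every gap lo .. hi-1.
        for gap in range(lo, hi):
            covered[gap] = True
    return [gap for gap, is_covered in enumerate(covered) if not is_covered]
```

For six shots with links `[(0, 2), (1, 2), (3, 4), (4, 5)]`, only the gap after shot 2 is left unbridged, so the sketch reports a single boundary there. The cost is linear in the number of shots when each link stays inside a fixed-size window.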


“…http://kaldir.vc.in.tum.de/scannet benchmark/data efficient/ Pre-training for 3D Representation Learning. Many recent works propose to pre-train networks on source datasets with auxiliary tasks such as low-level point cloud geometric registration [27], 3D local structural prediction [78], the completion of the occluded point clouds [79], and the foreground-background feature discrimination [58], with effective learning strategies such as contrastive learning [27] and masked generative modelling [80], [81]. Then they finetune the weights of the trained networks for the downstream target tasks to boost their performances.…”

Section: Related Work (mentioning)

confidence: 99%

“…The convex decomposition [103] is conducted in an approximate manner to perform 3D scene parsing on the object parts. More approaches [104] have been proposed recently, which utilize class prototypes and masked point cloud modeling [81], [105], [106] to learn informative representations for downstream 3D scene understanding. To sum up, although approaches have been proposed to alleviate the data efficiency problem, the models for weakly supervised learning lack the capacity to recognize novel categories beyond the labeled training set.…”

Section: Related Work (mentioning)

confidence: 99%

RM3D: Robust Data-Efficient 3D Scene Parsing via Traditional and Learnt 3D Descriptors-Based Semantic Region Merging

Liu 2022

Int J Comput Vis


Existing state-of-the-art 3D point cloud understanding methods merely perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is that we have done extensive methodology comparisons of traditional and learned 3D descriptors for the task of weakly supervised 3D scene understanding, and validated that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is that we proposed a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), has superior performance on ScanNet data-efficient learning online benchmarks and other four large-scale 3D understanding benchmarks under various experimental settings, outperforming current arts by a margin for various 3D understanding tasks without complicated learning strategies such as active learning.
