Where Are You Looking?

Jin, Yili; Liu, Junhua; Wang, Fangxin; Cui, Shuguang

doi:10.1145/3503161.3548200

Cited by 12 publications

(3 citation statements)

References 33 publications

“…Text-to-video Retrieval. Video analysis (Wang et al. 2023, 2022; Zeng et al. 2022; Liu et al. 2023b,a; Jin et al. 2022) has recently gained much attention due to the increasing video data on the Internet. Among them, the text-to-video retrieval (T2VR) task (Dong, Li, and Snoek 2018; Chen et al. 2020; Li et al. 2019; Faghri et al. 2017; Gao et al. 2023; Lei, Berg, and Bansal 2021; Li et al. 2023) aims to retrieve relevant videos from a set of pre-trimmed video clips given a text description.…”

Section: Related Work (mentioning)

confidence: 99%

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

Wang, Chen, et al. 2024

AAAI

Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. Then generated representations will contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and contain more semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer.

“…The VR headset has a built-in accelerometer and we are able to easily calculate the current headset position (X,Y,Z) and the rotation of the headset (yaw, pitch, and roll). Besides, gaze information is also important as it provides more fine-grained features [12]. For the gaze data collection, we rely on the built-in eye tracker in the headset with a sample rate of 144 Hz.…”

Section: Data Collection Procedures (mentioning)

confidence: 99%

“…The key difference between volumetric video compared with traditional 2D flat video [4,5] lies in the 3D representation, where the commonly used formats are point cloud, mesh, voxel, and the recent implicit neural representation. Among all these representations, point cloud is currently the most popular due to its simplicity and easy deployment [6].…”

Section: Introduction (mentioning)

confidence: 99%

The Shenzhen-Hong Kong Dialectics

Hu 2020

The Shenzhen Phenomenon

Recent years have witnessed a rapid development of immersive multimedia which bridges the gap between the real world and virtual space. Volumetric videos, as an emerging representative 3D video paradigm that empowers extended reality, stand out to provide unprecedented immersive and interactive video watching experience. Despite the tremendous potential, the research towards 3D volumetric video is still in its infancy, relying on sufficient and complete datasets for further exploration. However, existing related volumetric video datasets mostly only include a single object, lacking details about the scene and the interaction between them. In this paper, we focus on the current most widely used data format, point cloud, and for the first time release a full-scene volumetric video dataset that includes multiple people and their daily activities interacting with the external environments. Comprehensive dataset description and analysis are conducted, with potential usage of this dataset. The dataset and additional tools can be accessed via the following website: https://cuhkszinml.github.io/full_scene_volumetric_video_dataset/.
