2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01108

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Abstract: This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. Previous works either compare pre-defined candidate segments with the query and select the best one by ranking, or directly regress the boundary timestamps of the target segment. In this paper, we propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine m…
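The abstract describes scoring every (start, end) index pair with a biaffine mechanism. The paper's actual architecture is not reproduced here; the following is a minimal NumPy sketch of a generic biaffine scorer over boundary features, where the shapes, weight names (`W`, `b`), and the upper-triangular validity mask are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def biaffine_scores(starts, ends, W, b):
    # starts: (T, d) start-boundary features; ends: (T, d) end-boundary features.
    # W: (d, d) bilinear weight; b: (2d,) linear weight over [start; end].
    # Returns a (T, T) map where entry (i, j) scores the segment from i to j.
    d = starts.shape[1]
    bilinear = starts @ W @ ends.T                       # (T, T) bilinear term
    linear = starts @ b[:d][:, None] + (ends @ b[d:][:, None]).T
    return bilinear + linear

rng = np.random.default_rng(0)
T, d = 8, 16
S, E = rng.normal(size=(T, d)), rng.normal(size=(T, d))
scores = biaffine_scores(S, E, rng.normal(size=(d, d)), rng.normal(size=2 * d))

# Only pairs with j >= i are valid segments; take the argmax over that region.
mask = np.triu(np.ones((T, T), dtype=bool))
i, j = np.unravel_index(np.where(mask, scores, -np.inf).argmax(), scores.shape)
```

Scoring all pairs jointly is what distinguishes this formulation from ranking pre-defined proposals or regressing two boundary timestamps independently.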

Cited by 94 publications (41 citation statements) · References 46 publications
“…DORi [119] incorporates appearance features and captures the relations between objects and actions guided by the query. CBLN [120] addresses TSGV from a new perspective: it reformulates TSGV by scoring all pairs of start and end indices simultaneously and predicts the moment with a biaffine structure.…”
Section: Span-based Methods
Mentioning confidence: 99%
“…2) Query encoder: Following previous works [16], [20], [78], we first employ the GloVe model [79] to embed each word of the given sentence query into a dense vector. Then, we use multi-head self-attention [80] and Bi-GRU [81] modules to encode its sequential information.…”
Section: Video and Query Encoders, 1) Video Encoder
Mentioning confidence: 99%
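The excerpt above describes a query encoder built from GloVe embeddings, self-attention, and a Bi-GRU. As a hedged illustration of the attention step only, here is a single-head scaled dot-product self-attention over word features in NumPy; the weight matrices `Wq`, `Wk`, `Wv` and all shapes are assumptions for the sketch, not the cited paper's configuration (which uses multi-head attention).

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over word features X: (T, d).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    # Numerically stable row-wise softmax over attention logits.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V  # each word becomes a weighted mix of all words

rng = np.random.default_rng(1)
T, d = 6, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

A Bi-GRU would then run over `out` in both directions to capture sequential order, which pure attention does not encode by itself.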
“…At last, we apply grounding heads on the feature H to predict the target segment semantically corresponding to the query. Many grounding heads have been proposed in recent years: proposal-ranking grounding heads [18], [20], [71], [78] and boundary-regression grounding heads [25]–[27]. In this paper, we follow the former [18], [20], [74] to determine the target video segment with pre-defined segment proposals.…”
Section: F. Grounding Head
Mentioning confidence: 99%
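The excerpt contrasts proposal-ranking heads with boundary-regression heads. To make the former concrete, here is a minimal sketch of ranking pre-defined segment proposals against an encoded query by cosine similarity; the function name, the similarity choice, and all shapes are illustrative assumptions, not the cited paper's actual head.

```python
import numpy as np

def rank_proposals(proposal_feats, query_feat):
    # proposal_feats: (N, d) features of N pre-defined segment proposals.
    # query_feat: (d,) encoded sentence query.
    # Cosine-similarity ranking; the top-scoring proposal is the prediction.
    p = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = p @ q
    return scores.argmax(), scores

rng = np.random.default_rng(2)
feats = rng.normal(size=(10, 16))   # e.g. 10 candidate segments
q = rng.normal(size=16)
best, scores = rank_proposals(feats, q)
```

A boundary-regression head would instead output two continuous timestamps directly, trading the coverage guarantees of a proposal set for the ability to predict arbitrary boundaries.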
“…State-of-the-art video grounding methods [11,12,14,25,32,34,35] have relied on existing benchmarks to design novel modules (e.g. proposal generation, context modeling, and multi-modality fusion).…”
Section: Related Work
Mentioning confidence: 99%