Boundary-sensitive Pre-training for Temporal Localization in Videos

Xu, Mantao; Pérez-Rúa, Juan-Manuel; Escorcia, Víctor; Martínez, Brais; Zhu, Xiatian; Zhang, Li; Ghanem, Bernard; Xiang, Tao

doi:10.48550/arxiv.2011.10830

Cited by 4 publications

(6 citation statements)

References 0 publications

Section: Video Encoders in TAL (mentioning)

confidence: 99%

“…Very recent works [2,50] have exploited some of the aforementioned techniques for better pre-training of action localization models. For example, localization-tailored data augmentation and classification is adopted by [50]. However, these works introduce a large amount of extra video data and additional stream networks, both of which are expensive in terms of memory and computation.…”

Section: Video Encoders in TAL (mentioning)

confidence: 99%

“…Video analysis has recently become an important area of research, encompassing multiple relevant problems such as action recognition [9,14], temporal action localization [1,8,13,20,50,51], and video question answering […] This standard training method leads to a task discrepancy issue with the video encoder, which is trained for video classification but used for TAL. To overcome this limitation, we introduce an extra stage in-between that optimizes both the video encoder and the TAL head end-to-end at a low temporal and/or spatial resolution (i.e., low-fidelity) subject to the same GPU memory constraints (bottom-middle circle).…”

Section: Introduction (mentioning)

confidence: 99%

Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

Pérez-Rúa, Zhu, et al. 2021. Preprint. Self cite.

Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder, which is trained for action classification but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi-E2E) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backwards through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi-E2E pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin.
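To make the resolution trade-off in this abstract concrete, the following is a minimal sketch of how the size of one input mini-batch shrinks when temporal and spatial resolution are reduced. The function name `clip_elements` and all frame counts and resolutions are hypothetical values chosen for illustration; they are not taken from the paper, and input size is only a rough proxy for the activation memory of a video encoder.

```python
def clip_elements(batch, frames, height, width, channels=3):
    """Number of input values in one mini-batch of RGB video clips."""
    return batch * frames * height * width * channels


# Hypothetical "full" configuration: long clip at a common image resolution.
full = clip_elements(batch=1, frames=256, height=224, width=224)

# Hypothetical low-fidelity configuration: 4x fewer frames, half the side length.
lofi = clip_elements(batch=1, frames=64, height=112, width=112)

# Temporal reduction (4x) and spatial reduction (2x per side) multiply together.
reduction = full / lofi
print(f"full: {full:,} values, low-fidelity: {lofi:,} values, ratio: {reduction:.0f}x")
```

The point of the sketch is only that temporal and spatial reductions compound, which is why a spatio-temporal reduction can free enough memory for gradients to reach the encoder.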

“…Intuitively, performing feature re-calibration for task-specific features is a way to tackle this problem. Instead of finetuning the feature extractor [2,8,48] with high time and computation cost, we explore to re-calibrate the features in a more efficient manner. In this work, our intuition is simple: the RGB and FLOW features contain modal-specific information (i.e., appearance and motion information) from different perspectives of the given data.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Hong, Feng, et al. 2021. Proceedings of the 29th ACM International Conference on Multimedia.

Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., through feature concatenation or score-level fusion. In this work, we argue that the features extracted from pre-trained extractors such as I3D are trained for trimmed-video action classification rather than for the WS-TAL task, which leads to inevitable redundancy and sub-optimization. Therefore, feature re-calibration is needed to reduce the task-irrelevant information redundancy. Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCMs), each of which uses a cross-modal attention mechanism to filter out the task-irrelevant information redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further explore inter-modality consistency: we treat the attention weights derived from each CCM as the pseudo targets of the attention weights derived from the other CCM to maintain consistency between the predictions of the two CCMs, forming a mutual learning scheme. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, on which we achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.

“…For temporal action detection task, we need to localize and classify the target actions simultaneously. Current mainstream approaches [10,20,19] are designed in a two-stage pipeline, i.e., proposal generation and action classification, and have achieved remarkable performance. Therefore, we follow this paradigm to design the solution of this challenge.…”

Section: Introduction (mentioning)

confidence: 99%

Proposal Relation Network for Temporal Action Detection

Wang, Qing, Huang, et al. 2021. Preprint.

This technical report presents our solution for the temporal action detection task in the ActivityNet Challenge 2021. The purpose of this task is to locate and identify actions of interest in long untrimmed videos. The crucial challenge of the task is that the temporal duration of actions varies dramatically, and the target actions are typically embedded in a background of irrelevant activities. Our solution builds on BMN [10] and mainly contains three steps: 1) action classification and feature encoding by Slowfast [6], CSN [13] and ViViT [1]; 2) proposal generation, where we improve BMN by embedding the proposed Proposal Relation Network (PRN), by which we can generate proposals of high quality; 3) action detection, where we calculate the detection results by assigning the corresponding classification results to the proposals. Finally, we ensemble the results under different settings and achieve 44.7% on the test set, which improves the champion result in ActivityNet 2020 [17] by 1.9% in terms of average mAP.
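The third step of this pipeline, attaching classification results to class-agnostic proposals, can be sketched as follows. This is an illustrative assumption rather than the report's implementation: the function `assign_classes`, the product fusion of proposal and class scores, the `top_k` parameter, and all sample values are hypothetical.

```python
def assign_classes(proposals, class_scores, top_k=2):
    """Turn class-agnostic proposals into labelled detections.

    proposals:    list of (start_sec, end_sec, proposal_score)
    class_scores: dict mapping label -> video-level classification score
    Returns (start, end, label, score) tuples, highest score first.
    Assumption: the detection score is proposal_score * class_score.
    """
    # Keep only the top_k most confident video-level classes.
    top = sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    detections = [
        (start, end, label, p_score * c_score)
        for start, end, p_score in proposals
        for label, c_score in top
    ]
    return sorted(detections, key=lambda d: d[3], reverse=True)


# Hypothetical proposals and video-level class scores.
proposals = [(2.0, 7.5, 0.9), (10.0, 12.0, 0.5)]
class_scores = {"long_jump": 0.8, "high_jump": 0.1, "javelin": 0.05}

detections = assign_classes(proposals, class_scores, top_k=1)
for start, end, label, score in detections:
    print(f"{label}: {start:.1f}s to {end:.1f}s, score {score:.2f}")
```

Keeping proposal generation and classification separate in this way is what lets a two-stage pipeline swap or ensemble classifiers without regenerating proposals.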
