BMN: Boundary-Matching Network for Temporal Action Proposal Generation

Lin, Tianwei; Liu, Xiao; Li, Xin; Ding, Errui; Wen, Shilei

doi:10.1109/iccv.2019.00399

Cited by 561 publications

(562 citation statements)

References 29 publications

Supporting

Mentioning: 547

Contrasting


“…For clarification, +ODTC denotes first performing ordering-dependency regularization then executing task-consistency method, while +TCOD is the other way round. (4) R-C3D [74], BSN [40] and BMN [39] with TC and OD. We further plugged our TC and OD methods into these action detection models to verify their generalization ability.…”

Section: Evaluation On Step Localization (mentioning)

confidence: 99%

“…(2) Can the proposed task-consistency and ordering-dependency methods be applied to other action detection models? Since our proposed TC and OD are two plug-and-play methods, we further validate them on the R-C3D [74], BSN [40] and BMN [39] models. From Table 8 we can see that both TC and OD could improve the performance of various basic models, which further demonstrate the effectiveness of our proposed methods.…”

Section: Evaluation On Step Localization (mentioning)

confidence: 99%

“…Then we employed the backbone to the SSN model [80] for the Breakfast, JIGSAWS or UNLV-diving dataset. Besides, we include the results based on two state-of-the-art methods (i.e., BMN [39] and BSN [40]) to see if the improvements are subtle or significant.…”

Section: Cross Dataset Transfer (mentioning)

confidence: 99%


Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation

Tang, Lu, Zhou 2021

IEEE Trans. Pattern Anal. Mach. Intell.


Thanks to the substantial and explosively increased instructional videos on the Internet, novices are able to acquire knowledge for completing various tasks. Over the past decade, growing efforts have been devoted to investigating the problem of instructional video analysis. However, most existing datasets in this area have limitations in diversity and scale, which makes them far from many real-world applications where more diverse activities occur. To address this, we present a large-scale dataset named "COIN" for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated efficiently with a series of step labels and the corresponding temporal boundaries. In order to provide a benchmark for instructional video analysis, we evaluate plenty of approaches on the COIN dataset under five different settings. Furthermore, we exploit two important characteristics (i.e., task-consistency and ordering-dependency) for localizing important steps in instructional videos. Accordingly, we propose two simple yet effective methods, which can be easily plugged into conventional proposal-based action detection models. We believe the introduction of the COIN dataset will promote future in-depth research on instructional video analysis for the community. Our dataset, annotation toolbox and source code are available at http://coin-dataset.github.io.

[Figure: example of the COIN hierarchy (Domain, Task, Step). Domain "Vehicles", task "Change the Car Tire", steps {unscrew the screws, jack up the car, remove the tire, put on the tire, tighten the screws}. Domain "Household Items", task "Replace the Door Knob", steps {remove the door knob, remove bolt and pin board, install new pin board, install new bolt, install new door knob}.]


“…Instead, we employ an average pooling layer to turn $f$ into a series of consecutive basic moments $f^{\text{base}} \in \mathbb{R}^{L \times d}$, where $L \ll T$ is the number of basic moments. With these low-resolution basic moments, we construct the 2D candidate map $F_c^0 \in \mathbb{R}^{L \times L \times d}$ as the candidate-level representation inspired from [20,48]. Specifically, we denote the $(i, j)$-th element of $F_c^0$ as $F_{c,ij}^0 \in \mathbb{R}^d$.…”

Section: 32 (mentioning)

confidence: 99%
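The construction described in the statement above (average pooling of T frame-level features into L basic moments, then an L×L×d candidate map) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the cited paper's implementation: the quoted text does not say how element (i, j) is computed, so the sketch assumes it is the mean of basic moments i through j, assumes T is divisible by L, and leaves entries with i > j at zero. The function names `basic_moments` and `candidate_map` are hypothetical.

```python
import numpy as np


def basic_moments(frame_feats: np.ndarray, num_moments: int) -> np.ndarray:
    """Average-pool (T, d) frame features into (L, d) basic moments.

    ASSUMPTION: T is divisible by L, so every moment pools T // L frames.
    """
    T, d = frame_feats.shape
    if T % num_moments != 0:
        raise ValueError("this sketch assumes T is divisible by L")
    return frame_feats.reshape(num_moments, T // num_moments, d).mean(axis=1)


def candidate_map(moments: np.ndarray) -> np.ndarray:
    """Build an (L, L, d) candidate map from (L, d) basic moments.

    ASSUMPTION: entry (i, j) with i <= j is the mean of moments i..j;
    entries with i > j (end before start) are left as zeros.
    """
    L, d = moments.shape
    # Prefix sums let each span mean be computed from two rows.
    prefix = np.concatenate([np.zeros((1, d)), np.cumsum(moments, axis=0)], axis=0)
    cmap = np.zeros((L, L, d))
    for i in range(L):
        for j in range(i, L):
            cmap[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return cmap


if __name__ == "__main__":
    feats = np.arange(16, dtype=float).reshape(8, 2)  # toy input: T = 8, d = 2
    moments = basic_moments(feats, num_moments=4)      # L = 4
    cmap = candidate_map(moments)
    print(moments.shape, cmap.shape)
```

Because L ≪ T, the quadratic L×L map stays small even for long videos, which is presumably why pooling precedes candidate construction.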

Dual Path Interaction Network for Video Moment Localization

Wang, Zha, Chen, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Video moment localization aims to localize a specific moment in a video by a natural language query. Previous works either use alignment information to find out the best-matching candidate (i.e., top-down approach) or use discrimination information to predict the temporal boundaries of the match (i.e., bottom-up approach). Little research has taken both the candidate-level alignment information and frame-level boundary information together and considers the complementarity between them. In this paper, we propose a unified top-down and bottom-up approach called Dual Path Interaction Network (DPIN), where the alignment and discrimination information are closely connected to jointly make the prediction. Our model includes a boundary prediction pathway encoding the frame-level representation and an alignment pathway extracting the candidate-level representation. The two branches of our network predict two complementary but different representations for moment localization. To enforce the consistency and strengthen the connection between the two representations, we propose a semantically conditioned interaction module. The experimental results on three popular benchmarks (i.e., TACoS, Charades-STA, and Activity-Caption) demonstrate that the proposed approach effectively localizes the relevant moment and outperforms the state-of-the-art approaches.

CCS Concepts: Information systems → Video search; Novelty in information retrieval.

“…CDC [16] predicts per-frame confidence scores using 3D convolutional neural networks. BSN [10] and BMN [9] adopt 2D convolutions to estimate actionness, starting time, and ending time at each frame. These methods can be applicable to informative channel identification by using their per-channel classification as a measure of channel importance.…”

Section: Related Work (mentioning)

confidence: 99%

Channel Embedding for Informative Protein Identification from Highly Multiplexed Images

Magid, Jang, Schapiro, et al. 2020

Preprint


Interest is growing rapidly in using deep learning to classify biomedical images, and interpreting these deep-learned models is necessary for life-critical decisions and scientific discovery. Effective interpretation techniques accelerate biomarker discovery and provide new insights into the etiology, diagnosis, and treatment of disease. Most interpretation techniques aim to discover spatially-salient regions within images, but few techniques consider imagery with multiple channels of information. For instance, highly multiplexed tumor and tissue images have 30-100 channels and require interpretation methods that work across many channels to provide deep molecular insights. We propose a novel channel embedding method that extracts features from each channel. We then use these features to train a classifier for prediction. Using this channel embedding, we apply an interpretation method to rank the most discriminative channels. To validate our approach, we conduct an ablation study on a synthetic dataset. Moreover, we demonstrate that our method aligns with biological findings on highly multiplexed images of breast cancer cells while outperforming baseline pipelines.
