2020
DOI: 10.48550/arxiv.2004.00163
Preprint

Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Abstract: The weakly-supervised action localization problem requires training a model to localize the action segments in a video given only the video-level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is to identify which key instances within the bag trigger the bag's label. Most previous models use an attention-based approach. These models use attention to generate b…
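Since only the video-level label is observed, the EM view of MIL alternates between estimating which segments are key instances (E-step) and refitting the segment scorer on those estimates (M-step). Below is a minimal PyTorch sketch of that alternation under simplifying assumptions; SegmentScorer, e_step, m_step, and the threshold tau are hypothetical names and choices, not the paper's exact formulation.

```python
# A minimal sketch (not the paper's exact architecture) of EM-style
# Multiple Instance Learning for weakly-supervised action localization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentScorer(nn.Module):
    """Scores each segment (instance) of a video (bag) per class."""
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):          # feats: (T, feat_dim)
        return self.fc(feats)          # logits: (T, num_classes)

def e_step(model, feats, video_label, tau=0.5):
    """E-step: estimate which segments are key instances for the
    video-level label, producing hard per-segment pseudo labels."""
    with torch.no_grad():
        probs = torch.sigmoid(model(feats))[:, video_label]  # (T,)
        return (probs > tau).float()   # 1 = likely action segment

def m_step(model, optimizer, feats, video_label, pseudo):
    """M-step: update the scorer so that segment predictions match
    the pseudo labels estimated in the E-step."""
    logits = model(feats)[:, video_label]
    loss = F.binary_cross_entropy_with_logits(logits, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: one video with T=50 segments, class index 3.
model = SegmentScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(50, 1024)
for _ in range(5):                     # alternate E- and M-steps
    pseudo = e_step(model, feats, video_label=3)
    m_step(model, opt, feats, video_label=3, pseudo=pseudo)
```

In practice the M-step would also carry a video-level classification loss, so the model cannot collapse to labeling every segment as background; this sketch omits it for brevity.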

Cited by 11 publications (13 citation statements) · References 39 publications
“…DGAM [116] addressed the action-context confusion by imposing different attentions on different features with a generative model. EM-MIL [165] employed Expectation-Maximization to capture complete action instances and outperformed DGAM [116] on the THUMOS14 dataset. Direct Action Localization.…”
Section: Methods With Limited Supervision
Mentioning confidence: 99%
“…DGAM [116] / EM-MIL [165]: + Conditional VAE / Expectation-Maximization to separate actions from context frames and capture complete action instances. - Do not model temporal dependencies and relations between sub-actions.…”
Section: Generative Model
Mentioning confidence: 99%
“…Some methods have noticed the importance of foreground-action consistency. For example, RefineLoc [29] generated snippet-level hard pseudo labels by expanding previous detection results, TSCN [47] generated pseudo ground truth from the foreground attention sequence, and EM-MIL [21] put pseudo-label generation into an expectation-maximization framework. Some methods [27,24,10] have also attempted to use class activations as top-down supervision to guide foreground attention generation.…”
Section: Related Work
Mentioning confidence: 99%
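As a concrete illustration of the pseudo-label idea in the statement above, here is a small, hypothetical sketch that converts a foreground attention sequence into snippet-level hard pseudo labels; the threshold and the minimum-run filter are assumptions, not the actual procedure of TSCN or EM-MIL.

```python
# Hypothetical sketch: foreground attention -> snippet-level hard pseudo labels.
import torch

def attention_to_pseudo_labels(attention, threshold=0.5, min_len=2):
    """attention: (T,) foreground scores in [0, 1] for T snippets.
    Returns a (T,) 0/1 tensor marking snippets treated as foreground."""
    hard = (attention > threshold).long()
    labels = hard.clone()
    # Suppress isolated positive runs shorter than min_len snippets.
    t, T = 0, hard.numel()
    while t < T:
        if hard[t] == 1:
            start = t
            while t < T and hard[t] == 1:
                t += 1
            if t - start < min_len:
                labels[start:t] = 0
        else:
            t += 1
    return labels

att = torch.tensor([0.1, 0.8, 0.9, 0.2, 0.7, 0.1])
print(attention_to_pseudo_labels(att))  # tensor([0, 1, 1, 0, 0, 0])
```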
“…For example, to decrease the intra-class variance, 3C-Net [28] and A2CL-PT [27] maintain a set of class centers and RPN [12] learns class-specific prototypes. The third type of works is based on a class-agnostic attention mechanism [29,30,16,34,10,52,25], which can address both challenges simultaneously. Unlike the second type of works, the attention here is generated in a bottom-up way from the raw data and trained to highlight foreground segments.…”
Section: Related Work
Mentioning confidence: 99%
“…And based on the observation that background features differ from action features, DGAM [34] adopts a conditional variational auto-encoder to construct different feature distributions conditioned on the attention. Recently, TSCN [52] and EM-MIL [25] fuse the outputs of different modalities (RGB and optical flow) to generate pseudo labels that guide the attention.…”
Section: Related Work
Mentioning confidence: 99%
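A minimal sketch of the two-stream fusion idea in this statement, under assumed details: averaging the RGB and optical-flow attention sequences before thresholding, so that each stream can veto the other's false positives. This is illustrative, not the exact TSCN or EM-MIL procedure.

```python
# Hypothetical sketch: fuse RGB and optical-flow attention into pseudo labels.
import torch

def fuse_streams(att_rgb, att_flow, threshold=0.5):
    """att_rgb, att_flow: (T,) attention from each modality.
    A snippet becomes foreground only if both streams broadly agree."""
    fused = 0.5 * (att_rgb + att_flow)
    return (fused > threshold).long()   # (T,) snippet pseudo labels

rgb  = torch.tensor([0.9, 0.6, 0.2, 0.8])
flow = torch.tensor([0.7, 0.3, 0.1, 0.9])
print(fuse_streams(rgb, flow))          # tensor([1, 0, 0, 1])
```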