2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01327
Relaxed Transformer Decoders for Direct Action Proposal Generation

Cited by 123 publications (59 citation statements)
References 30 publications
“…A two-stage approach for TAL first generates candidate video segments as action proposals, then classifies the proposals into action categories and refines their temporal boundaries. Several previous works focused on action proposal generation, either by classifying anchor windows [8,9,22] or by detecting action boundaries [26,36,38,47,84], and more recently by using a graph representation [4,76] or Transformers [13,59,67]. Others have integrated proposal generation and classification into a single model [14,55,56,85].…”
Section: Related Work
confidence: 99%
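The two-stage pipeline described in this statement can be summarized in a minimal sketch. The module names (ProposalGenerator, ProposalClassifier), the anchor-based head, and all tensor shapes below are illustrative assumptions, not code from the cited papers.

```python
# Minimal sketch of a two-stage temporal action localization (TAL) pipeline:
# stage 1 generates action proposals, stage 2 classifies them and refines
# their temporal boundaries. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class ProposalGenerator(nn.Module):
    """Stage 1: score candidate segments (hypothetical anchor-window head)."""
    def __init__(self, feat_dim: int, num_anchors: int):
        super().__init__()
        # Per anchor per time step: (actionness score, start offset, end offset).
        self.head = nn.Conv1d(feat_dim, num_anchors * 3, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim, T) clip-level video features.
        B, _, T = feats.shape
        return self.head(feats).view(B, -1, 3, T)  # (B, num_anchors, 3, T)

class ProposalClassifier(nn.Module):
    """Stage 2: classify pooled proposal features and refine boundaries."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.reg = nn.Linear(feat_dim, 2)  # start/end refinement offsets

    def forward(self, proposal_feats: torch.Tensor):
        # proposal_feats: (num_proposals, feat_dim), e.g. from segment pooling.
        return self.cls(proposal_feats), self.reg(proposal_feats)
```

A single-model variant, as in the works cited at the end of the statement, would fuse these two stages so proposals and class scores are produced jointly.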
“…In practice, decoder-only models in vision are mostly used to autoregressively decode captions describing visual input data. Prompting with video frames would mix two very different representations in the decoder, so CNNs substitute for the encoder to provide context [51], [52], [53]. Both encoder-only and decoder-only layers consist of SA and FF sublayers, interleaved with Add+Norm after each of them.…”
Section: Transformer Trends Adopted for Video
confidence: 99%
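As a concrete illustration of the SA/FF sublayer structure this statement describes, below is a minimal post-norm transformer layer. The post-norm ordering and all hyperparameters are assumptions for illustration, not taken from the cited works.

```python
# Minimal sketch of one transformer layer: a self-attention (SA) sublayer and
# a feed-forward (FF) sublayer, each followed by Add+Norm (post-norm ordering).
# Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.sa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) sequence of token embeddings.
        attn_out, _ = self.sa(x, x, x)   # SA sublayer
        x = self.norm1(x + attn_out)     # Add + Norm
        x = self.norm2(x + self.ff(x))   # FF sublayer, then Add + Norm
        return x
```

A decoder-only layer differs only in that the SA sublayer is causally masked (and, in an encoder-decoder, a cross-attention sublayer is interleaved as well).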
“…Instead, relative positional embeddings (RPE) signal the position of one token relative to another, and can also be fixed or learned (see [73] for more details). Lately there has been a growing number of works adopting RPE [12], [52], [60], [74]. We will discuss this in more detail in Sec.…”
Section: Transformer Trends Adopted for Video
confidence: 99%
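To make the RPE idea concrete: one common learned variant adds a per-head bias b[i−j], looked up from a trainable table, to the attention logit between query position i and key position j. The sketch below illustrates this; the table size and indexing scheme are illustrative assumptions, not the scheme of any specific cited work.

```python
# Minimal sketch of a learned relative positional bias: a trainable bias b[i-j]
# is added to the attention logit between query position i and key position j.
# Table size and indexing are illustrative assumptions.
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    def __init__(self, max_len: int, n_heads: int):
        super().__init__()
        # One learnable bias per head for each offset in [-(max_len-1), max_len-1].
        self.bias = nn.Parameter(torch.zeros(2 * max_len - 1, n_heads))
        self.max_len = max_len

    def forward(self, T: int) -> torch.Tensor:
        # Relative offsets i - j for all query/key pairs, shifted to be >= 0.
        pos = torch.arange(T)
        rel = pos[:, None] - pos[None, :] + self.max_len - 1  # (T, T)
        return self.bias[rel].permute(2, 0, 1)  # (n_heads, T, T)
```

The returned (n_heads, T, T) tensor would be added to the attention score matrix before the softmax; because the bias depends only on the offset i−j, not on absolute positions, it encodes relative rather than absolute position.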