2022
DOI: 10.1109/access.2022.3184031
Video Sparse Transformer With Attention-Guided Memory for Video Object Detection

Abstract: Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generativ…

Cited by 5 publications (6 citation statements) · References 103 publications (108 reference statements)
“…The DETR model has been applied to many downstream tasks. The combination of attention mechanisms with Transformers has been applied in video object detection tasks and has achieved good results [27]. Ickler et al [28] discussed the feasibility of using the DETR model for volumetric medical object detection.…”
Section: End-to-end Object Detection With Transformers
confidence: 99%
“…DEFA showed the inefficiency of the First In First Out (FIFO) memory structure and proposed a diversity-aware memory, which uses object-level memory instead of frame-level memory for the attention module. VSTAM [107] improves feature quality on an element-by-element basis and then performs sparse aggregation before these enhanced features are used for object candidate region detection. The model also incorporates external memory to take advantage of long-term contextual information.…”
Section: Spatio-temporal Information
confidence: 99%
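The external memory described in the passage above can be sketched roughly as follows. This is an illustrative assumption, not VSTAM's or DEFA's actual implementation: `FrameMemory`, its capacity, and the flat FIFO layout are hypothetical, and the sketch only shows how stored frame features could be read back as extra reference pixels for attention-based aggregation.

```python
from collections import deque

import numpy as np


class FrameMemory:
    """Hypothetical fixed-size FIFO store of per-frame feature pixels.

    Illustrates the frame-level memory the passage contrasts with DEFA's
    diversity-aware, object-level memory; the real models are more elaborate.
    """

    def __init__(self, capacity=8):
        # deque(maxlen=...) evicts the oldest frame automatically (FIFO).
        self.frames = deque(maxlen=capacity)

    def write(self, feats):
        """Store one frame's (num_pixels, channels) feature array."""
        self.frames.append(feats)

    def read(self):
        """Return all stored feature pixels stacked as (total_pixels, channels)."""
        return np.concatenate(list(self.frames), axis=0) if self.frames else None
```

An object-level memory in the style of DEFA would instead store pooled per-object vectors, trading background context for diversity across stored entries.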
“…However, because they are based on two-stage object detectors such as Faster-RCNN [18], their performance heavily depends on the quality of the initial object suggestions extracted from a region proposal network (RPN). To address this shortcoming, pixel-level attention methods have been investigated [11], [19]. They perform pixel-level attention between the feature pixels of the current image and those of the reference image, such that each current feature pixel has more pertinent information and makes a better region proposal.…”
Section: Introduction
confidence: 99%
“…They perform pixel-level attention between the feature pixels of the current image and those of the reference image, such that each current feature pixel has more pertinent information and makes a better region proposal. Some methods [12], [19] leverage a sparse style of pixel-level attention to reduce computation. However, pixel-level attention-based methods still suffer from a low processing speed because of the computation of a large number of feature pixels generated per image.…”
Section: Introduction
confidence: 99%
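A minimal sketch of the sparse pixel-level attention the quoted passage refers to, assuming dot-product similarity and a top-k sparsification rule (the function name, `top_k` parameter, and NumPy formulation are illustrative assumptions, not the method of [12] or [19]):

```python
import numpy as np


def sparse_pixel_attention(query_feats, ref_feats, top_k=4):
    """Attend each current-frame pixel to only its top-k reference pixels.

    query_feats: (N, C) feature pixels of the current frame
    ref_feats:   (M, C) feature pixels of a reference frame
    Returns an (N, C) array of aggregated reference features.
    """
    # Scaled dot-product similarity between every query/reference pixel pair.
    scores = query_feats @ ref_feats.T / np.sqrt(query_feats.shape[1])  # (N, M)
    # Sparsify: keep only the top-k highest-scoring reference pixels per query.
    idx = np.argpartition(-scores, top_k - 1, axis=1)[:, :top_k]        # (N, k)
    kept = np.take_along_axis(scores, idx, axis=1)                      # (N, k)
    # Softmax over the k retained scores only.
    w = np.exp(kept - kept.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of the selected reference pixels.
    gathered = ref_feats[idx]                                           # (N, k, C)
    return (w[..., None] * gathered).sum(axis=1)                        # (N, C)
```

Pruning to k reference pixels reduces the aggregation work per query pixel from M weighted terms to k; a real sparse-attention implementation would also avoid materializing the full N×M score matrix, which this sketch still computes for clarity.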