2022
DOI: 10.1109/access.2022.3184031
Video Sparse Transformer With Attention-Guided Memory for Video Object Detection

Abstract: Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generativ…

Cited by 5 publications (6 citation statements) · References 103 publications (108 reference statements)
“…The DETR model has been applied to many downstream tasks. The combination of attention mechanisms with Transformers has been applied in video object detection tasks and has achieved good results [27]. Ickler et al [28] discussed the feasibility of using the DETR model for volumetric medical object detection.…”
Section: End-to-end Object Detection With Transformers
confidence: 99%
“…DEFA showed the inefficiency of the First In First Out (FIFO) memory structure and proposed a diversity-aware memory, which uses object-level memory instead of frame-level memory for the attention module. VSTAM [107] improves feature quality on an element-by-element basis and then performs sparse aggregation before these enhanced features are used for object candidate region detection. The model also incorporates external memory to take advantage of long-term contextual information.…”
Section: Spatio-temporal Information
confidence: 99%
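The external memory described in the passage above can be sketched roughly as follows. This is an illustrative assumption, not VSTAM's or DEFA's actual implementation: `FrameMemory`, its capacity, and the flat FIFO layout are hypothetical, and the sketch only shows how stored frame features could be read back as extra reference pixels for attention-based aggregation.

```python
from collections import deque

import numpy as np


class FrameMemory:
    """Hypothetical fixed-size FIFO store of per-frame feature pixels.

    Illustrates the frame-level memory the passage contrasts with DEFA's
    diversity-aware, object-level memory; the real models are more elaborate.
    """

    def __init__(self, capacity=8):
        # deque(maxlen=...) evicts the oldest frame automatically (FIFO).
        self.frames = deque(maxlen=capacity)

    def write(self, feats):
        """Store one frame's (num_pixels, channels) feature array."""
        self.frames.append(feats)

    def read(self):
        """Return all stored feature pixels stacked as (total_pixels, channels)."""
        return np.concatenate(list(self.frames), axis=0) if self.frames else None
```

An object-level memory in the style of DEFA would instead store pooled per-object vectors, trading background context for diversity across stored entries.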
“…However, because they are based on two-stage object detectors such as Faster-RCNN [18], their performance heavily depends on the quality of the initial object suggestions extracted from a region proposal network (RPN). To address this shortcoming, pixel-level attention methods have been investigated [11], [19]. They perform pixel-level attention between the feature pixels of the current image and those of the reference image, such that each current feature pixel has more pertinent information and makes a better region proposal.…”
Section: Introduction
confidence: 99%
“…They perform pixel-level attention between the feature pixels of the current image and those of the reference image, such that each current feature pixel has more pertinent information and makes a better region proposal. Some methods [12], [19] leverage a sparse style of pixel-level attention to reduce computation. However, pixel-level attention-based methods still suffer from a low processing speed because of the computation of a large number of feature pixels generated per image.…”
Section: Introduction
confidence: 99%
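A minimal sketch of the sparse pixel-level attention the quoted passage refers to, assuming dot-product similarity and a top-k sparsification rule (the function name, `top_k` parameter, and NumPy formulation are illustrative assumptions, not the method of [12] or [19]):

```python
import numpy as np


def sparse_pixel_attention(query_feats, ref_feats, top_k=4):
    """Attend each current-frame pixel to only its top-k reference pixels.

    query_feats: (N, C) feature pixels of the current frame
    ref_feats:   (M, C) feature pixels of a reference frame
    Returns an (N, C) array of aggregated reference features.
    """
    # Scaled dot-product similarity between every query/reference pixel pair.
    scores = query_feats @ ref_feats.T / np.sqrt(query_feats.shape[1])  # (N, M)
    # Sparsify: keep only the top-k highest-scoring reference pixels per query.
    idx = np.argpartition(-scores, top_k - 1, axis=1)[:, :top_k]        # (N, k)
    kept = np.take_along_axis(scores, idx, axis=1)                      # (N, k)
    # Softmax over the k retained scores only.
    w = np.exp(kept - kept.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of the selected reference pixels.
    gathered = ref_feats[idx]                                           # (N, k, C)
    return (w[..., None] * gathered).sum(axis=1)                        # (N, C)
```

Pruning to k reference pixels reduces the aggregation work per query pixel from M weighted terms to k; a real sparse-attention implementation would also avoid materializing the full N×M score matrix, which this sketch still computes for clarity.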