Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

Hou, Yunzhong; Zheng, Liang

doi:10.1145/3474085.3475310

Cited by 34 publications

(15 citation statements)

References 39 publications

“…They propose a new barebones model which addresses the poor generalization exhibited due to over-fitting individual scenes and camera configurations. Human detection in overlapping regions has also been investigated by Hou et al [7]. They proposed a new multi-view detector, MVDeTr, where the detector fuses multi-view information by introducing a shadow transformer.…”

Section: Related Work (mentioning)

confidence: 99%

Deep Learning Based Cross-View Human Detection System

Zhang

Yang

Gao

et al. 2023

J. Phys.: Conf. Ser.

Human detection is an important research branch in the field of computer vision, and it is widely used in vehicle-assisted driving, intelligent monitoring, intelligent transportation and other aspects. In order to solve the problems of poor real-time performance, low detection accuracy and inability to detect under occlusion in traditional human detection methods, this paper proposes a deep learning based trainable cross-view human detection system, where the encoder-decoder network focuses on solving the feature association problem after perspective transformation. The system was then tested on the WildTrack dataset. The results show that our cross-view human detection system outperforms conventional systems in terms of speed and accuracy across the board, achieving a satisfactory performance rate of 0.71 MODA in the presence of extensive occlusion.
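The abstract above reports its result as MODA (Multiple Object Detection Accuracy). As a minimal sketch, assuming the standard CLEAR-metrics definition MODA = 1 - (misses + false positives) / ground-truth count, the score can be computed as shown below. The function name `moda` and the counts in the example are illustrative assumptions, not figures from the cited paper.

```python
def moda(false_negatives: int, false_positives: int, num_ground_truth: int) -> float:
    """Multiple Object Detection Accuracy under the standard CLEAR definition.

    MODA = 1 - (FN + FP) / N_gt, where N_gt is the number of ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    return 1.0 - (false_negatives + false_positives) / num_ground_truth


# Illustrative counts only (hypothetical, not taken from the paper).
example_score = moda(false_negatives=20, false_positives=9, num_ground_truth=100)
```

Because both misses and false positives are penalized, MODA can be negative when a detector produces more errors than there are ground-truth objects.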

“…Recent work explicitly addressing multi-object detection using contemporary detection architectures is limited [13]- [16], [19], [20]. Nassar et al [19] apply a convolutional neural network that takes multi-view images and corresponding geolocation information as inputs and uses a joint loss function considering all views, resulting in an increase of the detection mAP by up to 27.8%.…”

Section: A Multi-view Object Detection (mentioning)

confidence: 99%

“…Furthermore, Liu et al [18] improve the detection accuracy with the Swin Transformer, adding multi-scale feature maps and reducing the ViT complexity from O(n²) to O(n) by implementing a shifted-window self-attention pattern. A recent work from Hou and Zheng [16] addresses multi-view pedestrian detection by using a DETR architecture with multi-view attention. In order to account for spatial consistency, they use a projective transform to the common ground plane.…”

Section: B Transformers For Object Detection (mentioning)

confidence: 99%
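The citation statement above mentions a projective transform to the common ground plane. A minimal sketch of that general idea, assuming a 3x3 homography per camera that maps pixel coordinates to ground-plane coordinates, could look like the following. The function name `project_to_ground` and the matrix `H_example` are hypothetical; in practice the homography would come from camera calibration, and this is not the cited authors' implementation.

```python
import numpy as np


def project_to_ground(H, pixels):
    """Map N x 2 pixel coordinates to ground-plane coordinates with a 3 x 3 homography."""
    pixels = np.asarray(pixels, dtype=float)
    ones = np.ones((pixels.shape[0], 1))
    homogeneous = np.hstack([pixels, ones])               # (N, 3) homogeneous points
    mapped = homogeneous @ np.asarray(H, dtype=float).T   # apply H to every point
    return mapped[:, :2] / mapped[:, 2:3]                 # divide by the last coordinate


# Hypothetical homography: scale pixels by 0.01 and shift by (1, 2).
H_example = np.array([[0.01, 0.0, 1.0],
                      [0.0, 0.01, 2.0],
                      [0.0, 0.0, 1.0]])
foot_points = [[100.0, 200.0], [300.0, 400.0]]
ground_points = project_to_ground(H_example, foot_points)
```

Points from different cameras that land near the same ground-plane location can then be treated as observations of the same person, which is what makes the shared plane useful for fusing views.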

“…This is of interest in scenarios where objects can be highly occluded in one view, but are clearer in another view, such as in multi-camera visual surveillance, autonomous vehicle sensing solutions and multi-view X-ray security screening. Although some works have addressed multi-view object detection [13]- [16], detailed consideration of this task remains fairly limited. Furthermore, the use of modern architectures based on attention has not been investigated thoroughly, leading us to propose a novel architecture based on a Transformer decoder that uses the feature representations across multiple concurrent views to improve detection accuracy.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-view Vision Transformers for Object Detection

Isaac-Medina

Willcocks

Breckon

2022

2022 26th International Conference on Pattern Recognition (ICPR)

Object detection has been thoroughly investigated during the last decade using deep neural networks. However, the inclusion of additional information given by multiple concurrent views of the same scene has not received much attention. In scenarios where objects may appear in obscure poses from certain view points, the use of differing simultaneous views can improve object detection. Therefore, we propose a multi-view fusion network to enrich the backbone features of standard object detection architectures across multiple source and target view points. Our method consists of a transformer decoder for the target view that combines the remaining source views feature maps. In this way, the feature representation of the target view can aggregate feature information from the source view through attention. Our architecture is detector-agnostic, meaning it can be applied across any existing detection backbone. We evaluate performance using YOLOX, Deformable DETR and Swin Transformer baseline detectors, comparing standard single view performance against the addition of our multi-view transformer architecture. Our method achieves a 3% increase of the COCO AP over a four view X-ray security dataset and a slight 0.7% increase on a seven view pedestrian dataset. We demonstrate that the integration of different views using attention-based networks improves the detection performance of multi-view datasets.

“…3D pose can be estimated [10,36] by merging 2D skeleton estimations from multiple 2D camera views, using a 3D regression network or graph matching. Meanwhile, multi-view person detection approaches [19,20,28,34] also utilize camera calibration to merge multiple 2D detections or features to generate more reliable 3D person detection results. The accuracy of these approaches heavily depends on the quality of the 2D person detection or 2D pose estimation.…”

Section: Related Work (mentioning)

confidence: 99%

MMPTRACK: Large-scale Densely Annotated Multi-camera Multiple People Tracking Benchmark

Han

You

Wang

et al. 2021

Preprint

Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout, because monocular multi-object tracking (MOT) systems often fail in cluttered and crowded environments due to occlusion. Multiple highly overlapped cameras can significantly alleviate the problem by recovering partial 3D information. However, the cost of creating a high-quality multi-camera tracking dataset with diverse camera settings and backgrounds has limited the dataset scale in this domain. In this paper, we provide a large-scale densely-labeled multi-camera tracking dataset in five different environments with the help of an auto-annotation system. The system uses overlapped and calibrated depth and RGB cameras to build a high-performance 3D tracker that automatically generates the 3D tracking results. The 3D tracking results are projected to each RGB camera view using camera parameters to create 2D tracking results. Then, we manually check and correct the 3D tracking results to ensure the label quality, which is much cheaper than fully manual annotation. We have conducted extensive experiments using two real-time multi-camera trackers and a person re-identification (ReID) model with different settings. This dataset provides a more reliable benchmark of multi-camera, multi-object tracking systems in cluttered and crowded environments. Also, our results demonstrate that adapting the trackers and ReID models on this dataset significantly improves their performance. Our dataset will be publicly released upon the acceptance of this work.
