2020 · Preprint
DOI: 10.48550/arxiv.2008.05787

Shift Equivariance in Object Detection

Abstract: Robustness to small image translations is a highly desirable property for object detectors. However, recent works have shown that CNN-based classifiers are not shift invariant. It is unclear to what extent this could impact object detection, mainly because of the architectural differences between the two and the dimensionality of the prediction space of modern detectors. To assess shift equivariance of object detection models end-to-end, in this paper we propose an evaluation metric, built upon a greedy search…
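The abstract's core idea, evaluating a detector end-to-end under small input shifts and greedily matching the two sets of predictions, can be sketched as follows. This is a minimal illustration, not the paper's exact metric: `detector` is a hypothetical callable returning an [N, 4] array of (x1, y1, x2, y2) boxes, and circular shifting via `np.roll` stands in for whatever translation the authors actually apply.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def shift_consistency(detector, image, dx, dy):
    """Greedily match detections on the original image against detections
    on a (dx, dy)-shifted copy, after shifting the latter back into the
    original frame. Returns the mean IoU of matched pairs; 1.0 would
    indicate perfectly shift-equivariant predictions."""
    boxes_a = detector(image)                          # [N, 4] boxes
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    boxes_b = detector(shifted) - np.array([dx, dy, dx, dy])
    unmatched = list(range(len(boxes_b)))
    scores = []
    for a in boxes_a:
        if not unmatched:
            break
        # Greedy step: take the best remaining counterpart for this box.
        j = max(unmatched, key=lambda k: iou(a, boxes_b[k]))
        unmatched.remove(j)
        scores.append(iou(a, boxes_b[j]))
    return float(np.mean(scores)) if scores else 0.0
```

Averaging this score over many images and many small (dx, dy) offsets would give one scalar summary of how stable a detector's outputs are under translation.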

Cited by 3 publications (3 citation statements) · References 22 publications
“…The work most relevant to ours [7] is from the NLP field: the authors found universally transferable sparse matching subnetworks (at 40% to 90% sparsity) from the pre-trained initialization of BERT models. While we find their work inspiring, we stress that transplanting their NLP findings to our CV models is highly nontrivial due to multiple barriers: (1) pre-training BERT uses only a single self-supervised objective, the "masked language model" (MLM) [16], while pre-training CV models spans a wide variety of popular options, from the supervised fashion [40] to self-supervision with numerous objectives [17,34,8]; (2) BERT models consist of self-attention and fully-connected sub-layers, differing greatly from the standard convolutional architectures in CV; (3) further complicating the issue, different CV downstream tasks are known to rely on different priors and invariances; for example, while classification often calls on shift invariance, detection assumes location shift equivariance [70,49]. This calls into question the feasibility of asking one mask to transfer among them all.…”
Section: Related Work (mentioning)
confidence: 99%
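The invariance-versus-equivariance distinction invoked above can be made concrete in a few lines of PyTorch. This is an illustrative sketch under idealized conditions (circular padding and circular shifts); real networks with zero padding and strided downsampling do not behave this cleanly, which is precisely the robustness problem at issue.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 3, 64, 64)
x1 = torch.roll(x, shifts=1, dims=3)  # shift the input right by one pixel

# Stride-1 conv with circular padding commutes with circular shifts:
# the feature map shifts along with the input (shift EQUIVARIANCE).
conv = nn.Conv2d(3, 8, 3, padding=1, padding_mode="circular", bias=False)
f, f1 = conv(x), conv(x1)
print(torch.allclose(torch.roll(f, shifts=1, dims=3), f1, atol=1e-6))  # True

# Global average pooling then discards location entirely, so a pooled
# classification feature is unchanged by the shift (shift INVARIANCE).
print(torch.allclose(f.mean(dim=(2, 3)), f1.mean(dim=(2, 3)), atol=1e-6))  # True

# Striding breaks this: a one-pixel input shift cannot be expressed as any
# integer shift of a stride-2 output grid, so equivariance is lost.
strided = nn.Conv2d(3, 8, 3, stride=2, padding=1,
                    padding_mode="circular", bias=False)
print((strided(x) - strided(x1)).abs().max().item())  # clearly nonzero
```

A classifier can tolerate the pooled-away shift, but a detector must localize, so it depends on the equivariant behavior that striding disturbs.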
“…Image translations as small as one pixel can result in a radically different image representation at the deepest layers of state-of-the-art CNNs [2], which means that CNNs can struggle to generalize to the wide range of translations seen in video data. In fact, small translations of input images can be effective adversarial attacks on CNNs [2,11,29]. It is also important to note that CNNs are often trained on datasets like ImageNet [10] that have demonstrable location bias: the photographed objects' locations are not equally distributed throughout the dataset, and traditional data augmentation strategies do not sufficiently address this problem [2,11].…”
Section: Related Work (mentioning)
confidence: 99%
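A drift like the one described can be probed directly. The sketch below is a hypothetical probe, not the cited papers' protocol: it feeds a random tensor (standing in for a normalized image) to a torchvision ResNet-18 backbone and compares the deepest pooled features before and after a one-pixel shift; a cosine similarity below 1.0 shows the representation moved.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()
backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop the fc head

x = torch.randn(1, 3, 224, 224)       # stand-in for a normalized input image
x1 = torch.roll(x, shifts=1, dims=3)  # one-pixel horizontal shift

with torch.no_grad():
    f, f1 = backbone(x).flatten(1), backbone(x1).flatten(1)
# 1.0 would mean the deepest representation ignored the shift entirely.
print(F.cosine_similarity(f, f1).item())
```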
“…assumption does not always hold. For instance, an autonomous car should not flip predictions for the same object between consecutive video frames due to a marginal spatial shift or image noise [15], [16]. Recently, aliasing has been identified as one of the main reasons behind CNNs' lack of robustness, especially against small image transformations such as shift [2], [3].…”
Section: Introduction (mentioning)
confidence: 99%
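A common remedy for the aliasing mentioned here is to low-pass filter before every subsampling step, in the spirit of the anti-aliased networks the statement cites as [2], [3]. The module below is a minimal sketch of that idea (a fixed binomial blur followed by strided sampling), not a drop-in reproduction of any cited implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Blur, then subsample: the fixed low-pass filter removes the high
    frequencies that plain striding would otherwise alias."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)          # 3x3 binomial kernel
        k = (k / k.sum())[None, None]  # shape (1, 1, 3, 3)
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).clone())
        self.stride = stride
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")  # avoid border artifacts
        return F.conv2d(x, self.kernel, stride=self.stride,
                        groups=self.channels)

# Usage: replace a strided op with its dense version plus BlurPool2d,
# e.g. MaxPool2d(2) -> MaxPool2d(2, stride=1) followed by BlurPool2d(c).
pool = nn.Sequential(nn.MaxPool2d(2, stride=1), BlurPool2d(channels=8))
print(pool(torch.randn(1, 8, 32, 32)).shape)  # torch.Size([1, 8, 16, 16])
```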