2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020
DOI: 10.1109/cvpr42600.2020.01134
Noise-Aware Fully Webly Supervised Object Detection

Cited by 34 publications (14 citation statements). References 43 publications.
“…Large-scale image-text datasets crawled from the internet have been widely used in pre-training. As indicated by Carlini and Terzis (2021), Northcutt, Jiang, and Chuang (2021), and Shen et al. (2020), excessive noisy data negatively impacts the model's performance and training efficiency. ALIGN (Jia et al. 2021) and WenLan (Huo et al. 2021) demonstrate that large-scale pre-training with expensive resources can suppress the influence of noise to some extent, but such training resources are usually not available to general researchers.…”

Section: Ensemble Confident Learning
Confidence: 99%
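The confident-learning idea referenced above (Northcutt, Jiang, and Chuang 2021) can be illustrated with a minimal sketch: an example is flagged as likely mislabeled when some other class's predicted probability exceeds that class's average self-confidence over examples that carry its label. This is a simplified, hypothetical illustration of the general technique, not the cited authors' implementation; the function name and thresholding details are assumptions.

```python
import numpy as np

def find_label_issues(pred_probs, labels):
    """Flag likely-mislabeled examples, in the spirit of confident
    learning: an example is suspect when another class's predicted
    probability clears that class's self-confidence threshold while
    the given label's does not. Simplified sketch, not the cited
    authors' implementation."""
    n, k = pred_probs.shape
    # Per-class threshold: mean predicted probability of class j over
    # examples actually labeled j (average self-confidence).
    thresholds = np.array([
        pred_probs[labels == j, j].mean() if np.any(labels == j) else 1.0
        for j in range(k)
    ])
    issues = []
    for i in range(n):
        # Classes the model is "confident" about for this example.
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        # Confident about some class, but not the given label: flag it.
        if above.size > 0 and labels[i] not in above:
            issues.append(i)
    return np.array(issues, dtype=int)
```

In a noisy web-data pipeline, examples flagged this way would be dropped or down-weighted before (pre-)training, which is the filtering role the quoted passage attributes to such methods.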
“…First, as the scale of data expands, pre-training requires expensive training resources: CLIP (Radford et al. 2021) costs 3584 GPU-days and WenLan (Huo et al. 2021) costs 896 GPU-days, both on NVIDIA A100. Second, raw internet data is noisy (as shown in Appendix, Figure 7), which wastes training resources and severely degrades model performance (Algan and Ulusoy 2021; Carlini and Terzis 2021; Northcutt, Jiang, and Chuang 2021; Shen et al. 2020). Third, previous multi-modal pre-training methods only use limited image-text pairs while ignoring richer single-modal text data, which results in poor generalization to many downstream NLP tasks and scenes (Li et al. 2020).…”

Section: Introduction
Confidence: 99%
“…Furthermore, Carion et al. [40] regard object detection as a direct set prediction problem and propose an end-to-end detection framework based on the transformer encoder-decoder; Shen et al. [41] exploit a residual structure and spatially sensitive entropy to reduce, to a certain extent, the negative impact of web images with noisy labels; Dai et al. [42] propose a dynamic detection head framework by unifying attention and the object detection head; Wang et al. [43] propose a prediction-aware one-to-one label assignment to replace non-maximum suppression postprocessing and achieve performance comparable to NMS.…”

Section: Related Work
Confidence: 99%
“…However, the learning of novel knowledge is still limited by the scale of well-labeled training data. More recently, several methods [7]-[9] have instead considered utilizing web data as auxiliary information to enhance model performance on the source dataset. For example, Schroff et al. [10] utilize a multi-modal approach that combines text, metadata, and visual features to obtain candidate images from web pages.…”

Section: Introduction
Confidence: 99%