2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020
DOI: 10.1109/cvpr42600.2020.01134
Noise-Aware Fully Webly Supervised Object Detection

Cited by 34 publications (14 citation statements). References 43 publications.
“…Large-scale image-text datasets crawled from the internet have been widely used in pre-training. As indicated by Carlini and Terzis (2021), Northcutt, Jiang, and Chuang (2021), and Shen et al. (2020), excessive noisy data negatively impacts the model's performance and training efficiency. ALIGN (Jia et al. 2021) and WenLan (Huo et al. 2021) demonstrate that large-scale pre-training with expensive resources can suppress the influence of noise to some extent, but such training resources are usually not available to general researchers.…”

Section: Ensemble Confident Learning
Confidence: 99%
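The confident-learning idea referenced above (Northcutt, Jiang, and Chuang 2021) can be illustrated with a minimal sketch: an example is flagged as likely mislabeled when some other class's predicted probability exceeds that class's average self-confidence over examples that carry its label. This is a simplified, hypothetical illustration of the general technique, not the cited authors' implementation; the function name and thresholding details are assumptions.

```python
import numpy as np

def find_label_issues(pred_probs, labels):
    """Flag likely-mislabeled examples, in the spirit of confident
    learning: an example is suspect when another class's predicted
    probability clears that class's self-confidence threshold while
    the given label's does not. Simplified sketch, not the cited
    authors' implementation."""
    n, k = pred_probs.shape
    # Per-class threshold: mean predicted probability of class j over
    # examples actually labeled j (average self-confidence).
    thresholds = np.array([
        pred_probs[labels == j, j].mean() if np.any(labels == j) else 1.0
        for j in range(k)
    ])
    issues = []
    for i in range(n):
        # Classes the model is "confident" about for this example.
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        # Confident about some class, but not the given label: flag it.
        if above.size > 0 and labels[i] not in above:
            issues.append(i)
    return np.array(issues, dtype=int)
```

In a noisy web-data pipeline, examples flagged this way would be dropped or down-weighted before (pre-)training, which is the filtering role the quoted passage attributes to such methods.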
“…First, as the scale of data expands, pre-training requires expensive training resources: CLIP (Radford et al. 2021) costs 3584 GPU-days and WenLan (Huo et al. 2021) costs 896 GPU-days, both on NVIDIA A100. Second, raw internet data is noisy (as shown in Appendix, Figure 7), which wastes training resources and severely degrades model performance (Algan and Ulusoy 2021; Carlini and Terzis 2021; Northcutt, Jiang, and Chuang 2021; Shen et al. 2020). Third, previous multi-modal pre-training methods only use limited image-text pairs while ignoring richer single-modal text data, which results in poor generalization to many downstream NLP tasks and scenes (Li et al. 2020).…”

Section: Introduction
Confidence: 99%
“…Furthermore, Carion et al. [40] regard object detection as a direct set prediction problem and propose an end-to-end detection framework based on the transformer encoder-decoder; Shen et al. [41] exploit a residual structure and spatially sensitive entropy to reduce, to a certain extent, the negative impact of web images with noisy labels; Dai et al. [42] propose a dynamic detection head framework by unifying attention and the object detection head; Wang et al. [43] propose a prediction-aware one-to-one label assignment to replace non-maximum suppression postprocessing and achieve performance comparable to NMS.…”

Section: Related Work
Confidence: 99%
“…However, the learning of novel knowledge is still limited by the scale of well-labeled training data. More recently, several methods [7]-[9] have instead considered utilizing web data as auxiliary information to enhance model performance on the source dataset. For example, Schroff et al. [10] utilize a multi-modal approach that combines text, metadata, and visual features to obtain candidate images from web pages.…”

Section: Introduction
Confidence: 99%