Weilin Huang scite author profile

Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning

A family of loss functions built on pair-based computation has been proposed in the literature, providing a myriad of solutions for deep metric learning. In this paper, we provide a general weighting framework for understanding recent pair-based loss functions. Our contributions are three-fold: (1) we establish a General Pair Weighting (GPW) framework, which casts the sampling problem of deep metric learning into a unified view of pair weighting through gradient analysis, providing a powerful tool for understanding recent pair-based loss functions; (2) we show that with GPW, various existing pair-based methods can be compared and discussed comprehensively, with clear differences and key limitations identified; (3) we propose a new loss called multi-similarity loss (MS loss) under the GPW, which is implemented in two iterative steps (i.e., mining and weighting). This allows it to fully consider three similarities for pair weighting, providing a more principled approach for collecting and weighting informative pairs. Finally, the proposed MS loss obtains new state-of-the-art performance on four image retrieval benchmarks, where it outperforms the most recent approaches, such as ABE [14] and HTL [4], by a large margin, e.g., 80.9% → 88.0% on the In-Shop Clothes Retrieval dataset at Recall@1. Code is available at https://github.com/MalongTech/research-ms-loss. arXiv:1904.06627v3 [cs.CV]
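The two steps named in the abstract (mining, then weighting) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' released code: the function name `ms_loss`, the list-of-lists similarity matrix, and the default values of `alpha`, `beta`, `lam`, and `eps` are all assumptions chosen for readability.

```python
import math

def ms_loss(sim, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    """Sketch of a multi-similarity style loss over one batch.

    sim    : square list of lists, sim[i][k] = similarity of samples i and k
    labels : class label per sample
    alpha, beta, lam, eps : illustrative hyperparameters (assumed values)
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        pos = [sim[i][k] for k in range(n) if k != i and labels[k] == labels[i]]
        neg = [sim[i][k] for k in range(n) if labels[k] != labels[i]]
        if not pos or not neg:
            continue
        # Mining step: keep a negative only if it is nearly as similar as the
        # least similar positive, and a positive only if it is nearly as
        # dissimilar as the most similar negative.
        hard_neg = [s for s in neg if s + eps > min(pos)]
        hard_pos = [s for s in pos if s - eps < max(neg)]
        if not hard_pos or not hard_neg:
            continue
        # Weighting step: soft (log-sum-exp style) penalties, so harder pairs
        # contribute larger gradients than easy ones.
        pos_term = math.log1p(sum(math.exp(-alpha * (s - lam)) for s in hard_pos)) / alpha
        neg_term = math.log1p(sum(math.exp(beta * (s - lam)) for s in hard_neg)) / beta
        total += pos_term + neg_term
    return total / n

labels = [0, 0, 1, 1]
# Well-separated batch: every pair is filtered out by mining.
easy = [[1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.9],
        [0.1, 0.1, 0.9, 1.0]]
# Confused batch: negatives are more similar than positives.
hard = [[1.0, 0.4, 0.6, 0.6],
        [0.4, 1.0, 0.6, 0.6],
        [0.6, 0.6, 1.0, 0.4],
        [0.6, 0.6, 0.4, 1.0]]
print(ms_loss(easy, labels), ms_loss(hard, labels))
```

In this sketch the well-separated batch contributes no loss because mining discards all of its pairs, while the confused batch is penalized on both its positive and negative pairs.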

Detecting Text in Natural Image with Connectionist Text Proposal Network

Tian, Huang, et al. 2016

766

450

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of the image, making it powerful enough to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods that require multi-step post-filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin. The CTPN is computationally efficient at 0.14s/image, using the very deep VGG16 model [27]. An online demo is available at: http://textdet.com/.
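The vertical anchor idea can be illustrated with a small sketch: every anchor in a feature-map column shares the same fixed width, and only the height varies. The abstract does not give the concrete numbers, so the stride of 16 pixels, the 10 anchors per column, the 11-pixel minimum height, and the 0.7 scale step below are assumptions, as is the helper name `vertical_anchors`.

```python
def vertical_anchors(x_index, y_center, stride=16, num_anchors=10,
                     min_height=11.0, scale=0.7):
    """Return (x1, y1, x2, y2) boxes for one fixed-width column.

    x_index  : column index on the feature map
    y_center : vertical center of the anchors, in image pixels
    All default values are illustrative assumptions.
    """
    x1 = x_index * stride
    x2 = x1 + stride - 1          # every anchor spans exactly `stride` pixels
    boxes = []
    height = min_height
    for _ in range(num_anchors):
        boxes.append((x1, y_center - height / 2.0, x2, y_center + height / 2.0))
        height /= scale           # each anchor is taller than the previous one
    return boxes

anchors = vertical_anchors(x_index=3, y_center=100.0)
for box in anchors[:3]:
    print(box)
```

Because the horizontal extent is fixed, a model built this way only has to regress the vertical position and height of each proposal, along with a text/non-text score.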

Deep Metric Learning with Hierarchical Triplet Loss

et al. 2018

We present a novel hierarchical triplet loss (HTL) capable of automatically collecting informative training samples (triplets) via a defined hierarchical tree that encodes global context information. This allows us to cope with the main limitation of random sampling in training a conventional triplet loss, which is a central issue for deep metric learning. Our main contributions are two-fold. (i) We construct a hierarchical class-level tree where neighboring classes are merged recursively. The hierarchical structure naturally captures the intrinsic data distribution over the whole dataset. (ii) We formulate the problem of triplet collection by introducing a new violate margin, which is computed dynamically based on the designed hierarchical tree. This allows it to automatically select meaningful hard samples guided by global context. It encourages the model to learn more discriminative features from visually similar classes, leading to faster convergence and better performance. Our method is evaluated on the tasks of image retrieval and face recognition, where it outperforms the standard triplet loss substantially, by 1% to 18%. It achieves new state-of-the-art performance on a number of benchmarks, with far fewer learning iterations.
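As a rough illustration of how a dynamically computed margin plugs into a triplet loss, consider the sketch below. The abstract does not state the margin formula, so `dynamic_margin` and its arguments (a constant base margin, the distance at which two classes merge in the tree, and an intra-class spread) are assumptions made for illustration only.

```python
def triplet_loss(d_ap, d_an, margin):
    """Standard hinge-style triplet loss on anchor-positive and
    anchor-negative distances."""
    return max(0.0, d_ap - d_an + margin)

def dynamic_margin(base_margin, merge_distance, intra_class_distance):
    """Hypothetical violate margin: grows with the tree distance at which the
    anchor class and the negative class are merged, and shrinks with the
    spread of the anchor class."""
    return base_margin + merge_distance - intra_class_distance

# Classes that merge low in the tree (visually similar) get a small margin;
# classes that merge near the root get a larger one.
near = dynamic_margin(0.1, merge_distance=0.3, intra_class_distance=0.2)
far = dynamic_margin(0.1, merge_distance=0.9, intra_class_distance=0.2)
print(triplet_loss(0.5, 0.8, near), triplet_loss(0.5, 0.8, far))
```

With the same pair of distances, the triplet involving the distant class is still penalized under the larger margin, while the triplet involving the nearby class already satisfies its smaller one.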

Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees

2014

Maximally Stable Extremal Regions (MSERs) have achieved great success in scene text detection. However, this low-level pixel operation inherently limits its capability for handling complex text information efficiently (e.g., connections between text or background components), making it difficult to distinguish text from background components. In this paper, we propose a novel framework to tackle this problem by leveraging the high capability of convolutional neural networks (CNNs). In contrast to recent methods using a set of low-level heuristic features, the CNN is capable of learning high-level features to robustly identify text components from text-like outliers (e.g., bikes, windows, or leaves). Our approach takes advantage of both MSERs and sliding-window based methods. The MSERs operator dramatically reduces the number of windows scanned and enhances detection of low-quality text, while the sliding window with CNN is applied to correctly separate the connections of multiple characters in components. The proposed system achieved strong robustness against a number of extreme text variations and serious real-world problems. It was evaluated on the ICDAR 2011 benchmark dataset and achieved over 78% in F-measure, which is significantly higher than previous methods.

CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images

et al. 2018

We present a simple yet efficient approach capable of training deep neural networks on large-scale weakly-supervised web images, which are crawled raw from the Internet by using text queries, without any human annotation. We develop a principled learning strategy by leveraging curriculum learning, with the goal of handling a massive amount of noisy labels and data imbalance effectively. We design a new learning curriculum by measuring the complexity of data using its distribution density in a feature space, and rank the complexity in an unsupervised manner. This allows for an efficient implementation of curriculum learning on large-scale web images, resulting in a high-performance CNN model, where the negative impact of noisy labels is reduced substantially. Importantly, we show by experiments that images with highly noisy labels can, surprisingly, improve the generalization capability of the model by serving as a form of regularization. Our approach obtains state-of-the-art performance on four benchmarks: WebVision, ImageNet, Clothing-1M and Food-101. With an ensemble of multiple models, we achieved a top-5 error rate of 5.2% on the WebVision challenge [18] for 1000-category classification. This result was the top performance by a wide margin, outperforming second place by nearly 50% in relative error rate. Code and models are available at: https://github.com/MalongTech/CurriculumNet.
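Ranking data complexity by distribution density can be sketched as follows. This is a simplified illustration rather than the released CurriculumNet code: the neighbor-count density, the `cutoff` radius, and the equal-sized split into subsets are assumptions, and the helper names are hypothetical.

```python
import math

def local_density(points, cutoff):
    """For each feature vector, count how many others lie within `cutoff`."""
    densities = []
    for i, p in enumerate(points):
        count = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(p, q) < cutoff)
        densities.append(count)
    return densities

def curriculum_subsets(points, cutoff, num_subsets=3):
    """Split sample indices into subsets ordered from dense (assumed clean,
    easy) to sparse (assumed noisy, hard)."""
    densities = local_density(points, cutoff)
    order = sorted(range(len(points)), key=lambda i: -densities[i])
    size = math.ceil(len(order) / num_subsets)
    return [order[s:s + size] for s in range(0, len(order), size)]

# Toy 2-D "features" for one category: a tight cluster plus two outliers.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (9.0, 9.0)]
subsets = curriculum_subsets(features, cutoff=0.5)
print(subsets)
```

Training would then start on the first (densest) subset and add the sparser subsets in later stages, which is the sense in which the curriculum moves from easy to hard data.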

ClothFlow: A Flow-Based Model for Clothed Person Generation

et al. 2019

Cross-Batch Memory for Embedding Learning

et al. 2020

Earliest presence of humans in northeast Asia

Zhu, Hoffman, Potts, et al. 2001, Nature

186

148

The timing of the earliest habitation and oldest stone technologies in different regions of the world remains a contentious topic in the study of human evolution. Here we contribute to this debate with detailed magnetostratigraphic results on two exposed parallel sections of lacustrine sediments at Xiaochangliang in the Nihewan Basin, north China; these results place stringent controls on the age of Palaeolithic stone artifacts that were originally reported over two decades ago. Our palaeomagnetic findings indicate that the artifact layer resides in a reverse polarity magnetozone bounded by the Olduvai and Jaramillo subchrons. Coupled with an estimated rate of sedimentation, these findings constrain the layer's age to roughly 1.36 million years ago. This result represents the age of the oldest known stone assemblage comprising recognizable types of Palaeolithic tool in east Asia, and the earliest definite occupation in this region as far north as 40 degrees N.
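The dating argument, a layer bracketed by two polarity boundaries and placed between them using a sedimentation rate, amounts to a linear interpolation. The sketch below assumes a constant sedimentation rate and uses the commonly cited ages of about 1.07 Ma for the base of the Jaramillo subchron and about 1.77 Ma for the top of the Olduvai subchron. The depths are invented for illustration and are not the measured Xiaochangliang section.

```python
def interpolate_age(depth, upper_depth, upper_age, lower_depth, lower_age):
    """Estimate the age (Ma) of a layer lying between two dated boundaries,
    assuming sediment accumulated at a constant rate between them."""
    rate = (lower_depth - upper_depth) / (lower_age - upper_age)  # metres per Myr
    return upper_age + (depth - upper_depth) / rate

# Hypothetical depths in metres below a reference level.
JARAMILLO_BASE = (10.0, 1.07)   # (depth, age in Ma); depth is made up
OLDUVAI_TOP = (24.0, 1.77)      # (depth, age in Ma); depth is made up

age = interpolate_age(17.0, JARAMILLO_BASE[0], JARAMILLO_BASE[1],
                      OLDUVAI_TOP[0], OLDUVAI_TOP[1])
print(round(age, 2))
```

A layer that sits exactly halfway between the two boundaries in depth is assigned an age halfway between them in time. The real estimate depends on the measured depths and on how well the constant-rate assumption holds.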
