For many computer vision applications, such as image description and human identification, recognizing the visual attributes of humans is an essential yet challenging problem. Its challenges originate from its multi-label nature, the large underlying class imbalance, and the lack of spatial annotations. Existing methods either follow a computer vision approach that fails to account for class imbalance, or explore machine learning solutions that disregard the spatial and semantic relations present in images. With that in mind, we propose an effective method that extracts and aggregates visual attention masks at different scales. We introduce a loss function that handles class imbalance at both the class and instance levels, and further demonstrate that penalizing attention masks with high prediction variance accounts for the weak supervision of the attention mechanism. By identifying and addressing these challenges, we achieve state-of-the-art results with a simple attention mechanism on both the PETA and WIDER-Attribute datasets without additional context or side information.
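The abstract does not spell out the loss, so the following is only a minimal sketch of the two ideas it names: a reweighted binary cross-entropy to counter class imbalance, plus a penalty on the variance of per-scale attention predictions. All names (`imbalance_aware_loss`, `att_logits`, `lam`) and the exponential weighting scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def imbalance_aware_loss(logits, targets, pos_ratio, att_logits, lam=0.1):
    """Weighted BCE over attributes plus a variance penalty on
    multi-scale attention predictions (illustrative sketch).

    logits:     (B, C) final attribute logits
    targets:    (B, C) binary attribute labels
    pos_ratio:  (C,)   fraction of positive samples per attribute
    att_logits: (S, B, C) per-scale logits from S attention masks
    lam:        weight of the variance penalty (assumed value)
    """
    # Class-level reweighting: rare positive labels get larger weights.
    w_pos = torch.exp(1.0 - pos_ratio)          # weight when y = 1
    w_neg = torch.exp(pos_ratio)                # weight when y = 0
    weights = targets * w_pos + (1 - targets) * w_neg
    bce = F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

    # Penalize attention masks whose per-scale predictions disagree:
    # high variance across scales signals unreliable, weakly
    # supervised attention.
    var_penalty = torch.sigmoid(att_logits).var(dim=0).mean()

    return bce + lam * var_penalty
```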
Deep neural networks often require copious amounts of labeled data to train their vast numbers of parameters. Training larger and deeper networks is hard without appropriate regularization, particularly when using a small dataset. At the same time, collecting well-annotated data is expensive, time-consuming, and often infeasible. A popular way to regularize these networks is to simply train them with more data from an alternate, representative dataset. This can have adverse effects if the statistics of the representative dataset are dissimilar to those of the target, a predicament caused by domain shift: data from a shifted domain may not produce suitable features when a feature extractor trained on the representative domain is used. In this paper, we propose a new domain-adaptation technique (d-SNE) that cleverly uses stochastic neighborhood embedding techniques and a novel modified-Hausdorff distance. The proposed technique is learnable end-to-end and is therefore well suited to training neural networks. Extensive experiments demonstrate that d-SNE outperforms the current state of the art and is robust to variations across datasets, even in the one-shot and semi-supervised learning settings. d-SNE also demonstrates the ability to generalize to multiple domains concurrently.
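As a rough illustration of the modified-Hausdorff idea described above, the sketch below pairs each target embedding with its farthest same-class and nearest different-class source neighbor. The function name, the squared-Euclidean metric, and the hinge margin are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def dsne_loss(src_feats, src_labels, tgt_feats, tgt_labels, margin=1.0):
    """Sketch of a d-SNE-style loss: for each target embedding, pull in
    the *farthest* same-class source embedding and push away the
    *nearest* different-class one (a modified-Hausdorff relaxation).

    src_feats: (Ns, D) and tgt_feats: (Nt, D) feature embeddings.
    """
    # Pairwise squared Euclidean distances, shape (Nt, Ns).
    dists = torch.cdist(tgt_feats, src_feats) ** 2
    same = tgt_labels.unsqueeze(1) == src_labels.unsqueeze(0)

    # Farthest same-class source neighbor (sup over intra-class pairs).
    intra = dists.masked_fill(~same, 0.0).max(dim=1).values
    # Nearest different-class source neighbor (inf over inter-class pairs).
    inter = dists.masked_fill(same, float('inf')).min(dim=1).values

    # Hinge: keep the intra-class spread below the inter-class gap.
    return torch.relu(intra - inter + margin).mean()
```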
We propose to detect Deepfakes generated by face manipulation based on one of their fundamental features: the images are blended from patches drawn from multiple sources, each carrying distinct and persistent source features. In particular, we propose a novel representation learning approach for this task, called patch-wise consistency learning (PCL). It learns by measuring the consistency of image source features, resulting in representations with good interpretability and robustness to multiple forgery methods. We further develop an inconsistency image generator (I2G) to generate training data for PCL and boost its robustness. We evaluate our approach on seven popular Deepfake detection datasets. Our model achieves superior detection accuracy and generalizes well to unseen generation methods. On average, our model outperforms the state of the art in terms of AUC by 2% and 8% in the in-dataset and cross-dataset evaluations, respectively.
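The abstract leaves the consistency measure unspecified; one plausible reading, sketched below, scores every pair of patch features by cosine similarity and supervises the resulting map with a same-source indicator derived from a blending mask (e.g. as produced by I2G). Tensor shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patch_consistency_loss(patch_feats, src_mask):
    """Sketch of patch-wise consistency learning: predict, for every
    pair of patches, whether they come from the same source, and
    supervise with the blending mask.

    patch_feats: (B, N, D) local features for N patches
    src_mask:    (B, N) binary source id per patch,
                 1 = manipulated region, 0 = pristine background
    """
    feats = F.normalize(patch_feats, dim=-1)
    # Pairwise cosine similarity between all patches, (B, N, N),
    # mapped to [0, 1] as a same-source probability.
    sim = (feats @ feats.transpose(1, 2) + 1.0) / 2.0

    # Ground-truth consistency: 1 iff both patches share a source.
    gt = (src_mask.unsqueeze(2) == src_mask.unsqueeze(1)).float()

    return F.binary_cross_entropy(sim.clamp(1e-6, 1 - 1e-6), gt)
```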
For many computer vision applications, such as image captioning, visual question answering, and person search, learning discriminative feature representations at both the image and text level is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge by introducing loss functions that help the network learn better feature representations but fail to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly available language model that extracts word embeddings, can successfully be applied in the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely used, publicly available datasets, resulting in absolute improvements ranging from 2% to 5% in terms of rank-1 accuracy.
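A minimal sketch of the adversarial-matching idea, assuming a standard two-player setup in which a small discriminator tries to tell image features from text features while the encoders are trained to fool it; the module layout and loss labels are illustrative, not TIMAM's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """Tries to tell image features from text features; the encoders
    are trained to fool it, pushing both modalities toward a shared,
    modality-invariant space."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1))

    def forward(self, feats):
        return self.net(feats).squeeze(-1)

def adversarial_losses(disc, img_feats, txt_feats):
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(img_feats.size(0), device=img_feats.device)
    zeros = torch.zeros(txt_feats.size(0), device=txt_feats.device)

    # Discriminator step: label images 1, text 0 (features detached
    # so only the discriminator updates).
    d_loss = bce(disc(img_feats.detach()), ones) + \
             bce(disc(txt_feats.detach()), zeros)
    # Encoder step: flip the labels so the encoders fool the critic.
    g_loss = bce(disc(img_feats), zeros) + bce(disc(txt_feats), ones)
    return d_loss, g_loss
```

In such a setup the adversarial terms would be combined with a cross-modal matching loss that pulls corresponding image-text pairs together.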
Purpose: To demonstrate the repeatability of fast 3D T1 mapping using Magnetization-Prepared Golden-angle RAdial Sparse Parallel (MP-GRASP) MRI and its robustness to variation of imaging parameters, including flip angle and spatial resolution, in phantoms and the brain.
Theory and Methods: Multiple imaging experiments were performed to (1) assess the robustness of MP-GRASP T1 mapping to B1 inhomogeneity using a single-tube phantom filled with uniform MnCl2 solution; (2) compare the repeatability of T1 mapping between MP-GRASP and inversion recovery-based spin-echo (IR-SE) imaging (over 12 scans), using a commercial T1MES phantom; (3) evaluate the longitudinal variation of T1 estimation using MP-GRASP with varying imaging parameters, including spatial resolution, flip angle, TR/TE, and acceleration rate, using the T1MES phantom (106 scans performed over a period of 12 months); and (4) evaluate the variation of T1 estimation using MP-GRASP with varying imaging parameters in the brain (24 scans in a single visit). In addition, the accuracy of MP-GRASP T1 mapping was validated against IR-SE by performing linear correlation and calculating Lin's concordance correlation coefficient (CCC).
Results: MP-GRASP demonstrates good robustness to B1 inhomogeneity, with intra-slice variability below 1% in the single-tube phantom experiment. The longitudinal variability is low both in the phantom (below 2.5%) and in the brain (below 2%) with varying imaging parameters. The T1 values estimated from MP-GRASP are accurate compared to those from IR-SE imaging (R² = 0.997, Lin's CCC = 0.996).
Conclusion: MP-GRASP shows excellent repeatability of T1 estimation over time, and it is also robust to variation of the different imaging parameters evaluated in this study.
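For reference, Lin's CCC used above to quantify agreement between MP-GRASP and IR-SE T1 values is a standard statistic; a small self-contained implementation (function name illustrative) could look like this:

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two sets of
    paired measurements, e.g. MP-GRASP vs. IR-SE T1 estimates.
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()               # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike plain Pearson correlation, the CCC also penalizes systematic bias between the two methods, which is why it is used alongside linear correlation for accuracy validation.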
In recent years, face detection has seen significant performance improvements driven by deep convolutional neural networks. In this report, we reimplement the state-of-the-art detector [1] and apply several tricks proposed in the recent literature to obtain an extremely strong face detector, named VIM-FD. Specifically, we adopt a more powerful backbone network, DenseNet-121 [2], revisit the data augmentation based on data-anchor-sampling proposed in [3], and use the max-in-out label and anchor matching strategy of [4]. In addition, we introduce an attention mechanism [5,6] to provide additional supervision. On the most popular and challenging face detection benchmark, WIDER FACE [7], the proposed VIM-FD achieves state-of-the-art performance.
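The max-in-out label of [4] is only named here; as a hedged sketch of how such a classification head is commonly built (channel counts and names are illustrative assumptions, not necessarily VIM-FD's configuration), each anchor predicts several face and several background scores and keeps only the maximum of each group:

```python
import torch
import torch.nn as nn

class MaxInOutHead(nn.Module):
    """Illustrative max-in-out classification head: predict cp face
    scores and cn background scores per anchor and keep the maximum
    of each group, which helps with small faces and hard negatives."""
    def __init__(self, in_ch, num_anchors, cp=1, cn=3):
        super().__init__()
        self.cp, self.cn = cp, cn
        self.conv = nn.Conv2d(in_ch, num_anchors * (cp + cn), 1)

    def forward(self, x):
        b, _, h, w = x.shape
        out = self.conv(x).view(b, -1, self.cp + self.cn, h, w)
        pos = out[:, :, :self.cp].max(dim=2).values   # best face score
        neg = out[:, :, self.cp:].max(dim=2).values   # best background score
        return torch.stack([neg, pos], dim=2)         # (B, A, 2, H, W)
```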