Object counting is an important task in computer vision due to its growing demand in applications such as surveillance, traffic monitoring, and counting everyday objects. State-of-the-art methods use regression-based optimization, where they explicitly learn to count the objects of interest. These often perform better than detection-based methods, which need to learn the more difficult task of predicting the location, size, and shape of each object. However, we propose a detection-based method that does not need to estimate the size and shape of the objects and that outperforms regression-based methods. Our contributions are three-fold: (1) we propose a novel loss function that encourages the network to output a single blob per object instance using point-level annotations only; (2) we design two methods for splitting large predicted blobs between object instances; and (3) we show that our method achieves new state-of-the-art results on several challenging datasets, including Pascal VOC and the Penguins dataset. Our method even outperforms those that use stronger supervision such as depth features, multi-point annotations, and bounding-box labels.
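As a rough illustration of the idea, the sketch below implements a simplified localization loss with point-level supervision in PyTorch: a per-pixel foreground map is supervised only at the annotated point locations, plus an image-level term for images with no objects. The blob-splitting and false-positive terms of the full method are omitted, and all names and tensor shapes here are our own assumptions, not the authors' code.

```python
import torch

def point_supervision_loss(logits, points):
    """Simplified point-level counting loss (illustrative sketch only).

    logits: (H, W) raw per-pixel foreground scores from a segmentation net.
    points: list of (y, x) point annotations, one per object instance.
    """
    probs = torch.sigmoid(logits)

    # Image-level term: if the image contains no objects, every pixel should
    # be background; otherwise at least one pixel should be foreground.
    if len(points) == 0:
        return -torch.log(1.0 - probs.max() + 1e-8)
    loss = -torch.log(probs.max() + 1e-8)

    # Point-level term: every annotated point must be predicted as foreground.
    for (y, x) in points:
        loss = loss - torch.log(probs[y, x] + 1e-8)
    return loss

# Example: loss = point_supervision_loss(torch.randn(64, 64), [(10, 12), (40, 33)])
```

Because supervision touches only single pixels and the image as a whole, the network never needs size or shape labels, which is what distinguishes this setup from standard detection losses.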
Neural networks are prone to adversarial attacks. In general, such attacks deteriorate the quality of the input either by slightly modifying most of its pixels or by occluding it with a patch. In this paper, we propose a method that keeps the image unchanged and only adds an adversarial framing on the border of the image. We show empirically that our method is able to successfully attack state-of-the-art methods on both image and video classification problems. Notably, the proposed method results in a universal attack which is very fast at test time. Source code can be found at github.com/zajaczajac/adv_framing.
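A minimal sketch of the border-framing idea follows; this is our own simplified PyTorch rendition under assumed shapes and names, not the released code. A single learnable frame of width w is pasted onto the image border, the interior pixels stay untouched, and only the frame parameters are optimized to increase the classifier's loss, which is why one frame can serve as a universal attack that is cheap to apply at test time.

```python
import torch

class AdversarialFraming(torch.nn.Module):
    """Learnable border framing; interior pixels are left unchanged."""

    def __init__(self, width=4, channels=3, size=224):
        super().__init__()
        self.width = width
        # One universal frame shared across all inputs (images in [0, 1]).
        self.frame = torch.nn.Parameter(torch.rand(1, channels, size, size))

    def forward(self, x):
        w = self.width
        mask = torch.zeros_like(x[:1])   # 1 on the border, 0 in the interior
        mask[..., :w, :] = 1
        mask[..., -w:, :] = 1
        mask[..., :, :w] = 1
        mask[..., :, -w:] = 1
        return x * (1 - mask) + self.frame.clamp(0, 1) * mask

# Training would perform gradient ascent on the classification loss,
# updating only framing.frame, e.g.:
#   out = classifier(framing(images))
#   loss = -torch.nn.functional.cross_entropy(out, labels)
#   loss.backward(); optimizer.step()
```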
Despite much effort in the community, the momentum of Android research has not yet produced complete tools for performing thorough analysis of Android apps, leaving users vulnerable to malicious apps. Because it is hard for a single tool to efficiently address all of the various challenges of Android programming that make analysis difficult, we propose to instrument the app code to reduce analysis complexity, e.g., transforming a hard problem into an easily resolvable one. To this end, we introduce in this paper Apkpler, a plugin-based framework for supporting such instrumentation. We evaluate Apkpler with two plugins, demonstrating the feasibility of our approach and showing that Apkpler can indeed be leveraged to reduce the analysis complexity of Android apps.
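The abstract does not detail Apkpler's interface, but the plugin-based design it describes is straightforward: a core driver loads an app, runs each registered instrumentation plugin over the code, and hands the transformed app to downstream analyzers. The following Python sketch illustrates only that architecture; Apkpler itself operates on Android app code, and every name and transformation below is hypothetical.

```python
from abc import ABC, abstractmethod

class InstrumentationPlugin(ABC):
    """One code transformation that simplifies a later analysis."""

    @abstractmethod
    def transform(self, app_code: str) -> str: ...

class ReflectionResolver(InstrumentationPlugin):
    # Hypothetical plugin: rewrite reflective calls into direct calls so
    # that a static analyzer no longer has to model reflection.
    def transform(self, app_code: str) -> str:
        return app_code.replace("Method.invoke", "direct_call")

class PluginFramework:
    def __init__(self):
        self.plugins: list[InstrumentationPlugin] = []

    def register(self, plugin: InstrumentationPlugin) -> None:
        self.plugins.append(plugin)

    def instrument(self, app_code: str) -> str:
        # Apply each registered transformation in order.
        for plugin in self.plugins:
            app_code = plugin.transform(app_code)
        return app_code
```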
Single-view 3D shape reconstruction is an important but challenging problem, mainly for two reasons. First, as shape annotation is very expensive to acquire, current methods rely on synthetic data, in which ground-truth 3D annotation is easy to obtain. However, this results in a domain adaptation problem when the models are applied to natural images. Second, there are multiple shapes that can explain a given 2D image. In this paper, we propose a framework to address these challenges using adversarial training. On one hand, we impose domain confusion between natural and synthetic image representations to reduce the distribution gap. On the other hand, we impose the reconstruction to be 'realistic' by forcing it to lie on a (learned) manifold of realistic object shapes. Our experiments show that these constraints improve performance by a large margin over baseline reconstruction models. We achieve results competitive with the state of the art with a much simpler architecture.
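The two constraints can be read as two adversarial loss terms added to an ordinary reconstruction loss. Below is a hedged PyTorch sketch under our own assumptions (module names, shapes, and the specific "confusion" objective are illustrative): a domain discriminator is driven toward maximal uncertainty about whether features come from natural or synthetic images, while a shape discriminator pushes predicted shapes toward the learned manifold of realistic shapes.

```python
import torch
import torch.nn.functional as F

def domain_confusion_loss(d_domain, feats):
    """Encoder term: reward features that leave the domain discriminator
    maximally uncertain (output near 0.5) about natural vs. synthetic."""
    p = torch.sigmoid(d_domain(feats))
    uniform = torch.full_like(p, 0.5)
    return F.binary_cross_entropy(p, uniform)

def shape_realism_loss(d_shape, pred_shapes):
    """Generator term of a GAN on shapes: the reconstruction should be
    classified as 'realistic' by the shape discriminator."""
    p = torch.sigmoid(d_shape(pred_shapes))
    return F.binary_cross_entropy(p, torch.ones_like(p))

# total = recon_loss \
#       + a * domain_confusion_loss(d_domain, encoder(images)) \
#       + b * shape_realism_loss(d_shape, decoder(encoder(images)))
# with a, b as hypothetical weighting hyper-parameters; the two
# discriminators are trained adversarially in alternation.
```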
Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. In this paper, we propose to leverage cross-modal information to enhance metric-based few-shot learning methods. Visual and semantic feature spaces have different structures by definition: for certain concepts, visual features might be richer and more discriminative than text ones, while for others the inverse might be true. Moreover, when the support from visual information is limited in image classification, semantic representations (learned from unsupervised text corpora) can provide strong prior knowledge and context to aid learning. Based on these two intuitions, we propose a mechanism that can adaptively combine information from both modalities according to the new image categories to be learned. Through a series of experiments, we show that by adaptively combining the two modalities, our model outperforms current uni-modality few-shot learning methods and modality-alignment methods by a large margin on all benchmarks and few-shot scenarios tested. Experiments also show that our model can effectively adjust its focus between the two modalities. The improvement in performance is particularly large when the number of shots is very small.
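A minimal PyTorch sketch of such an adaptive combination, under our own naming assumptions: a small gating network looks at the semantic embedding of each novel class and outputs a mixing coefficient, and the class prototype used for nearest-prototype few-shot classification is the convex combination of the visual prototype and the projected semantic embedding.

```python
import torch

class AdaptiveModalityMixer(torch.nn.Module):
    def __init__(self, dim=512, text_dim=300):
        super().__init__()
        self.project = torch.nn.Linear(text_dim, dim)  # text -> visual space
        self.gate = torch.nn.Sequential(               # per-class mixing weight
            torch.nn.Linear(dim, dim), torch.nn.ReLU(),
            torch.nn.Linear(dim, 1), torch.nn.Sigmoid(),
        )

    def forward(self, visual_proto, semantic_emb):
        """visual_proto: (num_classes, dim) mean of support-set features.
        semantic_emb: (num_classes, text_dim) embeddings of class names."""
        sem = self.project(semantic_emb)
        lam = self.gate(sem)   # (num_classes, 1), in (0, 1)
        # Convex combination: lean on whichever modality helps this class.
        return lam * visual_proto + (1.0 - lam) * sem
```

Query images would then be classified by comparing their features to these mixed prototypes; with very few shots, the gate can shift weight toward the semantic side, which is consistent with the larger gains reported in the low-shot regime.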