One of the main challenges in machine vision relates to the problem of obtaining robust representation of visual features that remain unaffected by geometric transformations. This challenge arises naturally in many practical machine vision tasks. For example, in mobile robot applications like simultaneous localization and mapping (SLAM) and visual tracking, object shapes change depending on their orientation in the 3D world, camera proximity, viewpoint, or perspective. In addition, natural phenomena such as occlusion, deformation, and clutter can cause geometric appearance changes of the underlying objects, leading to geometric transformations of the resulting images. Recently, deep learning techniques have proven very successful in visual recognition tasks but they typically perform poorly with small data or when deployed in environments that deviate from training conditions. While convolutional neural networks (CNNs) have inherent representation power that provides a high degree of invariance to geometric image transformations, they are unable to satisfactorily handle nontrivial transformations. In view of this limitation, several techniques have been devised to extend CNNs to handle these situations. This article reviews some of the most promising approaches to extend CNN architectures to handle nontrivial geometric transformations. Key strengths and weaknesses, as well as the application domains of the various approaches are also highlighted. The review shows that although an adequate model for generalized geometric transformations has not yet been formulated, several techniques exist for solving specific problems. Using these methods, it is possible to develop task-oriented solutions to deal with nontrivial transformations.
Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have a great potential to revolutionize applications in zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall in existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted severe limitation of the REC framework in this application setting owing to its tendency to make false positive predictions when the target is absent in the given image. And, while this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on six publicly available datasets across different domains and reveal that these errors consistently follow a predictable pattern and can, thus, be mitigated by a simple strategy. Specifically, we show that false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Meanwhile, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvements in segmentation performance and annotation time savings over manual approaches.
In recent years, monocular depth estimation (MDE) has witnessed a substantial performance improvement due to convolutional neural networks (CNNs). However, CNNs are vulnerable to adversarial attacks, which pose serious concerns for safety-critical and security-sensitive systems. Specifically, adversarial attacks can have catastrophic impact on MDE given its importance for scene understanding in applications like autonomous driving and robotic navigation. To physically assess the vulnerability of CNN-based depth prediction methods, recent work tries to design adversarial patches against MDE. However, these methods are not powerful enough to fully fool the vision system in a systemically threatening manner. In fact, their impact is partial and locally limited; they mislead the depth prediction of only the overlapping region with the input image regardless of the target object size, shape and location. In this paper, we investigate MDE vulnerability to adversarial patches in a more comprehensive manner. We propose a novel adaptive adversarial patch (APARATE 1 ) that is able to selectively jeopardize MDE by either corrupting the estimated distance, or simply manifesting an object as disappeared for the autonomous system. Specifically, APARATE is optimized to be shape and scale-aware, and its impact adapts to the target object instead of being limited to the immediate neighborhood. Our proposed patch achieves more than 14 meters mean depth estimation error, with 99% of the target region being affected. We believe this work highlights the threat of adversarial attacks in the context of MDE, and we hope it would alert the community to the real-life potential harm of this attack and motivate investigating more robust and adaptive defenses for autonomous robots.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.