In the context of autonomous navigation, the localization of the vehicle relies on the accurate detection and tracking of artificial landmarks. These landmarks are based on handcrafted features. However, because of their low-level nature, they are neither very informative nor robust under varying conditions (lighting, weather, point of view). Moreover, in Advanced Driver-Assistance Systems (ADAS) and road safety, intense efforts have been made to implement automatic visual data processing, with special emphasis on road object recognition. The main idea of this work is to detect more accurate, higher-level landmarks, such as static semantic objects, using deep learning frameworks. We focus on the accurate detection, segmentation, and classification of vertical traffic signs according to their function (danger, give way, prohibition/obligation, and indication). This paper presents the boundary estimation of European traffic signs from a monocular camera embedded in a vehicle. We propose a framework using two different deep neural networks to: (1) detect and recognize traffic signs in the video stream and (2) regress the coordinates of each vertex of a detected traffic sign to estimate its shape boundary. We also compare our method with Mask R-CNN [1], the state-of-the-art segmentation method.
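A minimal sketch of the two-network pipeline described above: one network detects and classifies signs, a second regresses the vertex coordinates of each detected sign crop. This is not the authors' implementation; the module sizes, the detector interface, and the fixed number of vertices are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4   # danger, give way, prohibition/obligation, indication (assumed)
MAX_VERTICES = 8  # assumed upper bound on polygon vertices per sign

class VertexRegressor(nn.Module):
    """Regress normalized (x, y) coordinates of a sign's boundary vertices."""
    def __init__(self, num_vertices=MAX_VERTICES):
        super().__init__()
        self.backbone = nn.Sequential(           # tiny stand-in CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_vertices * 2)

    def forward(self, crops):                    # crops: (N, 3, H, W)
        coords = self.head(self.backbone(crops))
        return torch.sigmoid(coords).view(-1, MAX_VERTICES, 2)  # in [0, 1]

def estimate_boundaries(frame, detector, regressor):
    """Run the pipeline on one frame: detect signs, then regress their vertices."""
    boxes, labels = detector(frame)              # assumed detector interface
    boundaries = []
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        crop = frame[:, :, y1:y2, x1:x2]         # crop the detected sign
        verts = regressor(crop)[0]               # normalized vertex coordinates
        # map normalized coordinates back to image coordinates
        verts = verts * torch.tensor([x2 - x1, y2 - y1]) + torch.tensor([x1, y1])
        boundaries.append((label, verts))
    return boundaries
```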
Existing methods for video instance segmentation (VIS) mostly rely on one of two strategies: (1) building sophisticated post-processing to associate frame-level segmentation results, or (2) modeling a video clip as a 3D spatio-temporal volume, with limits on resolution and clip length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On a TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the YouTube-VIS dataset.
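A minimal sketch, under stated assumptions, of the two-frame bipartite matching idea: the cost of assigning a ground-truth instance to a query slot is summed over both frames, so a single permutation is shared across the pair and each query slot stays bound to the same instance. The cost matrices and their contents are hypothetical stand-ins for the paper's classification and mask costs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_frame_matching(cost_t, cost_t1):
    """
    cost_t, cost_t1: (num_queries, num_gt) matrices holding the per-frame
    matching cost between each instance sequence query and each ground-truth
    instance at frames t and t+1. Returns one assignment used for both frames.
    """
    total_cost = cost_t + cost_t1                  # sum costs over the two frames
    query_idx, gt_idx = linear_sum_assignment(total_cost)
    return list(zip(query_idx.tolist(), gt_idx.tolist()))

# toy usage: 4 query slots, 2 ground-truth instances
rng = np.random.default_rng(0)
cost_t = rng.random((4, 2))
cost_t1 = rng.random((4, 2))
print(two_frame_matching(cost_t, cost_t1))         # e.g. [(q_a, 0), (q_b, 1)]
```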