Recent advancements in perception for autonomous driving are driven by deep learning. In order to achieve robust and accurate scene understanding, autonomous vehicles are usually equipped with different sensors (e.g. cameras, LiDARs, Radars), and multiple sensing modalities can be fused to exploit their complementary properties. In this context, many methods have been proposed for deep multi-modal perception problems. However, there is no general guideline for network architecture design, and questions of "what to fuse", "when to fuse", and "how to fuse" remain open. This review paper attempts to systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving. To this end, we first provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection and semantic segmentation in autonomous driving research. We then summarize the fusion methodologies and discuss challenges and open questions. In the appendix, we provide tables that summarize topics and methods. We also provide an interactive online platform to navigate each reference: https://boschresearch.github.io/multimodalperception/. 0.99 0.8 0.98 0.99 0.96 0.96 0.94 Vehicle Person Road sign Traffic light LiDAR Points Map Radar Points RGB Image
Please check the manuscript for details of any other licences that may have been applied and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. You may not engage in further distribution of the material for any profitmaking activities or any commercial gain. You may freely distribute both the url (http://uhra.herts.ac.uk/) and the content of this paper for research or private study, educational, or not-for-profit purposes without prior permission or charge. Take down policy If you believe that this document breaches copyright please contact us providing details, any such items will be temporarily removed from the repository pending investigation.
State-of-the-art, high capacity deep neural networks not only require large amounts of labelled training data, they are also highly susceptible to labelling errors in this data, typically resulting in large efforts and costs and therefore limiting the applicability of deep learning. To alleviate this issue, we propose a novel meta training and labelling scheme that is able to use inexpensive unlabelled data by taking advantage of the generalization power of deep neural networks. We show experimentally that by solely relying on one network architecture and our proposed scheme of iterative training and prediction steps, both label quality and resulting model accuracy, can be improved significantly. Our method achieves state-of-the-art results, while being architecture agnostic and therefore broadly applicable. Compared to other methods dealing with erroneous labels, our approach does neither require another network to be trained, nor does it necessarily need an additional, highly accurate reference label set. Instead of removing samples from a labelled set, our technique uses additional sensor data without the need for manual labelling.
Rawande Karadaghi, Heinz Hertlein, and, Aladdin Ariyaeeinia, "Open-set speaker identification with diverse-duration speech data", in Proceedings of SPIE 9457, Biometric and Surveillance Technology for Human Activity Identification XII, Baltimore, USA, 20 April 2015. DOI:10.1117/12.2176335?? 2015 SPIE.The concern in this paper is an important category of applications of open-set speaker identification in criminal investigation, which involves operating with short and varied duration speech. The study presents investigations into the adverse effects of such an operating condition on the accuracy of open-set speaker identification, based on both GMMUBM and i-vector approaches. The experiments are conducted using a protocol developed for the identification task, based on the NIST speaker recognition evaluation corpus of 2008. In order to closely cover the real-world operating conditions in the considered application area, the study includes experiments with various combinations of training and testing data duration. The paper details the characteristics of the experimental investigations conducted and provides a thorough analysis of the results obtained
An important consideration for the deployment of speaker recognition in authentication applications is the approach to the formation of training and testing utterances. Whilst defining this for a specific scenario is influenced by the associated requirements and conditions, the process can be further guided through the establishment of the relative usefulness of alternative frameworks for composing the training and testing material. In this regard, the present paper provides an analysis of the effects, on the speaker recognition accuracy, of various bases for the formation of the training and testing data. The experimental investigations are conducted based on the use of digit utterances taken from the XM2VTS database. The paper presents a detailed description of the individual approaches considered and discusses the experimental results obtained in different cases.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.