Autonomous micro aerial vehicles (MAVs) have cost and mobility benefits, making them ideal robotic platforms for applications including aerial photography, surveillance, and search and rescue. As the platform scales down, MAVs become more capable of operating in confined environments, but the smaller scale also introduces significant size and payload constraints. A monocular visual-inertial navigation system (VINS), consisting only of an inertial measurement unit (IMU) and a camera, becomes the most suitable sensor suite in this case, thanks to its light weight and small footprint. In fact, it is the minimum sensor suite that allows autonomous flight with sufficient environmental awareness. In this paper, we show that it is possible to achieve reliable online autonomous navigation using monocular VINS. Our system is built on a customized quadrotor testbed equipped with a fisheye camera, a low-cost IMU, and heterogeneous onboard computing resources. The backbone of our system is a highly accurate optimization-based monocular visual-inertial state estimator with online initialization and self-extrinsic calibration. An onboard GPU-based monocular dense mapping module, conditioned on the estimated pose, provides wide-angle situational awareness. Finally, an online trajectory planner that operates directly on the incrementally built three-dimensional map guarantees safe navigation through cluttered environments. Extensive experimental results are provided to validate individual system modules as well as the overall performance in both indoor and outdoor environments.
There have been increasing demands for developing micro aerial vehicles with vision-based autonomy for search and rescue missions in complex environments. In particular, the monocular visual-inertial system (VINS), which consists of only an inertial measurement unit (IMU) and a camera, forms an ideal lightweight sensor suite thanks to its low weight and small footprint. In this paper, we address two challenges for the rapid deployment of monocular VINS: 1) the initialization problem and 2) the calibration problem. We propose a methodology that is able to initialize velocity, gravity, visual scale, and the camera-IMU extrinsic calibration on the fly. Our approach operates in natural environments and does not use any artificial markers. It also does not require any prior knowledge about the mechanical configuration of the system. It is a significant step toward plug-and-play and highly customizable visual navigation for mobile robots. We show through online experiments that our method leads to accurate calibration of the camera-IMU transformation, with errors of less than 0.02 m in translation and 1° in rotation. We compare our method with a state-of-the-art marker-based offline calibration method and show superior results. We also demonstrate the performance of the proposed approach in large-scale indoor and outdoor experiments.
Note to Practitioners: This paper presents a methodology for online state estimation in natural environments using only a camera and a low-cost micro-electro-mechanical systems (MEMS) IMU. It focuses on addressing the problems of online estimator initialization, sensor extrinsic calibration, and nonlinear optimization with online refinement of calibration parameters. This paper is particularly useful for applications that have stringent size, weight, and power constraints. It aims for rapid deployment of robot platforms with robust state estimation capabilities and almost no setup, calibration, or initialization overhead. The proposed method can be used on platforms including handheld devices, aerial robots, and other small-scale mobile platforms, with applications in monitoring, inspection, and search and rescue.
This paper has supplementary downloadable multimedia material available at http://ieeexplore.ieee.org, provided by the authors. The supplementary material contains the following. Three experiments are presented in the video to demonstrate the performance of our self-calibrating monocular visual-inertial state estimation method. The first experiment details the camera-IMU extrinsic calibration process in a small indoor experiment. The second experiment evaluates the performance of the overall system in a large-scale indoor environment, with highlights of the online calibration process. The third experiment presents the state estimation results in a large-scale outdoor environment using different camera configurations. This material is 52.6 MB in size.
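The on-the-fly initialization described above must recover metric scale and gravity by aligning up-to-scale visual displacements with IMU preintegration quantities. Below is a minimal sketch of that linear alignment step; the function name, the fixed interval length `dt`, and the assumption that velocity terms have already been folded into the preintegrated right-hand sides `beta_imu` are our illustrative simplifications, not the paper's exact formulation:

```python
import numpy as np

def align_visual_inertial(dp_vis, beta_imu, dt):
    """Recover metric scale s and gravity vector g from up-to-scale visual
    displacements dp_vis[k] and IMU-derived terms beta_imu[k].

    Each interval k contributes the linear constraint
        s * dp_vis[k] - 0.5 * dt**2 * g = beta_imu[k],
    which we stack into A x = b with x = [s, gx, gy, gz] and solve by
    linear least squares.
    """
    K = len(dp_vis)
    A = np.zeros((3 * K, 4))
    b = np.zeros(3 * K)
    for k in range(K):
        A[3 * k:3 * k + 3, 0] = dp_vis[k]            # coefficient of scale s
        A[3 * k:3 * k + 3, 1:4] = -0.5 * dt**2 * np.eye(3)  # coefficient of g
        b[3 * k:3 * k + 3] = beta_imu[k]
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[0], x[1:4]  # scale, gravity
```

With at least two intervals the stacked system is overdetermined (3K equations, 4 unknowns), which is why a short excitation motion suffices for initialization.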
Recently, we have witnessed a significant performance boost on the dialogue response selection task achieved by Cross-Encoder based models. However, such models directly feed the concatenation of the context and the response into the pre-trained model for interactive inference, ignoring comprehensive independent representation modeling of the context and the response. Moreover, randomly sampling negative responses from other dialogue contexts is simplistic, and the learned models generalize poorly in realistic scenarios. In this paper, we propose a response selection model called BERT-BC that combines the representation-based Bi-Encoder and the interaction-based Cross-Encoder. Three contrastive learning methods are designed for the Bi-Encoder to align context and response and obtain better semantic representations. Meanwhile, according to the alignment difficulty of context and response semantics, harder samples are dynamically selected from the same batch at minimal cost and sent to the Cross-Encoder to enhance the model's interactive reasoning ability. Experimental results show that BERT-BC achieves state-of-the-art performance on three benchmark datasets for multi-turn response selection.
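The dynamic in-batch hard-negative selection can be sketched as follows, assuming the Bi-Encoder has already produced embeddings for the batch's contexts and their paired responses; the function name and the plain dot-product similarity are our illustrative assumptions, not the paper's exact scoring function:

```python
import numpy as np

def select_hard_negatives(ctx_emb, rsp_emb, k=1):
    """Given a batch of context embeddings (B, d) and their paired response
    embeddings (B, d), score every context against every response in the
    batch and return, for each context, the indices of the k highest-scoring
    NON-paired responses -- the in-batch hard negatives that would be handed
    to the Cross-Encoder for interactive re-scoring."""
    sim = ctx_emb @ rsp_emb.T               # (B, B) similarity matrix
    np.fill_diagonal(sim, -np.inf)          # mask out the positive pairs
    return np.argsort(-sim, axis=1)[:, :k]  # hardest negatives first
```

Because the negatives come from the same batch, this adds no extra encoding cost: the bi-encoder embeddings are computed once and reused for both the contrastive loss and the negative mining.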
Inferring the sentiment polarity or emotion category of subjective text is the fundamental task of sentiment analysis. Recently, emotion detection in conversations, which considers context utterances, has emerged as a very important and challenging task in this line of research. Most existing studies do not distinguish between different speakers in a dialog and fail to characterize inter-speaker dependencies for emotion detection. In this paper, we propose a Speaker Influence aware Neural Network model (dubbed SINN) to predict the emotion of the last utterance in a conversation; it explicitly models the self- and inter-speaker influences of historical utterances with gated recurrent units (GRUs) and a hierarchical attention matching network. Moreover, the empathy phenomenon is also considered via an emotion state tracking component in SINN. Finally, the target utterance representation is enhanced by speaker-influence-aware context modeling, where an attention mechanism extracts the most relevant features for emotion classification. We construct a large-scale multi-turn Chinese dialog dataset, WBEmoDialog, in which each utterance is manually annotated with an emotion label. Extensive experiments are conducted on the publicly available DailyDialog dataset as well as our WBEmoDialog dataset, and the results show that our model achieves performance better than or comparable to strong baseline methods.
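The speaker-influence-aware context modeling step can be illustrated with a toy sketch: score each historical utterance vector against the target utterance, bias the scores depending on whether the utterance came from the same speaker (self influence) or the other speaker (inter-speaker influence), and enhance the target with the attention-weighted context. All names, the scalar speaker bias, and the final concatenation are our simplifying assumptions; the paper's hierarchical attention matching network is more involved:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def speaker_aware_context(target, history, same_speaker, w_self=1.0, w_inter=0.5):
    """Enhance the target utterance vector (d,) with a context vector built
    from historical utterance vectors history (T, d). Attention scores are
    scaled by w_self for same-speaker utterances and w_inter for the other
    speaker's utterances, a crude stand-in for separate self/inter pathways."""
    scores = history @ target                        # relevance to target, (T,)
    bias = np.where(same_speaker, w_self, w_inter)   # speaker-dependent scaling
    attn = softmax(scores * bias)                    # attention weights, (T,)
    context = attn @ history                         # weighted context, (d,)
    return np.concatenate([target, context])         # enhanced representation
```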