The key to an accurate understanding of terrain is extracting informative features from the multi-modal data obtained from different devices. Sensors such as RGB cameras, depth sensors, vibration sensors, and microphones provide this multi-modal data, and many studies, especially in robotics, have explored ways to use them; some have successfully introduced single-modal or multi-modal methods. In practice, however, robots can face extreme conditions: microphones do not work well in crowded scenes, and an RGB camera cannot capture terrain well in the dark. In this paper, we present a novel framework that applies a multi-modal variational autoencoder and Gaussian mixture model clustering to image and audio data for terrain type clustering, forcing the features of the two modalities closer together in the shared feature space. Our method enables terrain type clustering even if one of the modalities (either image or audio) is missing at test time. We evaluated the clustering accuracy against a conventional multi-modal terrain type clustering method and conducted ablation studies to show the effectiveness of our approach.
INDEX TERMS: Self-supervised, Terrain type clustering, Multi-modal learning
FIGURE 1: Overview of our terrain clustering framework. We train the model to extract features from audio-visual data in a self-supervised manner. At test time, we assume that only a single modality (either audio or visual) can be accessed due to extreme conditions, and the obtained data is incrementally clustered into terrain types.
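A minimal sketch of the kind of pipeline the abstract describes: two VAE-style encoders map image and audio features to a shared latent space, an alignment term pulls paired latents together, and a Gaussian mixture model clusters the latents so that a single available modality can still be assigned a terrain type at test time. This is not the authors' code; the module names, dimensions, and loss weighting are illustrative assumptions.

```python
# Sketch only: multi-modal VAE encoders with a latent alignment term,
# followed by GMM clustering. All hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

class ModalityEncoder(nn.Module):
    """VAE-style encoder producing a mean and log-variance for one modality."""
    def __init__(self, in_dim, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def multimodal_loss(mu_img, logvar_img, mu_aud, logvar_aud,
                    recon_loss, beta=1.0, align_weight=1.0):
    # KL terms for both modalities plus an alignment term that forces the
    # image and audio latents of the same terrain sample closer together.
    kl = lambda mu, lv: -0.5 * torch.mean(1 + lv - mu.pow(2) - lv.exp())
    alignment = F.mse_loss(mu_img, mu_aud)
    return recon_loss + beta * (kl(mu_img, logvar_img) + kl(mu_aud, logvar_aud)) \
           + align_weight * alignment

# At test time, latents from whichever modality is available are clustered:
# gmm = GaussianMixture(n_components=n_terrain_types).fit(training_latents)
# labels = gmm.predict(audio_only_latents)
```

Because both encoders are trained to land in the same latent space, the GMM fitted on training latents can be reused unchanged when only one modality survives the deployment conditions.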
The contactless estimation of the weight of a container and the amount of its content manipulated by a person is a key prerequisite for safe human-to-robot handovers. However, the opaqueness or transparency of the container and the content, and the variability of materials, shapes, and sizes, make this problem challenging. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container and the type, mass, and amount of its content. The framework includes a dataset, specific tasks, and performance measures. We conduct a fair and in-depth comparative analysis of methods that used this framework and of audio-only or vision-only baselines designed from related works. Based on this analysis, we conclude that audio-only and audio-visual classifiers are suitable for estimating the type and amount of the content using different types of convolutional neural networks combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable for determining the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves weighted average F1-scores of up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and the filling mass with audio-visual, multi-stage algorithms reaches weighted average capacity and mass scores of up to 65%. These results show that there is still room for improvement in the design of future methods, which will be ranked and compared on the individual leaderboards provided by our open framework.
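An illustrative sketch of the audio-only classification strategy the abstract mentions: a small CNN classifies spectrogram segments of the manipulation audio, and a majority vote over segments produces the clip-level content-type prediction. The architecture, input representation, and class count are assumptions, not the benchmarked methods themselves.

```python
# Sketch: CNN over spectrogram segments + majority voting per clip.
# Class names, layer sizes, and input shape are illustrative assumptions.
import torch
import torch.nn as nn

class AudioSegmentCNN(nn.Module):
    def __init__(self, n_classes=4):  # e.g., pasta / rice / water / empty (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, spec):            # spec: (batch, 1, mel_bins, frames)
        h = self.features(spec).flatten(1)
        return self.classifier(h)

def clip_prediction(model, segments):
    """Majority vote over per-segment predictions for one audio clip."""
    with torch.no_grad():
        votes = model(segments).argmax(dim=1)
    return torch.mode(votes).values.item()
```

A recurrent head over the segment features would be the alternative aggregation strategy the abstract names; the voting version is shown here because it is the simpler of the two.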
This study aimed to anticipate fractures of fragile food during robotic food manipulation. Anticipating fractures allows a robot to manipulate ingredients without irreversible failure. Food fracture models investigated in the food texture field explain the properties of fragile objects well; however, they may not directly apply to robot manipulation because physical properties vary even within the same ingredient. To this end, we developed a fracture-anticipation system with a tactile sensing module and a simple recurrent neural network. The key idea was to allow the robot to break ingredients during training-sample collection: the timing of fractures was identified via simple signal processing and used as supervision. We performed real-robot experiments with three typical fragile foods: tofu, potato chips, and bananas. As a first step toward flexible fragile-object manipulation, we evaluated the proposed method on the fundamental task of object picking. The method successfully grasped the fragile foods without fractures in an online demonstration. In an offline evaluation, it predicted fractures with a recall of approximately 80% for all ingredients across 60 breaking trials. We believe that our method can be used to avoid breakage in other types of food manipulation, e.g., holding, pressing, and rolling.
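A minimal sketch of how such a fracture anticipator could be structured, assuming the "simple recurrent neural network" is a single-layer GRU over the tactile time series that outputs a per-timestep fracture probability, with labels derived from the fracture onsets detected offline by signal processing. The sensor dimensionality, hidden size, and stopping threshold are all illustrative assumptions.

```python
# Sketch: GRU over tactile readings predicting fracture probability per step.
# Dimensions, threshold, and the stopping rule are assumptions for illustration.
import torch
import torch.nn as nn

class FractureAnticipator(nn.Module):
    def __init__(self, n_taxels=16, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_taxels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tactile_seq):            # (batch, time, n_taxels)
        h, _ = self.rnn(tactile_seq)
        return torch.sigmoid(self.head(h))     # fracture probability per timestep

# During grasping, the gripper could stop closing once the latest predicted
# probability exceeds a chosen threshold (an assumed stopping rule):
# if model(window)[0, -1, 0] > 0.5: stop_gripper()
```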