2021
DOI: 10.1007/978-3-030-68793-9_31

Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-Robot Handovers

Cited by 12 publications (29 citation statements)
References 26 publications
“…While each property (content type, content level) could be classified independently [10]- [12], the combination of the two predictions can result in a wrong classification, if either is incorrect. We thus define a set of seven classes that combine content types and levels, C = {empty, pasta-half-full, pasta-full, rice-half-full, rice-full, water-half-full, water-full} (see Tab.…”
Section: Proposed Methods (mentioning)
confidence: 99%
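
The excerpt above enumerates the seven joint content-type/level classes. As a rough illustration only, a minimal Python sketch of that joint label set and of why composing two independent predictions is fragile; the helper functions and score dictionary are hypothetical placeholders, not the cited implementation:

```python
# The seven joint content-type/level classes enumerated in the excerpt above.
# combine_independent() and classify_joint() are hypothetical helpers used only
# to contrast the two formulations; they are not the cited implementation.
from itertools import product

TYPES = ["pasta", "rice", "water"]
LEVELS = ["half-full", "full"]

# C = {empty} plus every type/level combination -> 7 classes
CLASSES = ["empty"] + [f"{t}-{lvl}" for t, lvl in product(TYPES, LEVELS)]


def combine_independent(pred_type, pred_level):
    """Compose two independent predictions; an error in either one corrupts
    the combined label (the failure mode the excerpt describes)."""
    if pred_level == "empty":
        return "empty"
    return f"{pred_type}-{pred_level}"


def classify_joint(scores):
    """Joint classification: pick the highest-scoring of the seven classes."""
    return max(CLASSES, key=lambda c: scores.get(c, 0.0))


if __name__ == "__main__":
    print(CLASSES)
    # A wrong content-type prediction alone already breaks the combined label:
    print(combine_independent("rice", "half-full"))   # rice-half-full
    print(combine_independent("pasta", "half-full"))  # wrong type -> wrong class
    print(classify_joint({"rice-half-full": 0.8, "water-full": 0.2}))
```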
“…We compare ACC with 9 alternative approaches, namely ResNet (18 layers) [16], a shallower ResNet variant (14 layers), a ResNet (18 layers) pre-trained on ImageNet (ResNet-18) [17] and fine-tuned on the training split of CCM, VGG (11 layers) [14], Support Vector Machine (SVM) [18], Random Forest [19], K-Nearest Neighbours (kNN) [20], and the top-2 submissions of the 2020 CORSMAL Challenge, namely Because It's Tactile (BIT) [10], and HVRL [11]. SVM, kNN, Random Forest, VGG, and ResNet-based classifiers perform direct classification as a single model.…”
Section: A. Methods Under Comparison (mentioning)
confidence: 99%
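
For the classical baselines named in that comparison (SVM, Random Forest, kNN performing direct classification as a single model), a hedged scikit-learn sketch over a hypothetical feature matrix; the actual features, splits, and hyper-parameters of the cited evaluation are not reproduced here:

```python
# Illustrative comparison of classical classifiers doing direct classification,
# in the spirit of the baselines listed above. X and y are placeholders for
# pre-extracted features and the seven content-type/level labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 64))       # placeholder feature vectors
y = rng.integers(0, 7, size=700)     # placeholder labels for the 7 classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy = {acc:.3f}")
```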
“…combining adversarial training and transfer learning, can improve the classification accuracy [7]. Independent classification of content type and level can be achieved by using convolutional and recurrent neural networks with only audio as input data [4] or through late fusion of the predictions from both audio and visual features [5]. Alternatively, multiple multi-layer perceptrons can be trained with audio data and conditioned on the container category estimated from a majority voting of the object detection across the frames of multi-view sequences [6].…”
Section: Related Work (mentioning)
confidence: 99%
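
A compact sketch of the two ideas mentioned in that excerpt: late fusion of audio and visual class posteriors, and majority voting of per-frame container detections across multi-view sequences. The array shapes and the equal-weight averaging rule are illustrative assumptions, not the cited architectures:

```python
# Late fusion of per-modality class probabilities and majority voting over
# per-frame container detections. The equal-weight averaging rule is an
# assumption for illustration, not the cited models.
import numpy as np
from collections import Counter


def late_fusion(p_audio, p_video):
    """Average audio and visual class posteriors, then take the arg-max."""
    fused = 0.5 * p_audio + 0.5 * p_video
    return int(np.argmax(fused))


def majority_vote(per_frame_categories):
    """Container category as the most frequent detection across frames/views."""
    return Counter(per_frame_categories).most_common(1)[0][0]


if __name__ == "__main__":
    p_a = np.array([0.1, 0.7, 0.2])   # e.g. posteriors from an audio network
    p_v = np.array([0.2, 0.3, 0.5])   # e.g. posteriors from a visual network
    print(late_fusion(p_a, p_v))      # fused prediction: class 1
    print(majority_vote(["cup", "glass", "cup", "cup"]))  # -> "cup"
```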
“…Alternatively, multiple multi-layer perceptrons can be trained with audio data and conditioned on the container category estimated from a majority voting of the object detection across the frames of multi-view sequences [6]. Container capacity can be estimated as an approximation of a reconstructed shape [4], [5], [33]. An iterative approach minimises a 3D primitive to the real object shape by constraining to the object segmentation mask from two views of a wide-baseline stereo camera, using both RGB, depth, and infrared images [5].…”
Section: Related Work (mentioning)
confidence: 99%
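
To make the primitive-fitting idea concrete, a toy sketch that fits a cylinder's radius and height so its rendered silhouette matches binary segmentation masks from two views, then reports the cylinder volume as a capacity estimate. The orthographic projection, fixed pixel-to-metre scale, synthetic masks, and Nelder-Mead optimiser are deliberate simplifications, not the optimisation used in the cited work:

```python
# Toy version of fitting a 3D primitive (a cylinder) to object segmentation
# masks and reading off capacity as the primitive's volume. The orthographic
# projection, fixed scale, and synthetic masks are simplifying assumptions.
import numpy as np
from scipy.optimize import minimize

H, W = 120, 120
PX_PER_M = 400.0  # assumed pixels-per-metre scale (hypothetical)


def silhouette(radius_m, height_m):
    """Soft orthographic side-view silhouette of an upright cylinder
    (a rectangle with ~1-pixel anti-aliased edges, keeping the loss smooth)."""
    yy, xx = np.mgrid[0:H, 0:W]
    half_w = radius_m * PX_PER_M        # half-width in pixels
    half_h = height_m * PX_PER_M / 2.0  # half-height in pixels
    inside_x = np.clip(half_w - np.abs(xx - W / 2.0) + 0.5, 0.0, 1.0)
    inside_y = np.clip(half_h - np.abs(yy - H / 2.0) + 0.5, 0.0, 1.0)
    return inside_x * inside_y


def loss(params, observed_masks):
    """Mean squared mismatch between the rendered and observed silhouettes."""
    radius_m, height_m = params
    rendered = silhouette(radius_m, height_m)
    return float(np.mean([(rendered - m) ** 2 for m in observed_masks]))


# Synthetic "observed" masks standing in for two segmented views of a
# cylinder with radius 5 cm and height 12 cm.
observed = [silhouette(0.05, 0.12), silhouette(0.05, 0.12)]

res = minimize(loss, x0=np.array([0.03, 0.08]), args=(observed,),
               method="Nelder-Mead")
r_est, h_est = res.x
capacity_l = np.pi * r_est ** 2 * h_est * 1000.0  # cylinder volume in litres
print(f"radius~{r_est:.3f} m, height~{h_est:.3f} m, capacity~{capacity_l:.2f} L")
```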