2021
DOI: 10.1007/978-3-030-68793-9_32

Audio-Visual Hybrid Approach for Filling Mass Estimation

Cited by 12 publications (31 citation statements)
References 6 publications
“…We compare ACC with 9 alternative approaches, namely ResNet (18 layers) [16], a shallower ResNet variant (14 layers), a ResNet (18 layers) pre-trained on ImageNet (ResNet-18) [17] and fine-tuned on the training split of CCM, VGG (11 layers) [14], Support Vector Machine (SVM) [18], Random Forest [19], K-Nearest Neighbours (kNN) [20], and the top-2 submissions of the 2020 CORSMAL Challenge 1, namely Because It's Tactile (BIT) [10], and HVRL [11]. SVM, kNN, Random Forest, VGG, and ResNet-based classifiers perform direct classification as a single model.…”
Section: A Methods Under Comparison (mentioning)
confidence: 99%
“…HVRL [11] uses a VGG-like architecture to classify the content type from multiple sequential audio frames converted to spectrogram, followed by majority voting. For each audio frame, the feature map prior to the fully connected layers is processed by an LSTM-based recurrent neural network, followed by a fully connected layer, to temporally estimate the content level.…”
Section: A Methods Under Comparison (mentioning)
confidence: 99%
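
The majority-voting step mentioned in the statement above can be illustrated with a minimal sketch. This is not the HVRL implementation: the function name `majority_vote`, the example labels, and the tie-breaking rule (the label seen first wins) are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(frame_labels):
    """Return the label predicted most often across audio frames.

    Assumption: ties are broken in favour of the label that appears
    first in the sequence (Counter keeps first-seen order for ties).
    """
    if not frame_labels:
        raise ValueError("need at least one per-frame prediction")
    return Counter(frame_labels).most_common(1)[0][0]


# Hypothetical per-frame content-type predictions for one recording.
predictions = ["rice", "pasta", "rice", "water", "rice"]
print(majority_vote(predictions))
```

Voting over frames makes the recording-level label robust to a few misclassified frames, at the cost of ignoring how confident each per-frame prediction was.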
“…combining adversarial training and transfer learning, can improve the classification accuracy [7]. Independent classification of content type and level can be achieved by using convolutional and recurrent neural networks with only audio as input data [4] or through late fusion of the predictions from both audio and visual features [5]. Alternatively, multiple multi-layer perceptrons can be trained with audio data and conditioned on the container category estimated from a majority voting of the object detection across the frames of multi-view sequences [6].…”
Section: Related Work (mentioning)
confidence: 99%
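
Late fusion, as referred to in the statement above, combines the outputs of separately trained audio and visual classifiers rather than their features. The sketch below assumes a weighted average of class probabilities; the cited work may fuse predictions differently, and `late_fusion`, `audio_weight`, and the example values are hypothetical.

```python
def late_fusion(audio_probs, visual_probs, audio_weight=0.5):
    """Fuse two per-class probability vectors by weighted averaging.

    Returns (index of the winning class, fused probability vector).
    Assumption: both vectors are over the same ordered set of classes.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_probs, visual_probs)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Hypothetical probabilities over three filling-level classes.
best, fused = late_fusion([0.7, 0.2, 0.1], [0.2, 0.5, 0.3])
print(best, fused)
```

Because each modality is scored independently before fusion, either classifier can be retrained or replaced without touching the other.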
“…Alternatively, multiple multi-layer perceptrons can be trained with audio data and conditioned on the container category estimated from a majority voting of the object detection across the frames of multi-view sequences [6]. Container capacity can be estimated as an approximation of a reconstructed shape [4], [5], [33]. An iterative approach minimises a 3D primitive to the real object shape by constraining to the object segmentation mask from two views of a wide-baseline stereo camera, using both RGB, depth, and infrared images [5].…”
Section: Related Workmentioning
confidence: 99%