The paper presents a learning-based method for computing a discriminative 3D point cloud descriptor for place recognition purposes. Existing methods, such as Point-NetVLAD, are based on unordered point cloud representation. They use PointNet as the first processing step to extract local features, which are later aggregated into a global descriptor. The PointNet architecture is not well suited to capture local geometric structures. Thus, state-of-the-art methods enhance vanilla PointNet architecture by adding different mechanism to capture local contextual information, such as graph convolutional networks or using hand-crafted features. We present an alternative approach, dubbed Min-kLoc3D, to compute a discriminative 3D point cloud descriptor, based on a sparse voxelized point cloud representation and sparse 3D convolutions. The proposed method has a simple and efficient architecture. Evaluation on standard benchmarks proves that MinkLoc3D outperforms current state-of-the-art. Our code is publicly available on the project website. 1
The paper describes a deep neural network-based detector dedicated for ball and players detection in high resolution, long shot, video recordings of soccer matches. The detector, dubbed FootAndBall, has an efficient fully convolutional architecture and can operate on input video stream with an arbitrary resolution. It produces ball confidence map encoding the position of the detected ball, player confidence map and player bounding boxes tensor encoding players' positions and bounding boxes. The network uses Feature Pyramid Network desing pattern, where lower level features with higher spatial resolution are combined with higher level features with bigger receptive field. This improves discriminability of small objects (the ball) as larger visual context around the object of interest is taken into account for the classification. Due to its specialized design, the network has two orders of magnitude less parameters than a generic deep neural network-based object detector, such as SSD or YOLO. This allows real-time processing of high resolution input video stream.
The paper describes a deep network based object detector specialized for ball detection in long shot videos. Due to its fully convolutional design, the method operates on images of any size and produces ball confidence map encoding the position of detected ball. The network uses hypercolumn concept, where feature maps from different hierarchy levels of the deep convolutional network are combined and jointly fed to the convolutional classification layer. This allows boosting the detection accuracy as larger visual context around the object of interest is taken into account. The method achieves state-of-the-art results when tested on publicly available ISSIA-CNR Soccer Dataset.
In the recent years, a number of novel, deep-learning based, interest point detectors, such as LIFT, DELF, Superpoint or LF-Net was proposed. However there's a lack of a standard benchmark to evaluate suitability of these novel keypoint detectors for real-live applications such as autonomous driving. Traditional benchmarks (e.g. Oxford VGG) are rather limited, as they consist of relatively few images of mostly planar scenes taken in favourable conditions. In this paper we verify if the recent, deep-learning based interest point detectors have the advantage over the traditional, hand-crafted keypoint detectors. To this end, we evaluate stability of a number of hand crafted and recent, learning-based interest point detectors on the street-level view ApolloScape dataset.
We introduce a discriminative multimodal descriptor based on a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, relocalization and loop closure purposes in robotics or autonomous vehicles applications. We use late fusion approach, where each modality is processed separately and fused in the final part of the processing pipeline. The proposed method achieves state-ofthe-art performance on standard place recognition benchmarks. We also identify dominating modality problem when training a multimodal descriptor. The problem manifests itself when the network focuses on a modality with a larger overfit to the training data. This drives the loss down during the training but leads to suboptimal performance on the evaluation set. In this work we describe how to detect and mitigate such risk when using a deep metric learning approach to train a multimodal neural network. Our code is publicly available on the project website. 1
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.