2021
DOI: 10.48550/arxiv.2104.04687
Preprint

Learning from 2D: Contrastive Pixel-to-Point Knowledge Transfer for 3D Pretraining

Abstract: Most 3D networks are trained from scratch owing to the lack of large-scale labeled datasets. In this paper, we present a novel 3D pretraining method that leverages 2D networks learned from rich 2D datasets. We propose pixel-to-point knowledge transfer to effectively utilize the 2D information by mapping the pixel-level and point-level features into the same embedding space. Due to the heterogeneous nature of 2D and 3D networks, we introduce the back-projection function to align the features bet…
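The abstract outlines the core mechanism: a back-projection step pairs each 3D point with the pixel it corresponds to, and a contrastive objective pulls the paired pixel-level and point-level features together in a shared embedding space. The snippet below is a minimal sketch of that idea, not the authors' implementation; the pinhole-projection helper, tensor shapes, frozen-2D-backbone assumption, and temperature value are illustrative assumptions.

# Minimal sketch (not the authors' code) of contrastive pixel-to-point
# knowledge transfer: pixel features from a frozen, pretrained 2D network
# are paired with point features via projection, and matched pairs are
# pulled together with an InfoNCE-style loss.
import torch
import torch.nn.functional as F

def project_points_to_pixels(points, intrinsics):
    """Hypothetical pinhole projection: (N, 3) camera-frame points -> (N, 2) pixel coords."""
    uv = points[:, :2] / points[:, 2:3].clamp(min=1e-6)   # perspective divide by depth
    u = intrinsics[0, 0] * uv[:, 0] + intrinsics[0, 2]
    v = intrinsics[1, 1] * uv[:, 1] + intrinsics[1, 2]
    return torch.stack([u, v], dim=1)

def pixel_to_point_nce(pixel_feat_map, point_feats, points, intrinsics, tau=0.07):
    """
    pixel_feat_map: (C, H, W) features from the frozen 2D backbone.
    point_feats:    (N, C)    features from the 3D network being pretrained,
                              already mapped into the shared embedding space.
    points:         (N, 3)    points expressed in the camera frame.
    """
    C, H, W = pixel_feat_map.shape
    uv = project_points_to_pixels(points, intrinsics).round().long()
    u = uv[:, 0].clamp(0, W - 1)
    v = uv[:, 1].clamp(0, H - 1)
    pixel_feats = pixel_feat_map[:, v, u].t()              # (N, C) pixel feature per point

    z2d = F.normalize(pixel_feats, dim=1)
    z3d = F.normalize(point_feats, dim=1)
    logits = z3d @ z2d.t() / tau                            # (N, N) similarity matrix
    targets = torch.arange(z3d.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                 # diagonal pairs are positives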

Cited by 8 publications (25 citation statements) | References 63 publications

Citation statements (ordered by relevance):
“…However, there is no intra-modal constraint on point clouds, which can lead to trivial solutions and limit performance improvement. On the contrary, (Liu et al 2021) propose a method to imbue the image prior to the 3D representation. All these methods motivate us to further explore multimodal pre-training strategies for more challenging outdoor multi-modal data.…”
Section: Multi-modal Representation Learning
Mentioning confidence: 99%
“…Compared with methods that focus on the indoor RGB-D data (Hou et al 2021b;Liu et al 2021), which utilize a U-Net (Ronneberger, Fischer, and Brox 2015) shape backbone to align the dense/local feature extracted by point-cloud feature extractor, our method only adopts a standard ResNet-50 encoder to extract the global features. Since the modules apply to global representations, alignment is achieved on the fly.…”
Section: Inter-modal Feature Interaction Module
Mentioning confidence: 99%
“…On a related note, for video as a 2D+1D data, TimeSFormer [4] proposes an inflated design from 2D transformers, plus memorizing information across frames using another transformer along the additional time dimension. [24] provides a pixel-to-point knowledge distillation by contrastive learning. It is also possible to apply a uniform transformer backbone in different data modalities, including 2D and 3D images, which is successfully shown by Preceiver [19] and Omnivore [15].…”
Section: Transferring Knowledge Between 2D and 3D
Mentioning confidence: 99%
“…For example, xMUDA [58] utilizes aligned images and point-clouds to transfer 2D feature map information for 3D semantic segmentation through knowledge distillation [59]. For cross-modal transfer learning [60], Liu et al [61] proposed pixel-to-point knowledge transfer (PPKT) from 2D to 3D which uses aligned RGB and RGB-D images during pretraining. Our work does not rely on joint image-point-cloud pretraining.…”
Section: Cross-modal Learning
Mentioning confidence: 99%