2021
DOI: 10.48550/arXiv.2106.04180
Preprint

Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models

Abstract: 3D point-clouds and 2D images are different visual representations of the physical world. While human vision can understand both representations, computer vision models designed for 2D image and 3D point-cloud understanding are quite different. Our paper investigates the potential for transfer between these two representations by empirically examining whether such transfer works, what factors affect the transfer performance, and how to make it work even better. We discovered that we can indeed use t…

Cited by 6 publications (8 citation statements) · References: 61 publications
“…Pretraining from 2D ViT: the transformer backbone can load pretrained multi-head self-attention weights without any difficulty, exactly as in standard 2D ImageNet pretraining loading. This differs from a direct 3D convolutional kernel inflation [33,45] in that it maintains pure reasoning from patch understanding. We observed that one needs a small learning rate in the first few epochs as a warm-up fine-tuning, to prevent catastrophic forgetting of the 2D pretrained ViT.…”
Section: Incorporating 2D Reasoning Knowledge
Mentioning confidence: 93%
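The recipe in that excerpt is straightforward to reproduce. Below is a minimal PyTorch sketch of its two steps: copying the pretrained transformer blocks from a 2D ViT into the point backbone, then scheduling a reduced learning rate for the first few epochs. The use of timm's `vit_base_patch16_224`, the 10%-of-base-lr warm-up factor, and the second untrained ViT standing in for the point-cloud transformer are all illustrative assumptions, not the cited paper's code.

```python
import timm
import torch
from torch.optim.lr_scheduler import LambdaLR

# Stand-in for a point-cloud transformer: an untrained ViT of the same
# architecture (hypothetical placeholder; a real point backbone would
# replace the patch embedding with a point tokenizer).
point_model = timm.create_model("vit_base_patch16_224", pretrained=False)

# Load an ImageNet-pretrained 2D ViT and copy over the transformer blocks.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
point_state = point_model.state_dict()
for name, weight in vit.state_dict().items():
    # Transfer only the backbone blocks (multi-head self-attention, MLPs,
    # layer norms); patch embedding and head stay randomly initialized.
    if name.startswith("blocks.") and point_state[name].shape == weight.shape:
        point_state[name] = weight
point_model.load_state_dict(point_state)

# Warm-up fine-tuning: a small learning rate in the first few epochs
# to prevent catastrophic forgetting of the 2D pretrained weights.
base_lr, warmup_epochs = 1e-4, 5
optimizer = torch.optim.AdamW(point_model.parameters(), lr=base_lr)
scheduler = LambdaLR(
    optimizer,
    lambda e: 0.1 + 0.9 * min(e, warmup_epochs) / warmup_epochs,
)
```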
“…3D multi-view fusion [35,20] has been viewed as one connection from the image domain to the 3D shape domain. A 2D-to-3D inflation solution for CNNs is discussed in Image2Point [45], where copying convolutional kernels along the inflated dimension helps 3D voxel/point-cloud understanding and requires less labeled training data in the target 3D task. On a related note, for video as 2D+1D data, TimeSformer [4] proposes an inflated design from 2D transformers, plus memorizing information across frames using another transformer along the additional time dimension.…”
Section: Transferring Knowledge Between 2D and 3D
Mentioning confidence: 99%
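For the kernel inflation this excerpt mentions, the core operation is copying a 2D kernel along a new depth axis. The sketch below illustrates the idea; the division by `depth` is an I3D-style normalization that preserves responses to constant inputs, and the exact copy scheme in Image2Point [45] may differ from this variant.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def inflate_conv2d(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    """Inflate a 2D convolution into 3D by replicating its kernel `depth`
    times along a new dimension, rescaled so the response to a constant
    input matches the original 2D layer."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(depth, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(depth // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, depth, kH, kW)
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: inflate the first layer of an ImageNet-pretrained ResNet.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
conv1_3d = inflate_conv2d(resnet.conv1, depth=7)
```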
“…Ref. [48] introduces 2D-assisted pre-training; ref. [49] expands 2D convolutions into 3D convolutions; and ref. [50] proposes a dense foreground-guided feature-imitation method and a sparse instance-distillation method to transfer spatial knowledge from LiDAR to multiple camera images for 3D object detection.…”
Section: Knowledge Distillation Methods
Mentioning confidence: 99%
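One concrete reading of the dense foreground-guided feature imitation attributed to ref. [50] is a masked feature regression: the student's (camera) features are pulled toward the teacher's (LiDAR) features only on foreground cells. The sketch below is a hypothetical reconstruction, not the paper's implementation; the mask derivation and the MSE distance are assumptions.

```python
import torch
import torch.nn.functional as F

def foreground_feature_imitation(student_feat: torch.Tensor,
                                 teacher_feat: torch.Tensor,
                                 fg_mask: torch.Tensor) -> torch.Tensor:
    """Masked feature-imitation loss: penalize student-teacher feature
    differences only where the foreground mask is set, normalized by the
    number of foreground cells. `fg_mask` is a (B, 1, H, W) binary map,
    e.g. derived from ground-truth box footprints (assumption)."""
    diff = F.mse_loss(student_feat, teacher_feat, reduction="none")
    masked = diff * fg_mask
    return masked.sum() / fg_mask.sum().clamp(min=1)

# Toy shapes: batch 2, 64 channels, 32x32 BEV grid.
student = torch.randn(2, 64, 32, 32)
teacher = torch.randn(2, 64, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.8).float()
loss = foreground_feature_imitation(student, teacher, mask)
```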
“…Since no prior studies have addressed open-vocabulary 3D point-cloud detection while avoiding the need for human annotations, we compare our model's performance with a 3D detection model (Lu et al. 2022) that utilizes human annotations and is exposed to open-set knowledge from other modalities. We select our open-testing classes following (Lu et al. 2022) and adopt the models they discuss (Liu et al. 2021; Qi et al. 2019; Zhang et al. 2020; Misra, Girdhar, and Joulin 2021; Zhang et al. 2022; Xu et al. 2021; Zhou et al. 2022; Lu et al. 2022) in an open-set setting for comparison. We denote our model trained in an annotation-free setting as FM-OV3D*, and FM-OV3D represents the model trained only utilizing knowledge blending, utilizing Detic (Zhou et al. 2022). Compared with (Lu et al. 2022), which achieves strong performance by leveraging the knowledge in 2D image datasets, our model blends the knowledge from both textual and 2D visual modalities.…”
Section: Performance on Open-Vocabulary 3D Detection
Mentioning confidence: 99%