2022
DOI: 10.48550/arxiv.2205.07417
Preprint

Transformers in 3D Point Clouds: A Survey

Abstract: In recent years, Transformer models have demonstrated a remarkable ability to model long-range dependencies, achieving strong results in both Natural Language Processing (NLP) and image processing. This success has sparked great interest among researchers in 3D point cloud processing, who have applied Transformers to various 3D tasks. Owing to their inherent permutation invariance and strong global feature learning ability, 3D Transformers are well suited to point cloud processing and analysis…

Cited by 10 publications (14 citation statements). References 104 publications.

“…Early fusion methods, for instance, involve rendering 3D information as multi-view 2D images with an additional depth channel (RGBD), which can then be processed by standard 2D convolutions (Cui et al, 2022) (MVCNN). Alternatively, 2D images can be rendered as a 3D graph, tree, or raster point cloud representation (Lu et al, 2022). However, these 2D methods often lose some 3D geometric context and struggle with per-point label prediction.…”
Section: Related Work (mentioning, confidence: 99%)
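
To make the early-fusion idea in the excerpt above concrete, here is a minimal sketch of projecting a point cloud into a single top-down depth channel that a standard 2D CNN could consume alongside RGB. The projection scheme, function name, and grid size are illustrative assumptions, not details from the cited works.

```python
import numpy as np

def project_to_depth(points: np.ndarray, resolution: int = 64) -> np.ndarray:
    """Orthographic top-down projection of an (N, 3) point cloud into a
    (resolution, resolution) depth image, keeping the highest z per cell."""
    xy = points[:, :2]
    mins, maxs = xy.min(axis=0), xy.max(axis=0)
    # Normalize x, y into integer grid coordinates in [0, resolution - 1].
    cells = ((xy - mins) / (maxs - mins + 1e-9) * (resolution - 1)).astype(int)
    depth = np.full((resolution, resolution), -np.inf)
    for (cx, cy), z in zip(cells, points[:, 2]):
        depth[cy, cx] = max(depth[cy, cx], z)  # nearest point to an overhead camera
    depth[np.isinf(depth)] = 0.0               # background value for empty cells
    return depth

cloud = np.random.rand(2048, 3)          # toy point cloud (hypothetical data)
depth_channel = project_to_depth(cloud)  # stack with RGB for a 2D CNN input
print(depth_channel.shape)               # (64, 64)
```
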
“…Recent advancements in MVCNN networks include ShapeConv (Cao et al, 2021) and FPS-Net (Xiao et al, 2021). On the other hand, late fusion combines the outputs of multiple networks and averages the results, for example, by integrating Point Transformers (Lu et al, 2022) with purely image-based networks. The advantage here is that each modality can be trained separately, leveraging numerous available benchmarks.…”
Section: Related Work (mentioning, confidence: 99%)
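
A minimal sketch of the late-fusion scheme described above, assuming two hypothetical, separately trained classifiers whose class probabilities are simply averaged; neither placeholder reflects the actual networks in the cited papers.

```python
import numpy as np

NUM_CLASSES = 10

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

def predict_from_points(points: np.ndarray) -> np.ndarray:
    """Stand-in for a point-based network (e.g., a Point Transformer)."""
    return softmax(np.random.default_rng(0).normal(size=NUM_CLASSES))

def predict_from_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a purely image-based network."""
    return softmax(np.random.default_rng(1).normal(size=NUM_CLASSES))

points = np.random.rand(1024, 3)         # toy inputs
image = np.random.rand(224, 224, 3)
# Late fusion: each modality is inferred independently, then averaged.
fused = 0.5 * (predict_from_points(points) + predict_from_image(image))
print(int(fused.argmax()))               # class chosen after fusion
```

The design advantage noted in the excerpt follows directly: because fusion happens only at the probability level, each branch can be trained and benchmarked on its own.
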
“…While PointNet mainly relies on local feature learning to aggregate global information progressively, it is still not efficient for robotic grasping, which requires an effective encoding of the global information of an input. In computer vision and graphics, researchers have explored the use of transformer models in point cloud processing [29], such as point cloud segmentation [20], classification [30], and shape completion [31]. However, the number of points in a point cloud input is not fixed and is too high to be processed efficiently with multi-head attention, which is especially serious in a real robot scenario.…”
Section: A 6-DoF Grasping On Point Cloud (mentioning, confidence: 99%)
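
The excerpt's concern, that raw point counts are variable and too large for multi-head attention, is often handled by downsampling to a fixed token budget first. Below is a sketch using farthest point sampling, a common choice; this is an illustrative assumption, not the specific mechanism of the cited grasping work.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Greedily pick k points, each maximizing distance to those already chosen."""
    chosen = [0]                                  # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())                  # farthest from the current set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

cloud = np.random.rand(100_000, 3)                # large, variable-size input
tokens = farthest_point_sampling(cloud, 512)      # fixed-size token set
print(tokens.shape)                               # (512, 3): small enough for attention
```
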
“…Additionally, positional encoding conveys information about token positions (see Figure 2). These benefits have spurred significant interest in transformers across various AI domains [76][77][78][79][80][81][82], notably the audio community. This has given rise to diverse architectures such as Wav2Vec [83], Whisper [84], FastPitch [85], MusicBERT [86], and others [26,87,88].…”
Section: Transformers For Audio Processing (mentioning, confidence: 99%)
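
For concreteness, here is a minimal sketch of the sinusoidal positional encoding the excerpt refers to (the fixed scheme from the original Transformer paper); the sequence length and model width below are illustrative, not taken from the cited audio models.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same).
    Assumes an even d_model."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

tokens = np.random.rand(50, 128)         # e.g., 50 audio-frame embeddings
tokens = tokens + sinusoidal_positional_encoding(50, 128)  # inject position info
```
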