Beyond Image to Depth: Improving Depth Prediction using Echoes

Parida, Kranti Kumar; Srivastava, Siddharth; Sharma, Gaurav

doi:10.1109/cvpr46437.2021.00817

Cited by 33 publications

(18 citation statements)

References 32 publications

“…Further, recently Liu et al [46] showed that, the local aggregation operators used in various point cloud processing techniques, if carefully tuned, provide similar performances. Parida et al [47] showed that using region/point based properties from echoes or type of material can help in learning more robust representations. Therefore, the proposed modifications, with explicit prior on geometry, can be extended to other point cloud based deep networks and potentially motivate future works with simpler and more efficient networks for processing point clouds.…”

Section: Discussion (mentioning)

confidence: 99%

Exploiting Local Geometry for Feature and Graph Construction for Better 3D Point Cloud Processing with Graph Neural Networks

Srivastava, Sharma

2021

Preprint

Self Cite


We propose simple yet effective improvements in point representations and local neighborhood graph construction within the general framework of graph neural networks (GNNs) for 3D point cloud processing. As a first contribution, we propose to augment the vertex representations with important local geometric information of the points, followed by nonlinear projection using an MLP. As a second contribution, we propose to improve the graph construction for GNNs for 3D point clouds. The existing methods work with a k-NN based approach for constructing the local neighborhood graph. We argue that it might lead to reduction in coverage in case of dense sampling by sensors in some regions of the scene. The proposed method aims to counter such problems and improve coverage in such cases. As the traditional GNNs were designed to work with general graphs, where vertices may have no geometric interpretations, we see both our proposals as augmenting the general graphs to incorporate the geometric nature of 3D point clouds. While being simple, we demonstrate with multiple challenging benchmarks, with relatively clean CAD models, as well as with real world noisy scans, that the proposed method achieves state of the art results on benchmarks for 3D classification (ModelNet40), part segmentation (ShapeNet) and semantic segmentation (Stanford 3D Indoor Scenes Dataset). We also show that the proposed network achieves faster training convergence, i.e. ∼40% fewer epochs for classification. The project details are available at https://siddharthsrivastava.github.io/publication/geomgcnn/

“…On the same line of thought researchers has unified VAE with variant of transformers for various other applications such as story generation [28], response generation [55], sentiment analysis [29], and 3D human pose generation [56]. Recently, audio and visual modalities have been used jointly to improve various tasks such as zero-shot learning [57], depth estimation [58] etc.…”

Section: Previous Work (mentioning)

confidence: 99%

Learning Speaker-specific Lip-to-Speech Generation

Varshney, Yadav, Namboodiri et al.

2022

Preprint


Understanding lip movement and inferring speech from it is notoriously difficult for the common person. The task of accurate lip-reading is helped by various cues from the speaker and their contextual or environmental setting. Every speaker has a different accent and speaking style, which can be inferred from their visual and speech features. This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers in an unconstrained, large-vocabulary setting. We model the frame sequence as a prior to the transformer in an auto-encoder setting and learn a joint embedding that exploits temporal properties of both audio and video. We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements. The predictive posterior thus gives us the generated speech in the speaker's speaking style. We have trained our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movement in an unconstrained natural setting. Extensive evaluation using various qualitative and quantitative metrics, together with human evaluation, shows that our method outperforms the state of the art on the Lip2Wav Chemistry dataset (large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics and marginally outperforms the state of the art on the GRID dataset.


“…Audio-Visual Learning: Recent research bridges the audio and vision for various cross-model learning tasks. Some have achieved remarkable performance in audio-visual action recognition [52,40,58], audio-visual correspondence [8,6,7], audio-visual synchronization [68,55,102], visual sound separation [31,93,92,99,39,100,79,101], visual to auditory [98,36,96,71,86], audio spatialisation [23,88,67,96,38,71,86,69], and audio-visual navigation [16,17,18,37,27,15,63]. In this work, we leverage audio-visual learning for better perceiving the geometrical structure of environment.…”

Section: Related Work (mentioning)

confidence: 99%

“…Several animal species, such as bats, dolphins, and some nocturnal birds, perceive spatial layout and locate objects through echolocation [73,38,69]. By using two ears to receive spatial sound, one can determine the objects' location by the Interaural Time Difference (ITD) and Interaural Level Difference (ILD).…”

Section: Introduction (mentioning)

confidence: 99%


Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision

Li, Rahtu, Zhao

2022

Preprint


This paper focuses on perceiving and navigating 3D environments using echoes and RGB images. In particular, we perform depth estimation by fusing an RGB image with echoes received from multiple orientations. Unlike previous works, we go beyond the field of view of the RGB image and estimate dense depth maps for substantially larger parts of the environment. We show that the echoes provide holistic and inexpensive information about the 3D structures, complementing the RGB image. Moreover, we study how echoes and the wide field-of-view depth maps can be utilised in robot navigation. We compare the proposed methods against recent baselines using two sets of challenging realistic 3D environments: Replica and Matterport3D. The implementation and pre-trained models will be made publicly available.
