2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00381

CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields

Cited by 168 publications (73 citation statements)
References 32 publications
“…Dream Fields [JMB*21] combines NeRF with CLIP to generate diverse 3D objects solely from natural language descriptions, by optimizing the radiance field via multi‐view constraints based on the CLIP scores on the image caption. CLIP‐NeRF [WCH*21] proposes a CLIP‐based shape and appearance mapper to control a conditional NeRF.…”
Section: Applications
confidence: 99%
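As context for the statement above, here is a minimal Python sketch (not the authors' released code) of the CLIP-based multi-view guidance that Dream Fields and CLIP-NeRF build on: views rendered from a radiance field are embedded with CLIP and pushed toward the embedding of a text prompt. The clip_guidance_loss helper and the random stand-in for rendered views are assumptions for illustration; only the open-source clip package (github.com/openai/CLIP) and PyTorch APIs used here are real.

import torch
import clip  # open-source CLIP package: github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def clip_guidance_loss(rendered_views: torch.Tensor, caption: str) -> torch.Tensor:
    # rendered_views: (N, 3, 224, 224) views rendered from the radiance field.
    # Returns 1 - mean cosine similarity between view embeddings and the caption.
    text = clip.tokenize([caption]).to(device)
    image_feat = model.encode_image(rendered_views)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (image_feat @ text_feat.T).mean()

# Random tensor standing in for NeRF renders; in the cited methods the views
# come from the radiance field, so gradients flow back into its parameters.
views = torch.rand(4, 3, 224, 224, device=device, requires_grad=True)
loss = clip_guidance_loss(views, "a red sports car")
loss.backward()

CLIP-NeRF, as the statement notes, additionally feeds the CLIP embeddings through learned shape and appearance mappers that control a conditional NeRF, rather than optimizing rendered views directly.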
“…Language-vision approaches. Self-supervised language-vision models have gone through rapid advances in recent years [62,66,48] due to their impressive generalizability. The seminal work CLIP [66] learns a joint language-vision embedding using more than 400 million text-image pairs. The learned representation is semantically meaningful and expressive, and has thus been adapted to various downstream tasks [82,71,49,79,65].…”
Section: Related Work
confidence: 99%
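To make "joint language-vision embedding" concrete, the following is a minimal Python sketch of the symmetric contrastive objective used to train such embeddings. It is an illustration under assumptions (toy random embeddings, a hypothetical contrastive_loss helper), not CLIP's actual training code.

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # img_emb, txt_emb: (B, D) embeddings of B matched image/text pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))       # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)  # image-to-text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy embeddings standing in for image- and text-encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))

Trained on hundreds of millions of such pairs, this objective is what yields the semantically expressive shared embedding space the citing work describes.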
“…Neural Radiance Fields (NeRFs) [14] have demonstrated encouraging progress in view synthesis by learning an implicit neural scene representation. Since their introduction, tremendous efforts have been made to improve their quality [28]-[31], speed [32]-[34], artistic effects [35]-[37], and generalization ability [17], [38]. Specifically, Mip-NeRF [39] proposes to cast a conical frustum instead of a single ray for anti-aliasing.…”
Section: A. Neural 3D Rendering
confidence: 99%
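For reference, the "implicit neural scene representation" in the statement above is trained through differentiable volume rendering. The following minimal Python sketch shows the standard per-ray compositing step; the composite_ray helper and toy inputs are assumptions for illustration, not the original NeRF or Mip-NeRF implementation.

import torch

def composite_ray(sigmas: torch.Tensor, colors: torch.Tensor,
                  deltas: torch.Tensor) -> torch.Tensor:
    # sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) sample spacings.
    alphas = 1.0 - torch.exp(-sigmas * deltas)  # opacity of each sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = trans * alphas                    # contribution of each sample
    return (weights[:, None] * colors).sum(dim=0)  # (3,) composited pixel color

pixel = composite_ray(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.03))

Mip-NeRF's change sits upstream of this step: each sample represents a conical frustum (encoded with an integrated positional encoding over a volume) rather than a point on a single ray, which is what provides the anti-aliasing the statement mentions.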