Semantic-driven editing approaches, such as stroke-based scene editing [36,41,70], text-driven image synthesis and editing [1,53,56], and attribute-based face editing [28,64], have greatly improved the ease of artistic creation. However, despite the great success of 2D image editing and neural rendering techniques [14,44], comparable editing abilities in 3D remain limited: (1) existing methods require laborious annotation, such as image masks [28,75] or mesh vertices [73,78], to achieve the desired manipulation; (2) they perform global style transfer [12,13,16,21,79] while ignoring the semantic meaning of individual object parts (e.g., the windows and tires of a vehicle should be textured differently); (3) they can edit within a category by learning a textured 3D latent representation (e.g., 3D-aware GANs for faces, cars, etc.) [6,8,9,18,48,60,63,64], or at a coarse level [37,68] via basic color assignment or object-level disentanglement [32], but struggle to perform texture editing on objects with photo-realistic textures or out-of-distribution characteristics.