2. Related Work
2.1. NeRF Editing
NeRF editing, as pioneered by [5, 14, 38], builds on the capabilities of Neural Radiance Fields (NeRFs) to enable intricate manipulations of 3D scenes and objects, including object removal, geometry transformations, and appearance editing. Editing the implicit density and radiance field holds enormous promise across many domains, including virtual reality, augmented reality, content creation, and beyond. Nevertheless, these prior works either edit the geometry of content that already exists in the NeRF or edit only the appearance of the scene; none of them targets generating novel content that is consistent with the underlying NeRF reconstructed from the input images to be inpainted. For this task, [41] generates new texture on an existing object but does not generate new geometry, while [27] generates promptable geometric content around a target object, yet the generation is confined to the object's vicinity and cannot replace the object itself. In our work, we focus on a subtask of NeRF editing: generative inpainting in NeRF.
2.2. Inpainting Techniques
The volume-renderable representation of NeRF and its training from posed 2D observations introduce a natural bridge between the represented 3D or 4D scene and its multiview observations. This bridge allows image editing techniques to be applied to NeRF inpainting.
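For reference, the bridge is the standard NeRF volume-rendering approximation, which expresses each observed pixel as a function of the underlying field (the notation here is ours and is not defined elsewhere in this excerpt): a pixel along camera ray $\mathbf{r}$ is rendered as $\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\,\mathbf{c}_i$ with accumulated transmittance $T_i = \exp\!\big(-\sum_{j<i} \sigma_j \delta_j\big)$, where $\sigma_i$ and $\mathbf{c}_i$ are the density and color sampled along the ray and $\delta_i$ is the inter-sample distance. Because every training view is rendered from the same field, edits applied to the 2D views can be distilled back into the field by re-optimizing it on the edited images.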
Recent advances in 2D image inpainting have primarily relied on generative models [29], particularly Generative Adversarial Networks (GANs) [36, 37] and Stable Diffusion (SD) models [17, 23]. These models have demonstrated remarkable capability in generating visually plausible predictions for missing pixels. In particular, the capacity to model complex data distributions has drawn notable attention to Stable Diffusion [23], an extension of the denoising diffusion probabilistic model that supports inpainting guided by text prompts. It produces high-quality inpainted pixels and samples by running a controlled, conditional diffusion process; denoising diffusion probabilistic models and noise-conditioned score networks are foundational here, as together they define the iterative denoising process that drives generation. Researchers have leveraged Stable Diffusion inpainting for a variety of applications, including semantic inpainting, texture synthesis, and realistic object removal. These applications require understanding the contextual information in an image and generating content that blends seamlessly into the existing scene.
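To make the text-guided inpainting interface concrete, the sketch below inpaints a single view with the Stable Diffusion inpainting pipeline from the Hugging Face diffusers library. The checkpoint name, file paths, prompt, and sampler settings are illustrative assumptions, not the configuration of any work cited above.

    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image
    import torch

    # Load a text-guided inpainting pipeline (the checkpoint name is an example).
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    # The image is the view to edit; white pixels in the mask mark the region to fill.
    image = Image.open("view.png").convert("RGB").resize((512, 512))
    mask = Image.open("mask.png").convert("L").resize((512, 512))

    # The text prompt conditions the denoising process on the desired new content.
    result = pipe(
        prompt="a wooden bench in a park",
        image=image,
        mask_image=mask,
        num_inference_steps=50,
        guidance_scale=7.5,
    ).images[0]
    result.save("inpainted_view.png")

Such an operation acts on one view at a time; nothing in the 2D model enforces agreement across views, which is precisely the gap discussed next.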
However, most of these works are limited to synthesis in the image domain and do not explicitly consider maintaining fidelity to the 3D structure across viewpoints. Several prior works [19, 31] address the pure NeRF inpainting problem, which removes target objects and infers the background without generating prompt-guided content. In contrast, our study not only goes beyond traditional 2D image inpainting by addressing inpainting within 3D and 4D scenes while maintaining view consistency under perspective changes, but is also promptable: the inpainted content matches the text description while remaining consistent with the underlying 3D or 4D spatio-temporal scene to be inpainted.