TL;DR: Modular 3D scene generation from text using 360° panoramas, object-centric decomposition, and hybrid inpainting for immersive navigation and editing.



Overview

Our method tackles text-to-3D scene generation by first creating a panoramic image with a fine-tuned diffusion model; the panorama serves as a geometric and stylistic prior. Relevant object instances are then segmented, reconstructed in high fidelity, and placed in the background environment. The background is optimized for immersive viewing with a combination of 2D and 3D inpainting techniques.



Panorama Generation

We guide 360° panorama generation with a perspective image derived from the same prompt, which provides soft conditioning without enforcing pixel-level alignment. We achieve this with an IP-Adapter-style mechanism that adds separate cross-attention layers to all transformer blocks of the diffusion model. We jointly fine-tune the panoramic LoRA and the IP-Adapter on random perspective renders of the equirectangular image, which enables effective style transfer from perspective to panoramic images.
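To make the "random perspective renders of the equirectangular image" concrete, here is a minimal sketch of how a pinhole view can be sampled from a panorama. The function name `perspective_from_equirect`, the axis conventions (x right, y down, z forward), and the nearest-neighbour sampling are assumptions made for illustration; they are not taken from the actual training code.

```python
import numpy as np

def perspective_from_equirect(pano, fov_deg, yaw_deg, pitch_deg, out_hw):
    """Sample a pinhole view from an equirectangular panorama (nearest neighbour).

    Hypothetical helper: positive yaw turns right, positive pitch looks up.
    """
    H, W = pano.shape[:2]
    h, w = out_hw
    # Focal length in pixels, derived from the horizontal field of view.
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))
    # Pixel grid -> unit camera rays (x right, y down, z forward).
    xs = np.arange(w) - (w - 1) / 2.0
    ys = np.arange(h) - (h - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate the rays: pitch about the x axis first, then yaw about the y axis.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(pitch), -np.sin(pitch)],
                   [0.0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(yaw), 0.0, np.cos(yaw)]])
    dirs = dirs @ (Ry @ Rx).T
    # Rays -> longitude in [-pi, pi] and latitude in [-pi/2, pi/2] (positive = down).
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    # Spherical coordinates -> panorama pixel indices (longitude wraps around).
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]

# Usage: draw one random view, as one might when building training pairs.
rng = np.random.default_rng(0)
pano = rng.random((256, 512, 3))
view = perspective_from_equirect(
    pano,
    fov_deg=rng.uniform(60, 100),
    yaw_deg=rng.uniform(-180, 180),
    pitch_deg=rng.uniform(-30, 30),
    out_hw=(128, 128),
)
```

The sampling ranges for field of view, yaw, and pitch are placeholders. A real pipeline would likely use bilinear interpolation instead of nearest-neighbour lookup.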


Panorama Examples

Instance Generation

Object Generation

Our object reconstruction pipeline uses the panorama and its style information to generate a high-resolution reference image for multi-view generation. The generated multi-view images are then converted into 3D Gaussian splats by a reconstruction pipeline. Finally, we align the generated object with the original instance and place it in the scene.
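The text does not specify how the final alignment step is carried out. As one possible illustration, the sketch below estimates a least-squares similarity transform (scale, rotation, translation) between two sets of corresponding 3D points using the standard Umeyama closed-form solution. The availability of point correspondences between the generated object and the original instance is an assumption, and `similarity_align` is a hypothetical helper name.

```python
import numpy as np

def similarity_align(src, dst):
    """Return (s, R, t) minimizing sum ||dst_i - (s * R @ src_i + t)||^2.

    Assumes src and dst are (N, 3) arrays of corresponding points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    # Cross-covariance between the centred target and source points.
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    # Scale: ratio of aligned covariance trace to the source variance.
    var_s = (xs ** 2).sum() / len(src)
    s = (S * d).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Usage: map object-space points into the scene frame.
rng = np.random.default_rng(0)
obj_pts = rng.normal(size=(20, 3))
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
scene_pts = 2.0 * obj_pts @ R_true.T + np.array([1.0, 2.0, 3.0])
s, R, t = similarity_align(obj_pts, scene_pts)
placed = s * obj_pts @ R.T + t
```

With Gaussian splats, the same transform would be applied to the Gaussian centers, with the rotation and scale also applied to each Gaussian's covariance.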


Hybrid Inpainting

Our hybrid inpainting strategy combines 2D and 3D techniques: large-scale holes resulting from object removal are inpainted in the 360° image for global coherence, while smaller disocclusions caused by the 3D projection are addressed with 3D inpainting. The process has three steps: initialization and pretuning of the 3D Gaussian splatting (3DGS) point cloud, incremental inpainting to populate disoccluded regions with new Gaussians, and multi-view fine-tuning with score distillation to ensure consistency across viewpoints.
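To illustrate where the smaller disocclusions come from, the following sketch lifts a panorama to a point cloud and re-renders it from a shifted viewpoint; pixels that receive no point are the regions left to 3D inpainting. It assumes a per-pixel depth map for the panorama (not described in the text) and uses one-pixel point splats, so it also flags pure sampling gaps that real Gaussians with spatial extent would cover. The function names are hypothetical.

```python
import numpy as np

def equirect_dirs(H, W):
    """Unit ray direction for every pixel centre of an H x W equirectangular image."""
    v, u = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
    lon = (u / W - 0.5) * 2 * np.pi
    lat = (v / H - 0.5) * np.pi
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

def disocclusion_mask(depth, t):
    """True where no panorama point lands after moving the camera by t."""
    H, W = depth.shape
    # Lift pixels to 3D, then express the points relative to the moved camera.
    pts = equirect_dirs(H, W) * depth[..., None] - np.asarray(t, dtype=float)
    r = np.linalg.norm(pts, axis=-1)
    lon = np.arctan2(pts[..., 0], pts[..., 2])
    lat = np.arcsin(np.clip(pts[..., 1] / np.maximum(r, 1e-9), -1.0, 1.0))
    # Project back to equirectangular pixel indices.
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    hit = np.zeros((H, W), dtype=bool)
    hit[v, u] = True
    return ~hit

# Usage: no holes at the original viewpoint, holes after stepping forward.
depth = np.ones((16, 32))
holes_origin = disocclusion_mask(depth, (0.0, 0.0, 0.0))
holes_moved = disocclusion_mask(depth, (0.0, 0.0, 0.5))
```

In this toy setup the mask at the original viewpoint is empty, and holes appear once the camera moves. In the hybrid scheme described above, such small regions would be filled in 3D, while large holes from object removal are handled beforehand in the 360° image.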
