TL;DR: Modular 3D scene generation from text using 360° panoramas, object-centric decomposition, and hybrid inpainting for immersive navigation and editing.
Our method tackles text-to-3D scene generation by first creating a panoramic image with a fine-tuned diffusion model, which serves as a geometric and stylistic prior. Relevant object instances are segmented, reconstructed at high fidelity, and placed back into the background environment. The background is optimized for immersive viewing with a combination of 2D and 3D inpainting techniques.
We guide the 360° panorama generation process using a perspective image derived from the same prompt, providing soft conditioning without enforcing pixel-level alignment. We achieve this with an IP-Adapter-style mechanism that introduces separate cross-attention layers in all transformer blocks of the diffusion model. We jointly fine-tune the panoramic LoRA and the IP-Adapter on random perspective renders of the equirectangular image, enabling effective style transfer from perspective to panoramic images.
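The decoupled cross-attention idea can be sketched in a few lines. This is a minimal NumPy illustration, not the actual implementation: the class name, the `scale` knob, and the single-head, unbatched shapes are our own simplifications. Text tokens pass through the (frozen) key/value projections of the pretrained model, while the perspective-image tokens get their own trainable key/value projections; the two attention outputs are summed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

class DecoupledCrossAttention:
    """IP-Adapter-style block (illustrative): separate K/V projections for
    text and image conditions, outputs summed with an image weight `scale`."""
    def __init__(self, d_model, rng, scale=1.0):
        self.scale = scale
        init = lambda: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.wq = init()                     # shared query projection
        self.wk_txt, self.wv_txt = init(), init()  # frozen text K/V
        self.wk_img, self.wv_img = init(), init()  # new, trainable image K/V

    def __call__(self, x, text_tokens, image_tokens):
        q = x @ self.wq
        out_txt = attention(q, text_tokens @ self.wk_txt, text_tokens @ self.wv_txt)
        out_img = attention(q, image_tokens @ self.wk_img, image_tokens @ self.wv_img)
        return out_txt + self.scale * out_img

rng = np.random.default_rng(0)
block = DecoupledCrossAttention(d_model=64, rng=rng)
x = rng.standard_normal((16, 64))    # latent (spatial) tokens
txt = rng.standard_normal((8, 64))   # text embeddings
img = rng.standard_normal((4, 64))   # perspective-image embeddings
out = block(x, txt, img)
print(out.shape)  # (16, 64)
```

Setting `scale=0` recovers the original text-only cross-attention, which is why this style of conditioning is "soft": it biases the panorama toward the reference view without enforcing pixel-level alignment.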
Our object reconstruction pipeline leverages the panorama and its style information to generate a high-resolution reference image for multi-view generation. The generated multi-view images are then converted into 3D Gaussian splats by a reconstruction pipeline. Finally, we align each generated object with its original counterpart in the panorama and place it in the scene.
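The text above does not spell out the alignment step; a common choice for registering a reconstructed object to its original counterpart is to estimate a similarity transform (scale, rotation, translation) from corresponding 3D points, e.g. Gaussian centers, via Umeyama's closed-form solution. The sketch below is one such plausible implementation, not necessarily the method used here:

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate scale s, rotation R, translation t minimizing
    ||dst_i - (s * R @ src_i + t)||^2 over corresponding points (n, 3)."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)          # cross-covariance (3, 3)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                          # keep R a proper rotation
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Synthetic check: recover a known scale/rotation/translation.
rng = np.random.default_rng(0)
src = rng.standard_normal((100, 3))
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
dst = 2.5 * src @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = umeyama_alignment(src, dst)
print(round(s, 3))  # 2.5
```

With noise-free correspondences the transform is recovered exactly; in practice the correspondences would come from matching the reconstructed splat to the segmented object's estimated depth.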
Our hybrid inpainting strategy combines 2D and 3D techniques: large holes left by object removal are inpainted in the 360° image for global coherence, while smaller disocclusions caused by projecting the panorama into 3D are addressed with 3D inpainting. The process proceeds in three steps: initialization and pre-tuning of the 3DGS point cloud, incremental inpainting that populates disoccluded regions with new Gaussians, and multi-view fine-tuning with score distillation to ensure consistency across viewpoints.
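Both the 2D inpainting stage and the perspective renders used throughout rely on resampling the equirectangular panorama into pinhole views. A minimal nearest-neighbour version of that projection is sketched below; the function name and the camera conventions (x right, y down, z forward) are our own assumptions:

```python
import numpy as np

def equirect_to_perspective(pano, yaw, pitch, fov_deg, out_hw):
    """Render a pinhole view (H, W, C) from an equirectangular panorama
    (Hp, Wp, C) via nearest-neighbour lookup along camera rays."""
    H, W = out_hw
    f = 0.5 * W / np.tan(np.radians(fov_deg) / 2)   # focal length in pixels
    # Camera-space ray directions (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(W) - W / 2 + 0.5,
                       np.arange(H) - H / 2 + 0.5)
    dirs = np.stack([u, v, np.full_like(u, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays by yaw (around y) and pitch (around x).
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    dirs = dirs @ (Ry @ Rx).T
    # Direction -> longitude/latitude -> panorama pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])      # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))     # [-pi/2, pi/2]
    Hp, Wp = pano.shape[:2]
    px = ((lon / (2 * np.pi) + 0.5) * Wp).astype(int) % Wp
    py = ((lat / np.pi + 0.5) * Hp).clip(0, Hp - 1).astype(int)
    return pano[py, px]

# Toy panorama: left hemisphere red, right hemisphere black.
pano = np.zeros((64, 128, 3))
pano[:, :64] = [1.0, 0.0, 0.0]
view = equirect_to_perspective(pano, yaw=0.0, pitch=0.0, fov_deg=90, out_hw=(32, 32))
print(view.shape)  # (32, 32, 3)
```

A production pipeline would use bilinear sampling and batched rotations, but the geometry is the same: holes visible in such renders after object removal are what the 2D pass fills globally, while the remaining view-dependent disocclusions fall to the 3D pass.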
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting. ECCV 2024.
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation. SIGGRAPH 2025.
Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. ICCV 2023.
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. arXiv 2024.