Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors


1Bar-Ilan University
2NVIDIA

* These authors contributed equally to this work.

We present Lay-A-Scene, a method for addressing the 3D object arrangement task. The key idea is that, since the objects already exist, we can use a text-to-image model to generate a single image that serves as a layout for arranging the objects.

Abstract


Generating 3D visual scenes is at the forefront of visual generative AI, but current 3D generation techniques struggle with scenes that contain multiple high-resolution objects. Here we introduce Lay-A-Scene, which addresses the task of open-set 3D object arrangement, arranging objects unseen during training. Given a set of 3D objects, the task is to find a plausible arrangement of these objects in a scene. We address this task by leveraging pre-trained text-to-image models: we personalize the model and show how to generate images of a scene that contain multiple predefined objects without neglect. We then describe how to infer the 3D poses and arrangement of the objects from a generated 2D image by finding a consistent projection of the objects onto the 2D scene. We evaluate the quality of Lay-A-Scene using 3D objects from Objaverse and human raters, and find that it often generates coherent and feasible 3D object arrangements.


Overview


[Figure: Generated Image (left), Output Scene (right)]
Left: a scene image generated by the text-to-image model with these objects. Right: the completed scene.



In many cases, one may have access to existing 3D object models and is only interested in finding plausible arrangements of those objects. This setup, which we call Open-set 3D Arrangement, can be viewed as the 3D counterpart of the personalization problem in image generation. It can also be viewed as a problem complementary to scene-generation methods, which generate objects in a scene given a layout; 3D arrangement is the reverse problem: finding a plausible layout for given 3D objects. 3D object arrangement has previously been addressed by training models on dedicated datasets. The question remains whether this problem can be solved by distilling information from current text-to-image models.

Here we describe Lay-A-Scene, a method for addressing the 3D object arrangement task. The key idea is that, since the objects already exist, we can use a text-to-image model to generate a single image that serves as a layout for arranging the objects. Instead of running such a model repeatedly to generate a full scene, we generate an image with a plausible layout in a single forward pass. We then match each object to its appearance in the generated image to infer the object's position. Combining these positions with the objects yields a full scene.
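As a rough illustration of this first step, the snippet below generates a layout image with a personalized diffusion model in a single forward pass, assuming a checkpoint already fine-tuned on renderings of the given objects (e.g. with a DreamBooth-style procedure). The checkpoint path and placeholder tokens are hypothetical and not part of the paper's released code.

    # Illustrative sketch only: generate a layout image from a personalized
    # text-to-image model in a single forward pass.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "path/to/personalized-checkpoint",   # hypothetical checkpoint fine-tuned on the objects
        torch_dtype=torch.float16,
    ).to("cuda")

    # Placeholder tokens stand for the personalized object concepts.
    prompt = "a photo of a <sofa-token> next to a <table-token> in a living room"
    layout_image = pipe(prompt, num_inference_steps=50).images[0]
    layout_image.save("generated_layout.png")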

The Lay-A-Scene approach has several advantages. Unlike other 3D arrangement methods, it can handle new objects through personalization, without retraining the foundation model. Unlike graph-based 3D arrangement methods, which expect users to provide a set of spatial relations, Lay-A-Scene generates the scene layout from the prior learned by text-to-image diffusion models.

Several key challenges emerge when taking this approach. First, how can we infer the 3D poses of the given objects from a generated image, where they may appear distorted or colliding? We show how to add prior information about physical considerations as soft constraints to the Perspective-n-Point (PnP) optimization, yielding scenes that are more coherent and natural. We call this approach Side-information PnP and find that it greatly improves the generated scenes. Second, text-to-image models often suffer from entity neglect, where generated images do not contain all entities mentioned in their prompts. This problem becomes more severe when objects are unusual or incoherent, and when the number of objects grows. In our setting, the PnP procedure can be used to filter out images with entity neglect.
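To make the soft-constraint idea concrete, here is a minimal sketch assuming 2D-3D correspondences for one object are already available: the residual combines standard reprojection error with a soft floor-contact penalty. The floor term, its weight, and all names are illustrative assumptions, not the paper's exact formulation.

    # Sketch of pose estimation with a soft physical side constraint.
    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def project(points_3d, rvec, tvec, K):
        """Project 3D points given a rotation vector, translation, and intrinsics K."""
        R = Rotation.from_rotvec(rvec).as_matrix()
        cam = points_3d @ R.T + tvec           # points in the camera frame, shape (N, 3)
        uv = cam @ K.T                         # apply the pinhole intrinsics
        return uv[:, :2] / uv[:, 2:3]          # perspective divide

    def residuals(params, pts_3d, pts_2d, K, ground_y, w_ground=10.0):
        rvec, tvec = params[:3], params[3:]
        # Data term: reprojection error w.r.t. the 2D matches in the generated image.
        reproj = (project(pts_3d, rvec, tvec, K) - pts_2d).ravel()
        # Side-information term (soft constraint): keep the object's lowest point on the floor.
        R = Rotation.from_rotvec(rvec).as_matrix()
        lowest = (pts_3d @ R.T + tvec)[:, 1].min()
        return np.concatenate([reproj, [w_ground * (lowest - ground_y)]])

    def solve_pose(pts_3d, pts_2d, K, ground_y=0.0):
        """Recover a 6-DoF pose (rotation vector, translation) with a soft floor constraint."""
        x0 = np.zeros(6)
        x0[5] = 3.0                            # start the object a few units in front of the camera
        sol = least_squares(residuals, x0, args=(pts_3d, pts_2d, K, ground_y))
        return sol.x[:3], sol.x[3:]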

Pipeline


Lay-A-Scene consists of two phases. First, the given objects are used to personalize a text-to-image model, and a scene image is generated. In the second phase, we find a transformation $T_i$ for each 3D object $i$ that matches the 2D arrangement in the generated scene image. $T_i$ is found with our Side-information PnP, by matching DIFT representations of the objects and the scene image.
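As an illustration of the matching step, the sketch below assumes DIFT-style feature maps have already been extracted for a rendering of each object and for the generated scene image; features are matched by cosine similarity, and the resulting correspondences are passed to the solve_pose routine sketched earlier. The helper names, the top-k filtering, and the back-projection of rendered pixels to 3D mesh points are hypothetical choices, not the paper's implementation.

    # Sketch of the second phase: feature matching followed by soft-constrained PnP.
    import numpy as np

    def dense_matches(feat_obj, feat_scene, top_k=300):
        """Match each object pixel to its nearest scene pixel by cosine similarity."""
        # feat_obj: (N, D) features at N object pixels; feat_scene: (M, D) at M scene pixels.
        fo = feat_obj / np.linalg.norm(feat_obj, axis=1, keepdims=True)
        fs = feat_scene / np.linalg.norm(feat_scene, axis=1, keepdims=True)
        sim = fo @ fs.T                        # (N, M) cosine similarities
        best = sim.argmax(axis=1)              # best-matching scene pixel per object pixel
        conf = sim.max(axis=1)
        keep = np.argsort(-conf)[:top_k]       # keep only the most confident matches
        return keep, best[keep]

    def estimate_transform(obj_points_3d, obj_feats, scene_pixels_2d, scene_feats, K):
        """Estimate a pose T_i aligning the object with its appearance in the scene image."""
        idx_obj, idx_scene = dense_matches(obj_feats, scene_feats)
        pts_3d = obj_points_3d[idx_obj]        # 3D points on the mesh, one per object pixel
        pts_2d = scene_pixels_2d[idx_scene]    # matched 2D locations in the scene image
        return solve_pose(pts_3d, pts_2d, K)   # soft-constrained PnP from the earlier sketch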

Results


We evaluate our method and baselines on 3D furniture meshes taken from the Objaverse dataset, a repository featuring over 800K high-quality 3D assets.

Objaverse dataset


[Figure: examples arranged in two columns, each showing a Generated Image (left) and an Output Scene (right)]
In each column, left: a scene image generated by the text-to-image model with these objects; right: the completed scene. For more examples, please refer to the paper.

Citation


@article{rahamim2024lay,
    title={Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors},
    author={Rahamim, Ohad and Segev, Hilit and Achituve, Idan and Atzmon, Yuval and Kasten, Yoni and Chechik, Gal},
    journal={arXiv preprint arXiv:2406.00687},
    year={2024}
}