Recent advancements in text-to-image generation have enabled significant progress in zero-shot 3D shape generation. This is achieved by score distillation, a methodology that uses pre-trained text-to-image diffusion models to optimize the parameters of a 3D neural representation, e.g. a Neural Radiance Field (NeRF). While showing promising results, existing methods often fail to preserve the geometry of complex shapes, such as human bodies. To address this challenge, we present ZeroAvatar, a method that introduces an explicit 3D human body prior into the optimization process. Specifically, we first estimate and refine the parameters of a parametric human body from a single image. Then, during optimization, we use the posed parametric body as an additional geometric constraint to regularize both the diffusion model and the underlying density field. Lastly, we propose a UV-guided texture regularization term to further guide the completion of texture on invisible body parts. We show that ZeroAvatar significantly enhances the robustness and 3D consistency of optimization-based image-to-3D avatar generation, outperforming existing zero-shot image-to-3D methods.
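For context, score distillation optimizes the parameters θ of the 3D representation by backpropagating a denoising residual from the frozen diffusion model. A standard form of the SDS gradient (following DreamFusion-style formulations; this notation is ours, not taken from the paper) is

\[ \nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \right], \]

where x is a rendered view of the 3D representation, x_t its noised version at diffusion timestep t, y the conditioning (text, and in ZeroAvatar also the body-model depth), \hat{\epsilon}_\phi the noise predicted by the frozen diffusion model, and w(t) a timestep-dependent weight.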
System figure: Given a single image, we first estimate the body pose, shape, and UV map of the person. We use the estimated body mesh to initialize the density field of the 3D representation. We then optimize the appearance and refine the geometry of the person using Score Distillation Sampling, during which depth information from the posed body model is used as conditioning in addition to the text (i.e., the image caption). When optimizing from a sampled novel view, we additionally use the inferred UVs on the invisible body parts to aid the learning of appearance.
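The loop below is a minimal sketch of the pipeline described in the figure, assuming hypothetical helpers (estimate_body_and_uv, init_density_from_mesh, sample_camera, render_view, sds_loss, uv_texture_loss) supplied by the caller; these names are illustrative and do not come from the ZeroAvatar code.

```python
import torch

def optimize_avatar(image, caption,
                    estimate_body_and_uv, init_density_from_mesh, sample_camera,
                    render_view, sds_loss, uv_texture_loss,
                    n_iters=5000, lam_uv=1.0):
    """Sketch of a ZeroAvatar-style loop; all helpers are assumed, not the paper's API."""
    # 1) Estimate a posed parametric body and its UV map from the single input image.
    body_mesh, uv_map = estimate_body_and_uv(image)

    # 2) Initialize the density field of the 3D representation from the posed body mesh.
    nerf = init_density_from_mesh(body_mesh)
    optimizer = torch.optim.Adam(nerf.parameters(), lr=1e-3)

    for _ in range(n_iters):
        camera = sample_camera()                                # random novel viewpoint
        rgb, body_depth = render_view(nerf, body_mesh, camera)  # rendered color + body-model depth

        # 3) Depth-conditioned score distillation: the frozen diffusion model scores the
        #    rendered view given the caption and the posed body's depth map.
        loss = sds_loss(rgb, caption, depth_cond=body_depth)

        # 4) UV-guided texture regularization on body parts not visible in the input view.
        loss = loss + lam_uv * uv_texture_loss(rgb, uv_map, image, camera)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return nerf
```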
@article{weng2023zeroavatar,
  title={ZeroAvatar: Zero-shot 3D Avatar Generation from a Single Image},
  author={Weng, Zhenzhen and Wang, Zeyu and Yeung, Serena},
  journal={arXiv preprint arXiv:2305.16411},
  year={2023}
}
Webpage template from Deep Image Prior.