Zhenzhen Weng

I am an ML engineer at Waymo Perception working on multi-modal (image, video, text) foundation models for self-driving.

I received my Ph.D. in Computational and Mathematical Engineering (ICME) from Stanford University, where I was advised by Prof. Serena Yeung.

I am broadly interested in computer vision and machine learning. My Ph.D. research focused on human-centric 3D perception and generative models. I interned at Waymo Research, where I worked on human-centric representation learning from LiDAR data, and at Adobe Research, where I worked on generalizable single-view human NeRF prediction.

Prior to my Ph.D., I received a B.S. in Computer Science and a B.S. in Mathematics from Carnegie Mellon University. I also previously worked as a Research Engineer for a fund manager on the East Coast, on large-scale backtesting and portfolio optimization services.

Email  /  LinkedIn  /  Twitter  /  Google Scholar

News

June, 2024: Joined Waymo Perception to work on large multi-modal foundation models for self-driving.

May, 2024: Defended my dissertation: Human-Centric Perception with Limited Supervision: Improving Generalizability In the Wild.

Oct, 2023: 1 paper accepted to 3DV 2024.

June, 2023: Attended CVPR 2023 in Vancouver, Canada.

Feb, 2023: 2 papers accepted to CVPR 2023.

Research and Publications
Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM
Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang
Preprint | Website

Reconstructing 3D humans from a single image has been extensively investigated. However, existing approaches often fall short in capturing fine geometry and appearance details, hallucinating occluded parts with plausible details, and generalizing to unseen and in-the-wild datasets. We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image. Leveraging the power of a state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e., Stable Diffusion), our method is able to capture humans without any template prior, e.g., SMPL, and effectively enhance occluded parts with rich and realistic details.

Diffusion-HPC: Synthetic Data Generation for Human Mesh Recovery in Challenging Domains
Zhenzhen Weng, Laura Bravo, Serena Yeung
3DV 2024 (Spotlight) | Website | Code

Recent text-to-image generative models such as Stable Diffusion often struggle to preserve plausible human structure in their generations. We propose Diffusion model with Human Pose Correction (Diffusion-HPC), a method that generates photo-realistic images with plausibly posed humans by injecting prior knowledge about human body structure. The generated image-mesh pairs are well-suited for the downstream human mesh recovery task.

ZeroAvatar: Zero-shot 3D Avatar Generation from a Single Image
Zhenzhen Weng, Zeyu Wang, Serena Yeung
Preprint | Website

We present ZeroAvatar, a method that introduces an explicit 3D human body prior into the optimization process. We show that ZeroAvatar significantly enhances the robustness and 3D consistency of optimization-based image-to-3D avatar generation, outperforming existing zero-shot image-to-3D methods.

3D Human Keypoints Estimation from Point Clouds in the Wild without Human Labels
Zhenzhen Weng, Alexander S. Gorban, Jingwei Ji, Mahyar Najibi, Yin Zhou, Dragomir Anguelov
Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Paper | Project

We propose GC-KPL (Geometry Consistency inspired Key Point Learning). By training on the large Waymo Open Dataset (WOD) training set without any annotated keypoints, we attain reasonable performance compared to the fully supervised approach. Further, the backbone benefits from the unsupervised training and is useful in downstream few-shot learning of keypoints, where fine-tuning on only 10 percent of the labeled training data gives performance comparable to fine-tuning on the entire set.

NeMo: 3D Neural Motion Fields from Multiple Video Instances of the Same Action
Kuan-Chieh Wang, Zhenzhen Weng, Maria Xenochristou, Joao Pedro Araujo, Jeffrey Gu, C. Karen Liu, Serena Yeung
Conference on Computer Vision and Pattern Recognition (CVPR) (Highlight), 2023
Paper | Website

We aim to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field. It is optimized to represent the underlying 3D motions across a set of videos of the same action.

Domain Adaptive 3D Pose Augmentation for In-the-wild Human Mesh Recovery
Zhenzhen Weng, Kuan-Chieh (Jackson) Wang, Angjoo Kanazawa, Serena Yeung
International Conference on 3D Vision (3DV), 2022
Paper | Project Page | Code

We propose Domain Adaptive 3D Pose Augmentation (DAPA), a data augmentation method that draws on the strength of methods based on synthetic datasets by getting direct supervision from the synthesized meshes.

Holistic 3D Human and Scene Mesh Estimation from Single View Images
Zhenzhen Weng, Serena Yeung
Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Paper

We propose a holistically trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes.

Unsupervised Discovery of the Long-Tail in Instance Segmentation Using Hierarchical Self-Supervision
Zhenzhen Weng, Mehmet Giray Ogut, Shai Limonchik, Serena Yeung
Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Paper

We propose a method that can perform unsupervised discovery of long-tail categories in instance segmentation, through learning instance embeddings of masked regions.

Slice-based learning: A programming model for residual learning in critical data slices
Vincent S Chen, Sen Wu, Zhenzhen Weng, Alexander Ratner, Christopher Ré
The Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2019
Paper

We introduce the challenge of improving slice-specific performance without damaging the overall model quality, and propose the first programming abstraction and machine learning model to support these actions.

Utilizing Weak Supervision to Infer Complex Objects and Situations in Autonomous Driving Data
Zhenzhen Weng, Paroma Varma, Alexander Masalov, Jeffrey Ota, Christopher Ré
IEEE Intelligent Vehicles Symposium (IEEE IV), 2019
Paper

We introduce weak supervision heuristics as a methodology to infer complex objects and situations by combining simpler outputs from current, state-of-the-art object detectors.

Work Experience

Research Scientist Intern @ Adobe Research, San Jose, CA, Jun - Sept, 2023

Research Intern (Perception) @ Waymo, Mountain View, CA, Jun - Nov, 2022

Machine Learning Engineer @ VMware, Palo Alto, CA, Jun - Sept, 2019

Research Engineer @ AQR, Greenwich, CT, 2016 - 2018



Webpage template and source code from Jon Barron.