GaussianAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh



Abstract


We introduce GaussianAvatar, a novel approach for real-time, memory-efficient, high-quality animatable human modeling. GaussianAvatar takes a single monocular video as input and creates a digital avatar that can be re-articulated in new poses and rendered in real time from novel viewpoints, while integrating seamlessly with rasterization-based graphics pipelines. Central to our method is the Gaussians-on-Mesh representation, a hybrid 3D model that combines the rendering quality and speed of Gaussian splatting with the geometry modeling and compatibility of deformable meshes. We evaluate GaussianAvatar on the ZJU-MoCap dataset and on various YouTube videos. GaussianAvatar matches or surpasses current monocular human modeling algorithms in rendering quality and significantly outperforms them in computational efficiency (43 FPS) while remaining memory-efficient (3.63 MB per subject).
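To illustrate the Gaussians-on-Mesh idea described above, the sketch below attaches Gaussian centers to mesh triangles via barycentric coordinates, so that deforming the mesh carries the Gaussians along with it. This is a minimal, hypothetical sketch, not the paper's implementation: the function name is made up, and a rigid translation stands in for the learned skinning-based deformation.

```python
import numpy as np

def gaussian_centers(vertices, faces, face_idx, bary):
    """Place each Gaussian at barycentric coordinates on its mesh face.

    vertices: (V, 3) mesh vertex positions
    faces:    (F, 3) vertex indices per triangle
    face_idx: (G,)   index of the face each Gaussian is attached to
    bary:     (G, 3) barycentric weights (each row sums to 1)
    """
    tri = vertices[faces[face_idx]]            # (G, 3, 3) triangle corners
    return np.einsum('gc,gcd->gd', bary, tri)  # (G, 3) Gaussian centers

# A single triangle with two Gaussians: one at a corner, one at the centroid.
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
faces = np.array([[0, 1, 2]])
fidx = np.array([0, 0])
bary = np.array([[1., 0., 0.], [1/3, 1/3, 1/3]])

centers = gaussian_centers(verts, faces, fidx, bary)

# Deforming the mesh (here a rigid translation standing in for skinning)
# moves the attached Gaussians along with the surface.
deformed = verts + np.array([0., 0., 1.])
moved = gaussian_centers(deformed, faces, fidx, bary)
```

Because the Gaussians are parameterized relative to the mesh surface, any mesh deformation (e.g., pose-driven skinning) automatically re-poses the splats, which is what makes the representation animatable.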


Novel Pose Synthesis

Below we present novel pose synthesis results. Target poses are generated with the Human Motion Diffusion Model (MDM) from text prompts drawn from HumanML3D. Each example compares the target pose with renderings from HumanNeRF, MonoHuman, and GaussianAvatar (Ours).

- ZJU-MoCap Subject 377: "Moving hands around near face."
- ZJU-MoCap Subject 387: "A person crosses their arms."
- ZJU-MoCap Subject 392: "Running forward in a diagonal line."
- ZJU-MoCap Subject 393: "He is running down then stopped and moved his left hand."
- ZJU-MoCap Subject 394: "A man crouches down as he walks forward and kicks with his right leg."

YouTube Videos

Each example compares the target pose with renderings from HumanNeRF, MonoHuman, and GaussianAvatar (Ours).

- "A person remained sitting down."
- Two additional examples without text prompts.


Novel View Synthesis

Below we present 360° free-viewpoint renderings along with the rendered normal maps.

Each subject is shown in two rows: the reference image with renderings from HumanNeRF, MonoHuman, and GaussianAvatar (Ours), and the pseudo ground-truth with results from the same three methods.

- ZJU-MoCap Subject 377
- ZJU-MoCap Subject 386
- ZJU-MoCap Subject 387
- ZJU-MoCap Subject 392
- ZJU-MoCap Subject 393
- ZJU-MoCap Subject 394