river-zhang / gta

[NeurIPS 23] Official repository for NeurIPS 2023 paper "Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction"

Home Page: https://river-zhang.github.io/GTA-projectpage/

Languages: Python 98.14% Shell 0.53% GLSL 1.34%
Topics: 3d clothed-humans clothed-people-digitalization digital human reconstruction vision neurips-2023 python pytorch

gta's Introduction

Official Implementation for GTA (NeurIPS 2023)

Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction (NeurIPS 2023) [Paper] [Website]

News

  • [2023/12/29] We are thrilled to announce the release of our latest model, SIFU, offering enhanced geometry and texture reconstruction capabilities!
  • [2023/11/30] We release the code, including inference and testing.
  • [2023/9/26] We release the arXiv version (paper on arXiv).

TODO

  • [ ] Hugging Face
  • [√] Release code
  • [√] Release paper

Introduction

Reconstructing 3D clothed human avatars from single images is a challenging task, especially when encountering complex poses and loose clothing. Current methods exhibit limitations in performance, largely attributable to their dependence on insufficient 2D image features and inconsistent query methods. To address this, we present the Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), a novel transformer-based architecture that reconstructs clothed human avatars from monocular images. Our approach leverages transformer architectures by utilizing a Vision Transformer model as an encoder for capturing global-correlated image features. Subsequently, our innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane features, using learnable embeddings as queries for cross-plane generation. To effectively enhance feature fusion with the tri-plane 3D feature and human body prior, we propose a hybrid prior fusion strategy combining spatial and prior-enhanced queries, leveraging the benefits of spatial localization and human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0 datasets illustrate that our method outperforms state-of-the-art approaches in both geometry and texture reconstruction, exhibiting high robustness to challenging poses and loose clothing, and producing higher-resolution textures.
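
For intuition, here is a minimal, hypothetical sketch of the 3D-decoupling idea described above: learnable per-plane query embeddings cross-attend to the global image tokens from the ViT encoder to produce the xy/xz/yz tri-plane features. Module names, shapes, and the attention layout are illustrative assumptions, not the actual implementation.

import torch
import torch.nn as nn

class TriplaneDecouplingSketch(nn.Module):
    """Illustrative only: learnable per-plane queries cross-attend to global image tokens."""

    def __init__(self, dim=256, plane_hw=32, num_heads=8):
        super().__init__()
        # One learnable query set per plane (xy, xz, yz); sizes are assumptions.
        self.plane_queries = nn.Parameter(torch.randn(3, plane_hw * plane_hw, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        self.plane_hw = plane_hw

    def forward(self, image_tokens):
        # image_tokens: [B, N, dim] global-correlated features from a ViT encoder.
        B, _, dim = image_tokens.shape
        planes = []
        for q in self.plane_queries:                      # iterate over xy / xz / yz queries
            q = q.unsqueeze(0).expand(B, -1, -1)          # [B, HW, dim]
            feat, _ = self.cross_attn(q, image_tokens, image_tokens)
            feat = self.proj(feat)                        # [B, HW, dim]
            planes.append(feat.transpose(1, 2).reshape(B, dim, self.plane_hw, self.plane_hw))
        return planes  # three [B, dim, H, W] plane feature maps

# Toy usage: 2 images, 196 ViT tokens of width 256.
tokens = torch.randn(2, 196, 256)
xy, xz, yz = TriplaneDecouplingSketch()(tokens)
print(xy.shape)  # torch.Size([2, 256, 32, 32])

At query time, a 3D point would be projected onto each plane, the sampled features fused with SMPL-X prior features via the hybrid spatial / prior-enhanced queries described above, and fed to the implicit-function MLP; that part is omitted here.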

Framework overview (figure)

Installation

git clone https://github.com/River-Zhang/GTA.git
sudo apt-get install libeigen3-dev ffmpeg
cd GTA
conda env create -f environment.yaml
conda activate gta
pip install -r requirements.txt

Please download the checkpoints and place them in ./data/ckpt.

Please follow the instructions in ICON to download the extra data, such as the HPS and SMPL models.

Inference

python -m apps.infer -cfg ./configs/GTA.yaml -gpu 0 -in_dir ./examples -out_dir ./results -loop_smpl 100 -loop_cloth 200 -hps_type pixie

Testing

# 1. Register at http://icon.is.tue.mpg.de/ or https://cape.is.tue.mpg.de/
# 2. Download CAPE testset (Easy: 50, Hard: 100)
bash fetch_cape.sh 
# 3. Check CAPE testset via 3D visualization
python -m lib.dataloader_demo -v -c ./configs/train/GTA.yaml -d cape

# evaluation
python -m apps.train -cfg ./configs/train/GTA.yaml -test

# TIP: the default "mcube_res" is 256 in apps/train.

Bibtex

If this work is helpful for your research, please consider citing the following BibTeX entry.

@inproceedings{zhang2023globalcorrelated,
      title={Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction}, 
      author={Zhang, Zechuan and Sun, Li and Yang, Zongxin and Chen, Ling and Yang, Yi},
      booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
      year={2023}
}

Acknowledgement

Our implementation is mainly based on ICON and PIFu; many thanks to these and the other open-source projects we build on.

In addition, we sincerely thank Yuliang Xiu, the author of ICON and ECON, for resolving many of our concerns in GitHub issues.

More related papers about 3D avatars: https://github.com/pansanity666/Awesome-Avatars


gta's Issues

About training

If I don't use your checkpoint, how long does it take to reproduce your results?

get_sampling_geo

The get_sampling_geo method is used to obtain the geometric sampling points in the code. smplx_verts can be obtained from smplx_param using the compute_smpl_verts method, which calls load_fit_body to obtain the SMPL-X mesh.

The first problem: load_fit_body returns a trimesh (smpl_mesh) as smpl_out, but compute_smpl_verts returns only the vertices of smpl_out. Does that mean the trimesh construction for smpl_verts is unnecessary? We could simply return smpl_verts as smpl_out and have compute_smpl_verts return that.

The second problem: in load_fit_body, what are param['scale'] and param['translation'], and how were they obtained? Why can't we just use the vertices produced by smpl_model directly from the SMPL-X parameters?
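
For orientation, a hypothetical sketch of how such per-scan scale/translation parameters are typically used to move SMPL-X model output into the scan's coordinate frame; the exact order of operations is a convention of the fitting pipeline, so this is not the repository's code:

import numpy as np
import trimesh

def align_smplx_to_scan(smplx_vertices, smplx_faces, scale, translation):
    """Map canonical SMPL-X vertices into the scan's coordinate frame.

    THuman2.0-style fits store a per-scan global scale and translation.
    Whether translation is applied before or after scaling depends on the
    fit convention; verify against the actual parameter files.
    """
    verts = (np.asarray(smplx_vertices) + np.asarray(translation)) * scale
    return trimesh.Trimesh(verts, smplx_faces, process=False)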

Question about evaluation.

Hi, thanks for your work.

I noticed you used GT normals while testing on THuman2.0 for normal evaluation of different views (as Table 2 in the paper). I wonder if you are using GT normals too or only GT SMPL-X while testing on CAPE (as in Table 1)?


The test results seem to be inconsistent with the paper

I tried your code on Ubuntu 20.04 with CUDA 11.8 and PyTorch 2.0.1 and got the following results:

GTA.mp4

but the corresponding results in your paper are obviously much better, as shown in the attached comparison.

I want to know whether my result is reasonable.

Ask for training code

Excellent work! I would like to ask whether the training code is included in the repository.

Question about numbers, evaluation.

Hello, I have two questions regarding the test results in your two papers, GTA and SIFU.

  1. I have seen Issue #5 and your explanation there, but I still don't understand why your GTA numbers for THuman 2.0 are different.
    In the GTA paper they are Chamfer 0.814, P2S 0.862, Normal 0.055; in SIFU they are 0.73, 0.72, 0.04.

  2. I noticed that in the evaluation code for both of your papers you use GT front and back normals. This differs from ICON's evaluation protocol, which uses estimated normals. (YuliangXiu/ICON#183)
    With estimated normals, your GTA numbers for THuman 2.0 should be 1.12, 1.12, 0.065.

Could you please clarify these two points? Thank you!

About the SMPL-X model

In which part of your code do you use a PIXIE-like model to estimate SMPL-X parameters? I have read your code, and it seems that during training you use the SMPL-X parameters from the THuman2.0 dataset as the prior-enhanced query; only at inference time, since the input is not an image from the dataset, is the PIXIE model used to predict the SMPL-X parameters for the prior-enhanced query. Is my understanding correct?

ViT encoder input

I found that the front/back normal maps are used, together with the image, as input to the encoder to generate the tri-plane features. I want to know why. Does this improve the results?
Reading the code, I found that after the tri-plane feature maps are obtained, they are concatenated with the normal features.
If I only feed the image through VitPose's pre-trained ViT encoder to get image features, then pass them through the three decoders to get tri-plane features and concatenate them with the normal features, would that be acceptable?
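
Not the repository's code, but a toy illustration of the kind of channel-wise fusion being asked about, concatenating normal-map features onto the decoupled plane features; the shapes and the question of which planes actually receive the normal features are assumptions:

import torch

# Hypothetical shapes for illustration only.
B, C, H, W = 2, 64, 128, 128
plane_feats = [torch.randn(B, C, H, W) for _ in range(3)]  # xy / xz / yz plane features
normal_feat = torch.randn(B, 32, H, W)                     # features from front/back normal maps

# Concatenate the normal features onto each plane along the channel dimension.
fused = [torch.cat([p, normal_feat], dim=1) for p in plane_feats]
print([tuple(f.shape) for f in fused])  # (2, 96, 128, 128) for each plane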

About PSNR

Amazing work! Can you provide code for calculating PSNR, or tell me where to find the relevant code?
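
For reference, PSNR is straightforward to compute from the per-pixel MSE; a minimal NumPy version (not the repository's evaluation code) looks like this:

import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)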

inference time

Hi, thanks for your great work.
I read your paper, but didn't see any mention of inference time for a single image.
Do you have a rough idea of what it would be on a modern GPU?

thanks!

About HGPIFu

When estimating the human body geometry, the query operation is performed in HGPIFuNet.
The first step is to project the sampled point set onto the image plane, but I found that the transforms parameter is None, so in xyz = self.projection(points, calibs, transforms) the points are only rotated and translated.
Are all the points in the world coordinate system? The projection operation only converts points from the world coordinate system to the camera coordinate system through rotation and translation, and does not project them further onto the image plane. Please give me some help.
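
For context, GTA builds on PIFu/ICON, which assume an orthographic camera: the 4x4 calibration matrix already maps world coordinates into normalized image space, so applying the rotation and translation is the projection and no perspective divide is needed; transforms is only an optional extra 2D affine (e.g. for crops). A sketch in that spirit (argument shapes are assumptions, not the exact repository code):

import torch

def orthographic_project(points, calibs, transforms=None):
    """Project world-space points using an orthographic calibration matrix.

    points:     [B, 3, N] sampled points in world coordinates
    calibs:     [B, 4, 4] matrices mapping world -> normalized image space
    transforms: optional [B, 2, 3] extra 2D affine applied in image space
    """
    rot = calibs[:, :3, :3]                   # [B, 3, 3]
    trans = calibs[:, :3, 3:4]                # [B, 3, 1]
    pts = torch.baddbmm(trans, rot, points)   # [B, 3, N]; xy are already image coordinates
    if transforms is not None:
        scale = transforms[:, :2, :2]
        shift = transforms[:, :2, 2:3]
        pts[:, :2, :] = torch.baddbmm(shift, scale, pts[:, :2, :])
    return pts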

Expecting a demo

Hi River-Zhang,
I am studying papers on human body reconstruction and have read yours. It is very nice work! May I ask when you will release the open-source code? Looking forward to your demo!

About the version of pymeshlab

During inference, I first installed the current pymeshlab release (2023.12) and encountered:
AttributeError: 'pymeshlab.pmeshlab.MeshSet' object has no attribute 'laplacian_smooth'

I then downgraded pymeshlab to a 2022 release and the inference finished successfully.
The new pymeshlab version appears to be incompatible with the code.

Strange surfaces in inference results

Thanks for your great work!
I encountered some problems during inference.
Would you please help me?
My inference results have strange surfaces, just as in #7.

I noticed that an ERROR occurred, although it didn't stop the inference:

Resume MLP weights from ./data/ckpt/GTA.ckpt
Resume normal model from ./data/ckpt/normal.ckpt
Using pixie as HPS Estimator

Dataset Size: 5
  0%|                                                                                                                                                            | 0/5 [00:00<?, ?it/s]
2024-03-02 16:02:28.809516226 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:515 CreateExecutionProviderInstance] Failed to create TensorrtExecutionProvider. 
Please reference https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#requirements to ensure all dependencies are met.
1eca7a73c3c61d9debde493de37c7d99:   0%|                                                                                                                          | 0/5 [00:06<?, ?it/s
Body Fitting --- normal: 0.089 | silhouette: 0.043 | Total: 0.132:  12%|█████████▎                                                                    | 12/100 [00:01<00:13,  6.32it/s]
1eca7a73c3c61d9debde493de37c7d99:   0%|                                                                                                                          | 0/5 [00:08<?, ?it/s]

Is it normal that this error occurred during inference?

I tried changing the onnxruntime-gpu and TensorRT versions, but it didn't help.

My environment is:
CUDA 11.7
pytorch 1.13.1
onnxruntime-gpu 1.14
TensorRT 8.5.3.1
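
For what it's worth, the logged line is a non-fatal onnxruntime warning: the TensorRT execution provider could not be created, and onnxruntime falls back to the next available provider. One way to avoid it, assuming you can reach the place where the ONNX session is created (the model path below is a placeholder, not a file from this repository), is to request only the providers you actually have:

import onnxruntime as ort

# Placeholder model path; point this at the ONNX model the pipeline loads.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # skip TensorRT
)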

SSIM and LPIPS metrics

I observed that the SSIM and LPIPS metrics for GTA on the THuman2.0 dataset have not been made available. Could you kindly provide this data or share the rendered results?
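
In case it helps while waiting for the official numbers, a minimal sketch of computing SSIM and LPIPS on rendered images with off-the-shelf packages (scikit-image and the lpips package); this is not the paper's evaluation code, and details such as resolution and masking will affect the numbers:

import torch
import lpips                                              # pip install lpips
from skimage.metrics import structural_similarity as ssim

def image_metrics(pred, gt):
    """pred, gt: float32 NumPy arrays in [0, 1], shape [H, W, 3]."""
    ssim_val = ssim(pred, gt, channel_axis=2, data_range=1.0)

    # LPIPS expects NCHW tensors in [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2 - 1
    lpips_fn = lpips.LPIPS(net="alex")
    lpips_val = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return ssim_val, lpips_val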

there is no .npy file

Thanks for sharing this work.
When I tried to run infer.py, I found that the expected .npy files and directories are missing at the location shown in the attached screenshot.

THuman 2.0 evaluation protocol

Hi authors, I have a question regarding the THuman 2.0 evaluation protocol in your Table 1.

  • How do you create the train/test split?
  • For the test set, how many views do you render per subject, and what is the FOV?

Thank you in advance!
