cvlab-columbia / zero123
Zero-1-to-3: Zero-shot One Image to 3D Object (ICCV 2023)
Home Page: https://zero123.cs.columbia.edu/
License: MIT License
Hi, thanks for sharing this fantastic work.
I am trying to understand the cross-attention used in the model. Here, the conditional context has only one token, i.e., the CLIP embedding concatenated with the pose. As a result, the cross-attention matrix has size [num_spatial_token, 1] and all attention weights are one. The output just copies the value vector to each spatial location (or adds it, if we consider the residual connection). It seems that K and Q are redundant in this case. Is this the expected behavior?
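To illustrate what I mean, here is a minimal sketch (the shapes are illustrative, not taken from the code): with a single context token, the softmax runs over one element, so every attention weight is exactly 1 regardless of Q and K.

```python
import torch

# One conditioning token (CLIP embedding + pose) attended to by all spatial tokens.
num_spatial_tokens, d = 4096, 64
q = torch.randn(1, num_spatial_tokens, d)  # queries from spatial features
k = torch.randn(1, 1, d)                   # single conditioning token
v = torch.randn(1, 1, d)

attn = torch.softmax(q @ k.transpose(-1, -2) / d**0.5, dim=-1)  # [1, 4096, 1]
print(attn.unique())  # tensor([1.]) -- softmax over a single key is always 1
out = attn @ v        # every spatial location receives the same value vector
```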
As you mentioned, 'If you are trying out images of humans, especially faces, note that it is unfortunately not the intended use case. We would encourage trying out images of everyday objects instead, or even artworks.' I am wondering whether it is inherently hard to generate 3D models of human faces, or whether the model is still in training and will be released later?
Thanks for the awesome project!
I wonder whether the training code for this study has also been released.
Again, thanks for the awesome and very cool project.
Hi!
I have seen the updated README about training. Will you also release the command for testing?
Also, may I know how to use ObjaverseDataModuleFromConfig myself for more flexible control, instead of going through trainer.fit()?
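For reference, this is roughly what I am trying to do: a minimal sketch, assuming the data module can be instantiated straight from the data: section of the training config via the repo's instantiate_from_config helper.

```python
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Build the data module directly from the config instead of trainer.fit().
config = OmegaConf.load("configs/sd-objaverse-finetune-c_concat-256.yaml")
data = instantiate_from_config(config.data)
data.prepare_data()
data.setup()

# Iterate the training dataloader manually.
loader = data.train_dataloader()
batch = next(iter(loader))
print({k: getattr(v, "shape", type(v)) for k, v in batch.items()})
```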
Nice work!
The script cannot find configs/sd-objaverse-finetune-c_concat-256.yaml or the corresponding last.ckpt model.
Will they be shared later?
Hi, first of all, great respect for this work!
I am just wondering: how many Objaverse objects are used in this project? I see the Objaverse dataset is quite large lol.
Perhaps you could provide a tutorial on processing the Objaverse dataset (rendering, etc.)?
Best,
Hi @ruoshiliu, thank you for making your incredible work public!
I'm looking into the code for 3D reconstruction based on SJC, and I'm wondering if there is code for extracting the mesh from the generated 3D representation, like what is shown on the project page. Or, if there are any libraries or code repositories you used, that information would also be of great help. Thank you!
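In case it helps frame the question, this is the kind of thing I have in mind (not your code, just a common sketch that runs marching cubes over a queried density grid; the threshold and the grid source are my assumptions):

```python
import numpy as np
from skimage import measure
import trimesh

def extract_mesh(density: np.ndarray, threshold: float = 10.0, out_path: str = "mesh.obj"):
    """density: [N, N, N] grid of volume densities queried from the 3D representation."""
    verts, faces, normals, _ = measure.marching_cubes(density, level=threshold)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
    mesh.export(out_path)
    return mesh
```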
I have checked Appendix Section A of the paper regarding the camera coordinate system. In gradio_new.py, is the rotation matrix (camera_R) in w2c format?
In gradio_new.py, you use camera_R to obtain the camera extrinsics, but the T matrix of the camera is not given. So I wanted to use pytorch3d's look_at_view_transform to compute the extrinsic matrix [R|t] via: R, T = look_at_view_transform(dist=radius, elev=90 - polar_deg, azim=azimuth_deg, up=((0, 1, 0),)). But I got a different result. In the image below, R_jisuan was computed with pytorch3d, and camera_R_ori is the result from your code. How can I compute the same R with pytorch3d, or alternatively, how can I compute the T matrix of the camera with your code?
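For completeness, here is the relationship I am relying on (standard math; the spherical convention below is my assumption, not something I verified against gradio_new.py):

```python
import numpy as np

# If camera_R is a world-to-camera rotation and the camera sits on a sphere,
# the missing translation is T = -R @ C, where C is the camera center in
# world coordinates. The colatitude convention here is an assumption.
def camera_center(radius, polar_deg, azimuth_deg):
    polar, azim = np.deg2rad(polar_deg), np.deg2rad(azimuth_deg)
    return radius * np.array([
        np.sin(polar) * np.cos(azim),
        np.sin(polar) * np.sin(azim),
        np.cos(polar),
    ])

C = camera_center(radius=1.5, polar_deg=30.0, azimuth_deg=45.0)
# T = -camera_R @ C  # completes [R|T] if camera_R is indeed w2c
```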
Looking forward to your reply!
Thanks for this great work 🚀!
However, the current weight files saved on Google Drive are very large and slow to download. Is there any plan to upload the weight files to Hugging Face or to split them by submodule?
On your webpage you show comparisons against Point-E for single-view 3D reconstruction. I was wondering which version of Point-E this is: did you train your own or use the open-source release? And in either case, what were the specifics?
Thank you
Awesome work! Due to the high VRAM requirement, I'm unable to try it myself. Is it possible to export the model and use it in other pipelines, like games?
Thanks for the interesting work and for releasing the code. Could you please provide more details on the evaluation steps in your experiments? Additionally, I noticed that you evaluated the method on the GSO dataset and the RTMV dataset, but it seems that the GSO dataset is a subset of the RTMV dataset. Could you please clarify the evaluation process and provide more information on how you evaluated on these two datasets?
Hi,
Thanks for the amazing work. I would like to use your released data for some experiments. I projected the vertices of the 3D object bounding box onto the 2D image, but I got results like this:
I set the intrinsic matrix as K = np.asarray([512, 0, 256, 0, 512, 256, 0, 0, 1]).reshape(3, 3), which could be wrong. Is there any normalization step in your implementation? Could you please provide the correct intrinsic matrix?
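For reference, this is the projection I used (a generic pinhole sketch, not your code, assuming a world-to-camera [R|t]):

```python
import numpy as np

def project(points_w, K, R, t):
    """Project Nx3 world points with intrinsics K and w2c extrinsics [R|t]."""
    pts_c = points_w @ R.T + t      # world -> camera
    uv = pts_c @ K.T                # camera -> image plane
    return uv[:, :2] / uv[:, 2:3]   # perspective divide

K = np.asarray([512, 0, 256, 0, 512, 256, 0, 0, 1], dtype=float).reshape(3, 3)
```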
Looking forward to your reply!
Hi, I appreciate your excellent work. I tried to download your data and run your training script. However, the released data has no file 'valid_paths.json' (I managed to find it in another issue); is it just all the subfolder names in the views_release folder? Also, in your dataset code:
```python
if self.paths[index][-2:] == '_1':  # dirty fix for rendering dataset twice
    total_view = 8
else:
    total_view = 4
```
the total number of views per subfolder is taken to be 8 or 4. However, in my downloaded data most scenes have more than 10 views; do you only randomly sample from 4 or 8 views during training?
Thanks a lot for releasing the training scripts and data. When I run the training scripts, I get some errors.
We originally extracted 100 objects from the dataset and ran on 2 GPUs with 32GB of VRAM each, setting batch_size=16 and num_workers=8 in configs/sd-objaverse-finetune-c_concat-256.yaml, and it ran successfully.
However, when we then use a larger dataset of 2000 objects while keeping batch_size=16 and num_workers=8, we get the error RuntimeError: DataLoader worker (pid(s) 8298) exited unexpectedly. I modified configs/sd-objaverse-finetune-c_concat-256.yaml: we tried setting batch_size or num_workers smaller, but we still get the same error. We also tried num_workers=0, and it did not work. Only with batch_size=1 does the error not occur. So, I want to know where the error may come from, and whether there is any solution to this problem.
I had some trouble debugging it, but it seems like an out-of-memory error, as nvidia-smi shows 24.2 GB filled, leading to a failure:
cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
or, with cuDNN disabled:
cuBLAS error: CUBLAS_STATUS_NOT_INITIALIZED
in the F.conv2d call.
If you got it running on an RTX 3090, could you share the configuration changes?
I used a 128x128 RGB image as the smallest test.
Hello and thank you for your very nice paper!
I am trying to train a view-conditional network using the code in zero123, but something is going wrong. I am wondering if my command is wrong, or if there is something else that I am missing.
I am using the command:
python main.py --base configs/sd-objaverse-finetune-c_concat-256.yaml --train --gpus=0,1,2,3 precision=16
I have trained for 10,000 steps and it is evident from the generations that something is going wrong. Do you know why this might be / should I be using a different command?
For context, the logged images are:
inputs_gs-000000_e-000000_b-000000
conditioning_gs-000000_e-000000_b-000000
reconstruction_gs-000000_e-000000_b-000000
samples_gs-000000_e-000000_b-000000
samples_cfg_scale_3.00_gs-000000_e-000000_b-000000
Thank you so much for your help!
Currently the README provides an example of how one might do 3D reconstruction with an existing image and transforms_train.json pair, but it provides no details on how we'd start doing reconstructions with our own images.
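For anyone in the same situation, here is my guess at the minimal transforms_train.json you would write for your own image, based on the standard NeRF-synthetic layout that 3drec's load_blender appears to expect (all values below are placeholders, not verified against this repo):

```python
import json
import numpy as np

transforms = {
    "camera_angle_x": 0.857,  # horizontal FoV in radians (placeholder)
    "frames": [{
        "file_path": "./train/my_image",         # image path, extension usually omitted
        "transform_matrix": np.eye(4).tolist(),  # 4x4 camera-to-world pose (placeholder)
    }],
}
with open("transforms_train.json", "w") as f:
    json.dump(transforms, f, indent=2)
```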
After cd zero123/zero123 and then python gradio_new.py, the program gets Killed.
The code is located in zero123/zero123/ldm/models/diffusion/ddpm.py:

```python
def instantiate_cond_stage(self, config):
    if not self.cond_stage_trainable:
        if config == "__is_first_stage__":
            print("Using first stage also as cond stage.")
            self.cond_stage_model = self.first_stage_model
        elif config == "__is_unconditional__":
            print(f"Training {self.__class__.__name__} as an unconditional model.")
            self.cond_stage_model = None
            # self.be_unconditional = True
        else:
            model = instantiate_from_config(config)
```

instantiate_from_config raises an error; the config is cond_stage_config: target: ldm.modules.encoders.modules.FrozenCLIPImageEmbedder.
Has anyone else encountered similar situations?
Hi, thanks for the great work!
I saw that in the paper one baseline called SJC-I was mentioned: "Finally, we adapted SJC [53], a diffusion-based text-to-3D model where the original text-conditioned diffusion model is replaced with an image-conditioned diffusion model, which we termed SJC-I". I'm just wondering what the "image-conditioned diffusion model" refers to. (I guess it's not your finetuned view-conditioned diffusion model, right?)
Thanks!
When I try to run gradio_new.py, I get a CUDA OOM error, despite having >23GB of available memory. Is there something in the config I can tweak to overcome this?
Hi, thanks for providing the 1.5T of rendered views!
As I'm using this dataset for my research, I noticed that only the transformation matrices are provided, while other camera parameters are missing, such as fov, camera_angle_x, rotation, etc. (those from BlenderDataset). Since we are using objects from Objaverse, where the ground truth is available, it would be better to skip the calibration step and use the information from the 3D assets. Therefore, I'm wondering if you could provide more details about how each object is rendered.
Thanks!
Hello! Thanks for the awesome project and for releasing the models and data!
By the way, I found that downloading the rendered images is very slow (about 30KB/s).
Is there any other way to download the dataset?
Thanks!
Hi,
I was wondering what GPU the authors use for 3drec/run_zero123.py, and whether anyone has been successful running it with <=32GB of GPU RAM? The novel view synthesis works fine, but I'm getting a CUDA out-of-memory error running the 3D reconstruction script. Thanks, and great work!
Hi ruoshi,
Thank you for your awesome work. I have a question about the training script. In main.py, when loading the pretrained SD model sd-image-conditioned-v2.ckpt, the parameters of FrozenCLIPImageEmbedder will not be loaded due to unmatched key names. So does your fine-tuned model load the CLIP embedding parameters?
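To make the question concrete, this is how I checked (generic PyTorch, nothing repo-specific):

```python
import torch

# `model` is assumed to be the LatentDiffusion instantiated from the config.
sd = torch.load("sd-image-conditioned-v2.ckpt", map_location="cpu")["state_dict"]
missing, unexpected = model.load_state_dict(sd, strict=False)
print(len(missing), "missing keys;", len(unexpected), "unexpected keys")
print([k for k in missing if "cond_stage_model" in k][:5])
```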
Hi, authors of zero123,
I wonder if it is necessary to filter out some objects from Objaverse when creating the dataset, since some instances are quite odd (e.g., a single sheet of paper). Have you filtered the dataset you provide in this repo? Could you give me some suggestions?
Hi @ruoshiliu,
Thanks for your excellent contribution. I tried to reproduce the results but cannot be sure whether mine are correct. Could you please provide the training log so I can take a closer look? TensorBoard logs would also be helpful if you could provide them.
Thanks for this fantastic work!
I appreciate the effort it took to fit the model in 22GB. Would it be possible to squeeze it further down to 16GB? I'd love to be able to run it on my card (RTX 4080).
Hi!
First of all, thank you so much for releasing such a wonderful work.
I've checked this issue but still have a question.
In the paper, it is said that 3D reconstruction was performed with SJC, not NeRF.
Even in the 3D reconstruction section of the README, run_zero123.py, which uses SJC, is shown as an example.
However, there is no part of the SJC code that extracts the 3D model.
Can you tell me the reason?
Thanks.
Hi @ruoshiliu,
Thanks for your contribution. I'm trying to quickly test an idea, but I need to re-render the dataset. I noticed that your previous answer mentioned using 10 machines to render; could you provide code and a tutorial for multi-machine rendering?
I'm wondering if the RT matrix is the extrinsic matrix of the camera? And I just want to make sure that in the dataset this matrix is stored in transforms.json, right? I'm confused because the RT matrix generated during rendering seems to differ from GET3D's, so I want to make sure that I'm doing the rendering process correctly and getting the camera's extrinsic matrix.
Hi, authors, thanks for your work. I'm training your repo with 2 A100s and batch_size = 192; do I have to set accumulate_grad_batches to 4, since you trained on 8 A100s?
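(My own reasoning, assuming batch_size in the config is per GPU: the effective batch size is num_gpus × batch_size × accumulate_grad_batches, so 2 × 192 × 4 = 1536 would match 8 × 192 × 1 on 8 A100s.)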
Hi @ruoshiliu,
Have you tried to train the model with half-precision?
As mentioned in #20, I couldn't get it to train with precision=16.
Thanks for your excellent work!
Hi @ruoshiliu,
I tried to reproduce the results with 32 V100 GPUs (batch size of 12 per node and accumulate_grad_batches of 4); could you please help check these losses and reconstruction results?
In addition, I really hope that you can provide the training logs for future research.
Hi! @ruoshiliu
May I ask a question about the LR scheduler? Currently it seems you are using a constant lr=1 after warm-up. Is this the optimal schedule you've found? I am asking because I am wondering what the appropriate learning rate would be if we want to fine-tune the model without it deviating.
Thanks so much for your help!
Dear friends! I really appreciate your work and your care in choosing testing images outside the Objaverse distribution! May I ask how you chose your testing objects? (To be more concrete, do you have any advice on choosing a 3D object test dataset so that we can evaluate against ground truth?)
Hi! Happy to read about your excellent work!
May I know where the file valid_paths.json used for training on Objaverse is? I can only find object-paths.json in the downloaded files.
Thanks.
Hi, thank you for the amazing work!
I was able to download the renderings provided in the repository. However, I was not able to find the camera intrinsics (e.g., focal length or camera angle) or the near and far depths for each scene. I wanted to check whether I was missing something.
Also, I wanted to confirm: the downloadable renderings do not contain depth maps for each view, right?
Thanks
Hi, @ruoshiliu
I tried to download the rendered results of the Objaverse dataset. However, the following errors occurred:
Hi @ruoshiliu,
Thanks for sharing the code! When I run the 3drec code, it seems the config is missing?
Hello, thanks for fixing the download link!
I successfully downloaded your dataset!
By the way, what information do the numpy files contain?
I checked that each npy file includes a 3x4 matrix (a camera pose).
Is it the camera extrinsic matrix?
I wonder whether the camera extrinsics are already preprocessed, assuming the object is centered at the origin of the coordinate system.
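For reference, this is how I inspected one file (my own sketch; the w2c interpretation is the very assumption I am asking about, and the path is a placeholder):

```python
import numpy as np

RT = np.load("views_release/<uid>/000.npy")  # placeholder path; shape (3, 4)
R, t = RT[:, :3], RT[:, 3]

# If [R|t] is world-to-camera, the camera center in world coordinates is:
C = -R.T @ t
print("camera center (if w2c):", C, "| distance from origin:", np.linalg.norm(C))
```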
I ran the training script and encountered an error: No such file or directory: 'views_whole_sphere/valid_paths.json'. Please help me fix this problem. Thank you.
Hi, thanks for demonstrating this fantastic work!
I have been following the provided rendering code based on your instructions.
However, as mentioned in #4, the rendering speed is extremely slow, even with 8 GPUs.
Do you have an estimate of how long the whole process would take to render the Objaverse dataset?
Would it be possible to have a downloadable link, such as Google Drive, Dropbox, etc., for your preprocessed dataset (or even a reasonably sized subset)?
Thanks!
Hello, this is excellent work.
I have a question about gradio_new.py. Specifically, I want to know whether the rotation matrix (camera_R) is in c2w or w2c format. Please clarify this for me.
Also, I want to know whether the camera coordinate system used in gradio_new.py follows the NeRF/OpenGL convention, where the camera faces the negative z-axis and the positive x-axis points to the right, or some other convention.
Finally, I was wondering whether the values of cam_x, cam_y, and cam_z in gradio_new.py represent the coordinates of the camera in the world coordinate system.
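For reference, here is the generic relationship I would use to test which convention camera_R follows (standard math, not taken from this repo):

```python
import numpy as np

# A camera-to-world pose [R | C] and a world-to-camera extrinsic [R' | t]
# are related by R' = R^T and t = -R^T @ C.
def c2w_to_w2c(R_c2w: np.ndarray, C: np.ndarray):
    R_w2c = R_c2w.T
    t = -R_w2c @ C
    return R_w2c, t
```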
Looking forward to your reply!
Hi, just to validate that the files downloaded correctly, could you please provide md5 checksums for them?
This works great on Windows for novel view synthesis, but there is a small path issue with 3D reconstruction:

```
Loading model from ../zero123/105000.ckpt
Global Step: 165000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.53 M params.
Keeping EMAs of 688.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Traceback (most recent call last):
  File "C:\Users\user\Desktop\zero123\3drec\run_zero123.py", line 405, in <module>
    dispatch(SJC)
  File "C:\Users\user\Desktop\zero123\3drec\my\config.py", line 76, in dispatch
    mod.run()
  File "C:\Users\user\Desktop\zero123\3drec\run_zero123.py", line 124, in run
    sjc_3d(**cfgs, poser=poser, model=model, vox=vox)
  File "C:\Users\user\Desktop\zero123\3drec\run_zero123.py", line 150, in sjc_3d
    images_, _, poses_, mask_, fov_x = load_blender('train', scene=scene, path=nerf_path)
  File "C:\Users\user\Desktop\zero123\3drec\voxnerf\data.py", line 15, in load_blender
    with open(root / f'transforms_{split}.json', "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: "'data\\nerf_wild'\\'pikachu'\\transforms_train.json"
```
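My guess at a workaround (not an official fix): on Windows the quotes from the command line survive into the config strings, producing the quoted path components seen above, so stripping the quotes before the path is joined fixes it. data_root and scene below stand in for the variables in voxnerf/data.py:

```python
from pathlib import Path

def clean_path(data_root: str, scene: str) -> Path:
    # Strip any literal quote characters carried over from the shell.
    return Path(data_root.strip("'\"")) / scene.strip("'\"")
```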
Hello, I want to know how to get the image data for training. I only downloaded the Objaverse data from Hugging Face, with the '.glb' files, and when I run the training scripts, they seem to try to load an image from .objaverse/hf-objaverse-v1/692db5f2d3a04bb286cb977a7dba903e_1/002.png, but I do not have these '.png' image files.
By the way, on line 282 of zero123/zero123/ldm/data/simple.py, sys is not defined in that file (the import is missing).