cvlab-columbia / zero123

Zero-1-to-3: Zero-shot One Image to 3D Object (ICCV 2023)

Home Page: https://zero123.cs.columbia.edu/

License: MIT License

Languages: Python 99.92%, Shell 0.08%
Topics: image-to-3d, novel-view-synthesis, single-view-reconstruction, stable-diffusion, zero-shot

zero123's Introduction

Zero-1-to-3: Zero-shot One Image to 3D Object

ICCV 2023

Hugging Face Spaces

Zero-1-to-3: Zero-shot One Image to 3D Object
Ruoshi Liu1, Rundi Wu1, Basile Van Hoorick1, Pavel Tokmakov2, Sergey Zakharov2, Carl Vondrick1
1Columbia University, 2Toyota Research Institute

Updates

Usage

Novel View Synthesis

conda create -n zero123 python=3.9
conda activate zero123
cd zero123
pip install -r requirements.txt
git clone https://github.com/CompVis/taming-transformers.git
pip install -e taming-transformers/
git clone https://github.com/openai/CLIP.git
pip install -e CLIP/

Download a checkpoint into the zero123 directory from one of the following sources:

https://huggingface.co/cvlab/zero123-weights/tree/main
wget https://cv.cs.columbia.edu/zero123/assets/$iteration.ckpt    # iteration = [105000, 165000, 230000, 300000]

Note that we have released 4 model weights: 105000.ckpt, 165000.ckpt, 230000.ckpt, 300000.ckpt. By default, we use 105000.ckpt, the checkpoint after fine-tuning for 105,000 iterations on Objaverse. Naturally, checkpoints trained longer tend to overfit to the training data and suffer in zero-shot generalization, though we didn't empirically verify this. 300000.ckpt was trained for around 6,000 A100-hours.

Run our gradio demo for novel view synthesis:

python gradio_new.py

Note that this app uses around 22 GB of VRAM, so it may not run on every GPU.
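If you are unsure whether your GPU is large enough, a quick check like the sketch below (generic PyTorch, not part of this repo) can tell you before launching the demo:

# Generic VRAM check; assumes PyTorch with CUDA support is installed.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB total")
    if total_gb < 22:
        print("Warning: the demo needs around 22 GB of VRAM and may not fit on this GPU.")
else:
    print("No CUDA device detected.")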

Training Script (preliminary)

Download the image-conditioned Stable Diffusion checkpoint released by Lambda Labs:
wget https://cv.cs.columbia.edu/zero123/assets/sd-image-conditioned-v2.ckpt

Download and unzip valid_paths.json.zip and move the valid_paths.json file under the views_release folder.

Run training command:

python main.py \
    -t \
    --base configs/sd-objaverse-finetune-c_concat-256.yaml \
    --gpus 0,1,2,3,4,5,6,7 \
    --scale_lr False \
    --num_nodes 1 \
    --seed 42 \
    --check_val_every_n_epoch 10 \
    --finetune_from sd-image-conditioned-v2.ckpt

Note that this training script is set up for an 8-GPU system, each GPU with 80 GB of VRAM. As discussed in the paper, a large batch size is empirically very important for stably training Stable Diffusion. If you have smaller GPUs, consider using a smaller batch size with gradient accumulation to obtain a similar effective batch size. Please check this thread for the train/val split we used in the paper.
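As a rule of thumb, the effective batch size is the per-GPU batch size times the number of GPUs times the number of gradient-accumulation steps. A minimal sketch of that arithmetic (the per-GPU batch size of 192 is only an example; check your own config, e.g. the data batch_size and the trainer's accumulate_grad_batches, for the actual values):

# Effective batch size = per-GPU batch size x number of GPUs x gradient-accumulation steps.
# The numbers below are illustrative, not prescriptive.
def effective_batch_size(batch_per_gpu, num_gpus, accumulate_grad_batches=1):
    return batch_per_gpu * num_gpus * accumulate_grad_batches

print(effective_batch_size(192, 8))                             # 1536 on an 8-GPU setup
print(effective_batch_size(192, 2, accumulate_grad_batches=4))  # 1536 again, via accumulation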

Dataset (Objaverse Renderings)

Download our objaverse renderings with:

wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz

Disclaimer: note that the renderings are generated with Objaverse. The renderings as a whole are released under the ODC-By 1.0 license. The renderings of individual objects are released under the same Creative Commons license as the corresponding objects in Objaverse.

3D Reconstruction (SDS)

Check out Stable-Dreamfusion

3D Reconstruction (SJC)

Note that we haven't extensively tuned the hyperparameters for 3D reconstruction. Feel free to explore and play around!

cd 3drec
pip install -r requirements.txt
python run_zero123.py \
    --scene pikachu \
    --index 0 \
    --n_steps 10000 \
    --lr 0.05 \
    --sd.scale 100.0 \
    --emptiness_weight 0 \
    --depth_smooth_weight 10000. \
    --near_view_weight 10000. \
    --train_view True \
    --prefix "experiments/exp_wild" \
    --vox.blend_bg_texture False \
    --nerf_path "data/nerf_wild"
  • You can see results under: 3drec/experiments/exp_wild/$EXP_NAME.

  • To export a mesh from the trained voxel NeRF with marching cubes, use the export_mesh function. For example, add a line:

    vox.export_mesh($PATH_TO_EXPORT)

    under the evaluate function (a hypothetical placement is sketched after this list).

  • The dataset is formatted in the same way as NeRF for convenience of data loading. In practice, the recommended input, in addition to the input image, is an estimate of the elevation angle of the image (e.g., 0 if the image is taken from the top, 90 from the front, 180 from the bottom). This is currently hard-coded into the extrinsics matrix in transforms_train.json.

  • We tested the installation process on Ubuntu 20.04 with an NVIDIA GPU with the Ampere architecture.
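For reference, a hypothetical placement of the export_mesh call mentioned above might look like the following; the evaluate signature here is illustrative, and only the vox.export_mesh(...) line comes from the instructions:

# Hypothetical sketch; the real evaluate function lives in 3drec/run_zero123.py and its
# signature may differ. Only the export_mesh call is taken from the instructions above.
def evaluate(model, vox, poser):
    # ... existing evaluation / rendering code ...
    vox.export_mesh("experiments/exp_wild/mesh.obj")  # example path; choose your own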

Discussion on Janus Problem

The design of our method fundamentally alleviates the Janus problem, as shown in the 3D reconstruction results above and in many results in the Stable-Dreamfusion repo. By modeling the camera perspective explicitly and training on a large-scale, high-quality synthetic dataset where we can obtain ground truth for everything, the viewpoint ambiguity and bias present in text-to-image models are significantly alleviated.

This is also related to the prompting tricks used in DreamFusion, where phrases like "a back view of" are inserted at the beginning of the text prompt. Zero-1-to-3 models such changes of viewpoint explicitly and finetunes on Objaverse to ensure both consistency after the viewpoint change and accuracy of the queried viewpoint.

Acknowledgement

This repository is based on Stable Diffusion, Objaverse, and SJC. We would like to thank the authors of these works for publicly releasing their code. We would like to thank the authors of NeRDi and SJC for their helpful feedback.

We would like to thank Changxi Zheng and Chengzhi Mao for many helpful discussions. This research is based on work partially supported by the Toyota Research Institute, the DARPA MCS program under Federal Agreement No. N660011924032, and the NSF NRI Award #1925157.

Citation

@misc{liu2023zero1to3,
      title={Zero-1-to-3: Zero-shot One Image to 3D Object}, 
      author={Ruoshi Liu and Rundi Wu and Basile Van Hoorick and Pavel Tokmakov and Sergey Zakharov and Carl Vondrick},
      year={2023},
      eprint={2303.11328},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

zero123's People

Contributors

attashe, basilevh, chriswu1997, ruoshiliu, ryanrussell


zero123's Issues

Missing files

I ran the training script and encountered an error: No such file or directory: 'views_whole_sphere/valid_paths.json'. Please help me fix this problem. Thank you.

question about SJC-I

Hi, thanks for the great work!

I saw in the paper that one baseline called SJC-I was mentioned: "Finally, we adapted SJC [53], a diffusion-based text-to-3D model where the original text-conditioned diffusion model is replaced with an image-conditioned diffusion model, which we termed SJC-I". Just wondering what the "image-conditioned diffusion model" refers to? (I guess it's not your finetuned view-conditioned diffusion model, right?)

Thanks!

CUDA out of memory with a 3090

When I try to run gradio_new.py I get a CUDA OOM error, despite having >23 GB of available memory. Is there something in the config I can tweak to overcome this?

Extract mesh for SJC

Hi!
First of all, thank you so much for revealing such a wonderful work.
I've checked this issue but still have a question.
In the paper, it is said that 3D reconstruction was performed with SJC, not NeRF.
Even in the 3D reconstruction section of the README, run_zero123.py using SJC is shown as an example.
However, there is no part of the SJC code that extracts a mesh.
Can you tell me the reason?

Thanks.

Regarding the released objaverse renderings

Hi, thank you for the amazing work!

I was able to download the renderings provided in the repository. However, I was not able to find the camera intrinsics (e.g., focal length or camera angle) or the near and far depths of each scene. I wanted to check whether I am missing something.

Also, I wanted to confirm: the downloadable renderings do not contain depth maps for each view, right?

Thanks

Objaverse Rendering Dataset

Hello, thanks for fixing the download link!
I successfully downloaded your dataset!

By the way, what information do the numpy files contain?

I checked that each npy file includes a 3x4 matrix (camera pose).
Is it the camera extrinsic?
I also wonder whether the extrinsic is already preprocessed, assuming the object center sits at the origin of the coordinate system.
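For anyone inspecting these files, a minimal sketch (the file path is hypothetical, and whether the 3x4 matrix is world-to-camera or camera-to-world is exactly the open question here):

# Assumes each .npy stores a 3x4 camera-pose matrix, as described in the question above.
import numpy as np

pose = np.load("views_release/<object_id>/000.npy")   # hypothetical path
print(pose.shape)                                      # expected: (3, 4)
R, t = pose[:, :3], pose[:, 3]                         # [R | t] split; an interpretation, not a fact
print("R:\n", R, "\nt:", t)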

A question about camera pose

Hi,
Thanks for the amazing work. I would like to use your released data for some experiments. I have some visualization examples of projecting the vertices of the 3D object bounding box onto the 2D image, but I got results like this:

[image]

I set the intrinsic matrix as K = np.asarray([512, 0, 256, 0, 512, 256, 0, 0, 1]).reshape(3, 3), which could be wrong. Is there any normalization process in your implementation? Could you please provide the correct intrinsic matrix?

Looking forward to your reply!
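For a pinhole camera, the intrinsic matrix follows directly from the field of view and the image size. A generic sketch (the FOV actually used for the released renderings is not stated here, so the value below is a placeholder):

# Pinhole intrinsics with the principal point at the image center; square pixels assumed.
import numpy as np

def intrinsics_from_fov(fov_x_rad, width, height):
    fx = 0.5 * width / np.tan(0.5 * fov_x_rad)
    return np.array([[fx, 0.0, width / 2.0],
                     [0.0, fx, height / 2.0],
                     [0.0, 0.0, 1.0]])

K = intrinsics_from_fov(np.deg2rad(50.0), 512, 512)  # 50 degrees is a placeholder, not the real value
print(K)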

Questions on training with less GPUs

Hi authors, thanks for your work. I'm training your repo with 2 A100s and batch_size = 192; do I have to set accumulate_grad_batches to 4, since you trained on 8 A100s?

Will the training code also be released?

Thanks for the awesome project!
I wonder whether the training code for this study will also be released.

Again, thanks for the awesome and very cool project.

model for human face

As you mentioned, 'If you are trying out images of humans, especially faces, note that it is unfortunately not the intended use case. We would encourage trying out images of everyday objects instead, or even artworks.' I am wondering whether it is inherently hard to generate a 3D model of a human face, or whether such a model is still in training and will be released later?

Unable to train successfully

Hello and thank you for your very nice paper!

I am trying to train a view-conditioned network using the code in zero123, but something is going wrong. I am wondering if my command is wrong, or if there is something else that I am missing.

I am using the command:

python main.py --base configs/sd-objaverse-finetune-c_concat-256.yaml --train --gpus=0,1,2,3 precision=16

I have trained for 10,000 steps and it is evident from the generations that something is going wrong. Do you know why this might be, or should I be using a different command?

For context, the logged images look as follows:

inputs_gs-000000_e-000000_b-000000: [image]
conditioning_gs-000000_e-000000_b-000000: [image]
reconstruction_gs-000000_e-000000_b-000000: [image]
samples_gs-000000_e-000000_b-000000: [image]
samples_cfg_scale_3.00_gs-000000_e-000000_b-000000: [image]

Thank you so much for your help!

Would it be possible to provide a downloadable link for the rendered images?

Hi, thanks for demonstrating this fantastic work!
I have been following the provided rendering code based on your instructions.
However, as mentioned, the rendering speed is extremely slow, even with 8 GPUs.
#4

Do you have an estimate of how long the whole process would take to render the Objaverse dataset?
Would it be possible to have a downloadable link, such as Google Drive, Dropbox, etc., for your preprocessed dataset (or even a reasonably sized subset)?

Thanks!

Abnormal Loss curve

Hi, I trained with 4 A100s with gradient accumulation of 2. However, the loss curve does not seem right:

[image]

These are the reconstruction results:
[image]

where is this config ?

Nice work !

configs/sd-objaverse-finetune-c_concat-256.yaml

It cannot find the config and the corresponding last.ckpt model.

Will it be shared later?

Detailed parameters of cameras when generating renders results from Objaverse

Hi, thanks for providing the 1.5T of rendered views!

As I'm using this dataset for my research, it appears that only the transformation matrices are provided, while other camera parameters are missing, such as fov, camera_angle_x, rotation, etc. (those from BlenderDataset). Since we are using objects from Objaverse, where ground truth is provided, it would be better to skip the calibration step and use the information from the 3D assets. Therefore, I'm wondering if you could provide more data about how each object is rendered.

Thanks!

Load pretrained model

Hi Ruoshi,

Thank you for your awesome work. I have a question about the training script. In main.py, when loading the pretrained SD model sd-image-conditioned-v2.ckpt, the parameters of FrozenCLIPImageEmbedder will not be loaded due to unmatched key names. So does your fine-tuned model load the CLIP embedding parameters?
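A generic PyTorch sketch (not specific to this repo) for checking which parameters actually fail to load from a checkpoint; 'model' is assumed to be an already-constructed diffusion module:

# Load non-strictly and list the keys that did not match; useful for verifying questions like this.
import torch

ckpt = torch.load("sd-image-conditioned-v2.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(len(missing), "missing keys,", len(unexpected), "unexpected keys")
print([k for k in missing if "cond_stage_model" in k][:10])  # e.g. the CLIP image embedder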

How to test the ckpt?

Hi!
I have seen the updated README about training. Will you also release the command for testing?
Also, may I know how to use ObjaverseDataModuleFromConfig by myself for more flexible control, instead of using trainer.fit()?

Optimise vram requirements to 16GB

Thanks for this fantastic work!

I appreciate the effort it took to fit the model in 22 GB. Would it be possible to squeeze it further down to 16 GB? I'd love to be able to run it on my card (RTX 4080).

Training logs

Hi @ruoshiliu,

I tried to reproduce the results with 32 V100 GPUs (batch size 12 per node, accumulate_grad_batches 4); could you please help check these losses and reconstruction results?

In addition, I really hope that you can provide the training logs for future research.

[image]

[image]

some question about gradio_new.py

Hello, this is excellent work.
I have a question about gradio_new.py. Specifically, I want to know whether the rotation matrix (camera_R) is in c2w or w2c format. Please clarify this for me.
Also, I want to know whether the camera coordinate system used in gradio_new.py is based on the NeRF/OpenGL convention, where the camera faces the negative z-axis and the positive x-axis points to the right, or something else.
Finally, I was wondering whether the values of cam_x, cam_y, and cam_z in gradio_new.py represent the coordinates of the camera in the world coordinate system.
Looking forward to your reply!

Some error of the dataloader.

Thanks a lot for releasing the training scripts and data. When I run the training scripts, I get some errors.

We originally extracted 100 objects from the dataset and ran on 2 GPUs with 32 GB of VRAM, with batch_size=16, num_workers=8 in config/sd-objaverse-finetune-c_concat-256.yaml, and it ran successfully.

However, when we use a larger dataset of 2000 objects and keep batch_size=16, num_workers=8, we get the error RuntimeError: DataLoader worker (pid(s) 8298) exited unexpectedly. I modified config/sd-objaverse-finetune-c_concat-256.yaml and tried setting batch_size or num_workers smaller, but we still get the same error. We also tried num_workers=0 and it did not work. Only with batch_size=1 does the error go away. So, I want to know where the error may occur and whether there is any solution to this problem.
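One generic way to narrow down errors like this (not specific to this repo) is to iterate the dataset in the main process with no workers, which usually surfaces the real exception and the offending sample index:

# 'dataset' is assumed to be the training dataset instance built from the config.
for i in range(len(dataset)):
    try:
        _ = dataset[i]
    except Exception as e:
        print(f"sample {i} failed: {type(e).__name__}: {e}")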

Lack of evaluation code and confusion on evaluation dataset

Thanks for the interesting work and for releasing the code. Could you please provide more details on the evaluation steps in your experiments? Additionally, I noticed that you evaluated the method on the GSO dataset and the RTMV dataset, but it seems that the GSO dataset is a subset of the RTMV dataset. Could you please clarify the evaluation process and provide more information on how you evaluated on these two datasets?

Killed : cd zero123/zero123, then python gradio_new.py, get Killed

cd zero123/zero123, then run python gradio_new.py; the program gets Killed.
The relevant code is in zero123/zero123/ldm/models/diffusion/ddpm.py:


def instantiate_cond_stage(self, config):
    if not self.cond_stage_trainable:
        if config == "__is_first_stage__":
            print("Using first stage also as cond stage.")
            self.cond_stage_model = self.first_stage_model
        elif config == "__is_unconditional__":
            print(f"Training {self.__class__.__name__} as an unconditional model.")
            self.cond_stage_model = None
            # self.be_unconditional = True
        else:
            model = instantiate_from_config(config)

instantiate_from_config raises an error; the config is cond_stage_config: target: ldm.modules.encoders.modules.FrozenCLIPImageEmbedder.
Did anyone else encounter similar situations?

Where is valid_paths.json

Hi! Happy to read about your excellent work!

May I know where the file valid_paths.json that is used for training on Objaverse is located? I can only find object-paths.json in the downloaded files.

Thanks.

Comparison version of Point-e

On your webpage you show comparisons in single-view 3D reconstruction against Point-E. I was wondering which version of Point-E this is: whether you trained your own or used the open-source release? And in either case, what were the specifics?

Thank you

How many objaverse objects are used when finetuning

Hi, first, great respect for this work!
I'm just wondering how many Objaverse objects are used in this project? I see the Objaverse dataset is quite large.
Perhaps you could provide a tutorial on processing the Objaverse dataset (rendering, etc.)?

Best,

Questions on dataloader

Hi, I appreciate your excellent work. I tried to download your data and ran your training script. However, the released data has no file 'valid_paths.json' (I managed to find it in another issue); is it just all the subfolder names in the views_release folder? Also, in your dataset code:

if self.paths[index][-2:] == '_1':  # dirty fix for rendering the dataset twice
    total_view = 8
else:
    total_view = 4

the total view count for each subfolder is 8 or 4. However, in my downloaded data most scenes have more than 10 views; do you only sample randomly from 4 or 8 views during training?

LR schedule

Hi! @ruoshiliu

May I ask a question about the LR scheduler? Currently it seems you are using a constant lr of 1 after warm-up. Is that the optimal schedule you've found? I am asking because I am wondering what an appropriate lr would be if we want to fine-tune the model without it deviating.

Thanks so much for your help!

Source of testing images in the paper

Dear friends! I really appreciate your work and your caution in choosing testing images outside the Objaverse distribution! May I ask how you chose the testing objects? (More concretely, do you have any advice on choosing a 3D object testing dataset so that we can test against ground truth?)

Out of memory on 24 GB VRAM GPU

I had some trouble debugging this, but it seems like an out-of-memory error, as nvidia-smi shows 24.2 GB in use, leading to a failure:
cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
or, with cuDNN disabled:
cuBLAS error: CUBLAS_STATUS_NOT_INITIALIZED
in the F.conv2d call inside [...].

If you got it running on an RTX 3090, could you share the configuration changes?
I used a 128x128 RGB image as the smallest test.

R matrix from gradio_new.py

I have checked Appendix Section A of the paper regarding the camera coordinate system. In gradio_new.py, is the rotation matrix (camera_R) in w2c format?
In gradio_new.py, you use camera_R to obtain the camera extrinsics but do not give the camera's T matrix. So I wanted to use pytorch3d's look_at_view_transform to compute the extrinsic matrix [R|t] using: R, T = look_at_view_transform(dist=radius, elev=90 - polar_deg, azim=azimuth_deg, up=((0, 1, 0),)), but I got a different result. In the image below, R_jisuan is the result computed with pytorch3d and camera_R_ori is the result from your code. How can I compute the same R using pytorch3d, or can you give me advice on how to compute the camera's T matrix with your code?
[image]
Looking forward to your reply!

CFG scale(guidance scale) for generation

Hi, when playing with the live demo, I found that increasing the default CFG (guidance scale) enhances the quality of the generated samples (less variance and sharper results, e.g., increasing it to 10+).
[image]

Just curious about this.

Understanding cross attention layer with a single context token

Hi, thanks for sharing this fantastic work.
I am trying to understand the cross-attention used in the model. Here, the conditional context has only one token, i.e., the CLIP embedding concatenated with the pose. As a result, the size of the cross-attention matrix is [num_spatial_tokens, 1] and all attention weights are one. The output just copies the value vector to each spatial location (or adds it, if we consider the residual connection). It seems that K and Q are redundant in this case. Is this the expected behavior?
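The reasoning about the attention weights can be checked in a few lines of PyTorch: with a single key, the softmax over the key dimension is always 1, so the output is the (projected) value broadcast to every spatial token:

# Toy check of cross-attention with a single context token.
import torch

num_spatial, d = 4096, 64
q = torch.randn(num_spatial, d)   # spatial queries
k = torch.randn(1, d)             # single context token (e.g. CLIP embedding + pose)
v = torch.randn(1, d)

attn = torch.softmax(q @ k.T / d**0.5, dim=-1)   # shape (num_spatial, 1)
print(attn.min().item(), attn.max().item())      # both 1.0: softmax over one element
out = attn @ v                                   # every row equals v
print(torch.allclose(out, v.expand_as(out)))     # True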

Windows path issue

This is working great on Windows with novel view synthesis, but there is a small path issue with 3D reconstruction:

Loading model from ../zero123/105000.ckpt
Global Step: 165000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.53 M params.
Keeping EMAs of 688.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Traceback (most recent call last):
  File "C:\Users\user\Desktop\zero123\3drec\run_zero123.py", line 405, in <module>
    dispatch(SJC)
  File "C:\Users\user\Desktop\zero123\3drec\my\config.py", line 76, in dispatch
    mod.run()
  File "C:\Users\user\Desktop\zero123\3drec\run_zero123.py", line 124, in run
    sjc_3d(**cfgs, poser=poser, model=model, vox=vox)
  File "C:\Users\user\Desktop\zero123\3drec\run_zero123.py", line 150, in sjc_3d
    images_, _, poses_, mask_, fov_x = load_blender('train', scene=scene, path=nerf_path)
  File "C:\Users\user\Desktop\zero123\3drec\voxnerf\data.py", line 15, in load_blender
    with open(root / f'transforms_{split}.json', "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: "'data\\nerf_wild'\\'pikachu'\\transforms_train.json"

Mesh extraction in 3D reconstruction

Hi @ruoshiliu, thank you for making your incredible work public!
I'm looking into the code for 3D reconstruction based on SJC and wondering whether there is code for extracting a mesh from the generated 3D representation, like what has been shared on the project page. If there are any libraries or code repositories used for this, that information would also be of great help. Thank you!

How to get the image data when training ?

Hello, I want to know how to get the image data for training. I only downloaded the Objaverse data from Hugging Face as '.glb' files, and when I run the training scripts, it seems to try to load images such as .objaverse/hf-objaverse-v1/692db5f2d3a04bb286cb977a7dba903e_1/002.png, but I do not have these '.png' images.

By the way, in line 282 of zero123/zero123/ldm/data/simple.py, sys is not defined in this file.

Question about Objaverse rendering

Hi, authors of zero123,

I wonder if it is necessary to filter out some objects in Objaverse when creating the dataset, since some instances are quite odd (e.g., a single sheet of paper). Have you filtered the dataset you provide in this repo? Could you give me some suggestions?

Data Download is too slow.

Hello! Thanks for the awesome project and for releasing the models and data!
By the way, I found that downloading the rendered images is very slow (about 30 KB/s).

Is there any other way to download the dataset?
Thanks !

3D reconstruction with 32GB of GPU RAM

Hi,
I was wondering what GPU the authors used for 3drec/run_zero123.py, and whether anyone has been successful running it with <=32 GB of GPU RAM? Novel view synthesis works fine, but I'm getting a CUDA out-of-memory error running the 3D reconstruction script. Thanks, and great work!

RT matrix from rendering

I'm wondering whether the RT matrix is the camera's extrinsic matrix? And I just want to make sure that, in the dataset, this matrix is stored in transforms.json, right? I'm confused because the RT matrix generated during rendering seems to differ from that of GET3D, so I want to make sure I'm running the right rendering process to get the camera's extrinsic matrix.
