liheyoung / depth-anything

[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation

Home Page: https://depth-anything.github.io

License: Apache License 2.0

Python 99.95% Shell 0.05%
depth-estimation image-synthesis metric-depth-estimation monocular-depth-estimation

depth-anything's Introduction

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Lihe Yang1 · Bingyi Kang2† · Zilong Huang2 · Xiaogang Xu3,4 · Jiashi Feng2 · Hengshuang Zhao1*

1HKU    2TikTok    3CUHK    4ZJU

†project lead *corresponding author

CVPR 2024

Paper PDF Project Page

This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.

[teaser figure]

News

Features of Depth Anything

If you need other features, please first check the existing community support.

  • Relative depth estimation:

    Our foundation models listed here can provide relative depth estimation for any given image robustly. Please refer here for details.

  • Metric depth estimation:

    We fine-tune our Depth Anything model with metric depth information from NYUv2 or KITTI. It offers strong capabilities for both in-domain and zero-shot metric depth estimation. Please refer here for details.

  • Better depth-conditioned ControlNet:

    We re-train a better depth-conditioned ControlNet based on Depth Anything. It offers more precise synthesis than the previous MiDaS-based ControlNet. Please refer here for details. You can also use our new ControlNet based on Depth Anything in ControlNet WebUI or ComfyUI's ControlNet.

  • Downstream high-level scene understanding:

    The Depth Anything encoder can be fine-tuned for downstream high-level perception tasks, e.g., semantic segmentation (86.2 mIoU on Cityscapes and 59.4 mIoU on ADE20K). Please refer here for details.

Performance

Here we compare our Depth Anything with the previous best model, MiDaS v3.1 (BEiT_L-512).

Please note that the latest MiDaS is also trained on KITTI and NYUv2, whereas ours is not.

| Method | Params | KITTI AbsRel / $\delta_1$ | NYUv2 AbsRel / $\delta_1$ | Sintel AbsRel / $\delta_1$ | DDAD AbsRel / $\delta_1$ | ETH3D AbsRel / $\delta_1$ | DIODE AbsRel / $\delta_1$ |
|---|---|---|---|---|---|---|---|
| MiDaS | 345.0M | 0.127 / 0.850 | 0.048 / 0.980 | 0.587 / 0.699 | 0.251 / 0.766 | 0.139 / 0.867 | 0.075 / 0.942 |
| Ours-S | 24.8M | 0.080 / 0.936 | 0.053 / 0.972 | 0.464 / 0.739 | 0.247 / 0.768 | 0.127 / 0.885 | 0.076 / 0.939 |
| Ours-B | 97.5M | 0.080 / 0.939 | 0.046 / 0.979 | 0.432 / 0.756 | 0.232 / 0.786 | 0.126 / 0.884 | 0.069 / 0.946 |
| Ours-L | 335.3M | 0.076 / 0.947 | 0.043 / 0.981 | 0.458 / 0.760 | 0.230 / 0.789 | 0.127 / 0.882 | 0.066 / 0.952 |

Better results correspond to a lower AbsRel ($\downarrow$) and a higher $\delta_1$ ($\uparrow$).

Pre-trained models

We provide three models of varying scales for robust relative depth estimation:

| Model | Params | Inference time on V100 (ms) | A100 (ms) | RTX 4090 with TensorRT (ms) |
|---|---|---|---|---|
| Depth-Anything-Small | 24.8M | 12 | 8 | 3 |
| Depth-Anything-Base | 97.5M | 13 | 9 | 6 |
| Depth-Anything-Large | 335.3M | 20 | 13 | 12 |

Note that the V100 and A100 inference times (without TensorRT) are measured excluding the pre-processing and post-processing stages, whereas the RTX 4090 (with TensorRT) column includes these two stages (please refer to Depth-Anything-TensorRT).

You can easily load our pre-trained models by:

from depth_anything.dpt import DepthAnything

encoder = 'vits' # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder))

Depth Anything is also supported in transformers. You can use it for depth prediction within 3 lines of code (credit to @niels).

No network connection and cannot load these models? You can load them from local checkpoints instead:
from depth_anything.dpt import DepthAnything
import torch

model_configs = {
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]}
}

encoder = 'vitl' # or 'vitb', 'vits'
depth_anything = DepthAnything(model_configs[encoder])
depth_anything.load_state_dict(torch.load(f'./checkpoints/depth_anything_{encoder}14.pth'))

Note that when loading the model locally in this way, you do not need to install the huggingface_hub package. In that case, feel free to delete this line and remove the PyTorchModelHubMixin in this line.

Usage

Installation

git clone https://github.com/LiheYoung/Depth-Anything
cd Depth-Anything
pip install -r requirements.txt

Running

python run.py --encoder <vits | vitb | vitl> --img-path <img-directory | single-img | txt-file> --outdir <outdir> [--pred-only] [--grayscale]

Arguments:

  • --img-path: you can either 1) point it to an image directory containing all images of interest, 2) point it to a single image, or 3) point it to a text file listing all the image paths.
  • --pred-only is set to save the predicted depth map only. Without it, by default, we visualize both the image and its depth map side by side.
  • --grayscale is set to save the grayscale depth map. Without it, by default, we apply a color palette to the depth map.

For example:

python run.py --encoder vitl --img-path assets/examples --outdir depth_vis

If you want to use Depth Anything on videos:

python run_video.py --encoder vitl --video-path assets/examples_video --outdir video_depth_vis

Gradio demo

To use our gradio demo locally:

python app.py

You can also try our online demo.

Import Depth Anything to your project

If you want to use Depth Anything in your own project, you can simply follow run.py to load our models and define data pre-processing.

Code snippet (note the difference between our data pre-processing and that of MiDaS)
from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

import cv2
import torch
from torchvision.transforms import Compose

encoder = 'vits' # can also be 'vitb' or 'vitl'
depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{:}14'.format(encoder)).eval()

transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

image = cv2.cvtColor(cv2.imread('your image path'), cv2.COLOR_BGR2RGB) / 255.0
image = transform({'image': image})['image']
image = torch.from_numpy(image).unsqueeze(0)

# depth shape: 1xHxW
depth = depth_anything(image)
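
The output above is a relative (inverse) depth map at the network input resolution. A minimal post-processing sketch in the spirit of run.py, assuming h and w hold the original image height and width (record them before applying the transform); the interpolation mode and min-max normalization below are assumptions, not necessarily the repository's exact choices:

import torch.nn.functional as F

# resize the 1xHxW prediction back to the original image resolution
depth = F.interpolate(depth[None], (h, w), mode='bilinear', align_corners=False)[0, 0]

# min-max normalize to [0, 255] for visualization (larger values = nearer)
depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
depth = depth.detach().cpu().numpy().astype('uint8')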

Do not want to define image pre-processing or download model definition files?

Easily use Depth Anything through transformers within 3 lines of code! Please refer to these instructions (credit to @niels).

Note: If you encounter KeyError: 'depth_anything', please install the latest transformers from source:

pip install git+https://github.com/huggingface/transformers.git
A brief demo:
from transformers import pipeline
from PIL import Image

image = Image.open('Your-image-path')
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]
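
Continuing the demo above, the pipeline output can be saved or inspected directly. The "predicted_depth" key below is an assumption based on the standard transformers depth-estimation pipeline and is not mentioned in this README:

result = pipe(image)
result["depth"].save('depth_vis.png')        # PIL image, ready to view
predicted_depth = result["predicted_depth"]  # raw torch tensor at model resolution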

Community Support

We sincerely appreciate all the extensions built on Depth Anything by the community. Thank you all!

Here we list the extensions we have found:

If you have an amazing project that supports or improves (e.g., speeds up) Depth Anything, please feel free to open an issue. We will add it here.

Acknowledgement

We would like to express our deepest gratitude to AK (@_akhaliq) and the awesome Hugging Face team (@niels, @hysts, and @yuvraj) for helping improve the online demo and build the HF models.

Besides, we thank the MagicEdit team for providing some video examples for video depth estimation, and Tiancheng Shen for evaluating the depth maps with MagicEdit.

Citation

If you find this project useful, please consider citing:

@inproceedings{depthanything,
      title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data}, 
      author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
      booktitle={CVPR},
      year={2024}
}

depth-anything's People

Contributors

1ssb, liheyoung


depth-anything's Issues

Does this work with xformers?

I am working on a Unity project to test this model out for converting 2D images and video to 3D (it currently uses BEiT and works great). The 2D-to-3D processing of video was pretty slow with this model though (around 2 fps).

So I tried to pip install xformers for a potential speedup, but it says the version of xformers I'm using is not compatible with Torch 2.0.1+cu117 (CUDA 11.7).

The only version of xformers I found that might work is version 22, and it appears to support only CUDA 11.8 with Torch 2.0.1. Is it okay to use CUDA 11.8 with this model? I think the Unity project requires 11.7. Does anyone know an old xformers wheel that works with Torch 2.0.1+cu117? Also, would xformers even make a difference for the framerate? Thanks.

Are there any other parameters in the .py files I can adjust to increase the framerate? It seems like all 3 models run at 2 fps.
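
One speed knob visible in the repository's own pre-processing (see the README snippet above) is the network input resolution: the Resize transform targets 518 px on the shorter side, and lowering it trades accuracy for speed. A minimal sketch, assuming the same transform arguments as run.py; the value 378 is only an example:

from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet
from torchvision.transforms import Compose
import cv2

# a smaller input size means fewer ViT tokens and faster inference, at some accuracy cost
transform = Compose([
    Resize(
        width=378,   # example value; ensure_multiple_of=14 keeps the rounding exact
        height=378,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])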

Sun RGB-D Dataset

Hello, thanks for the great work!
Could you please share the source from which you downloaded the SunRGBD dataset for evaluation?

Fine-Tune

Will the code for fine-tuning the models be released?
Thank you for your excellent work.

Point cloud from depth map?

Hi, and thanks for making this code available!

Is it possible to calculate a 3d point cloud from the depth map?

Are the depth maps at a consistent scale, or an arbitrary one?

Thanks again
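
For reference, a point cloud can be recovered by unprojecting a depth map with the camera intrinsics, but the geometry is only physically meaningful when the depth is metric (e.g., from the fine-tuned metric-depth models); the relative-depth output is only defined up to an unknown scale and shift. A minimal pinhole-camera sketch, where fx, fy, cx, cy and the metric depth array are assumptions supplied by the user:

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Unproject a metric depth map (H x W, in metres) into an N x 3 point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)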

Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Hi! When I try to run app.py or run_video.py, I get this error:

Loading weights from local directory
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
Running on local URL:  http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\gradio\queueing.py", line 495, in call_prediction
    output = await route_utils.call_process_api(
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\gradio\route_utils.py", line 232, in call_process_api
    output = await app.get_blocks().process_api(
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\gradio\blocks.py", line 1561, in process_api
    result = await self.call_function(
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\gradio\blocks.py", line 1179, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\gradio\utils.py", line 678, in wrapper
    response = f(*args, **kwargs)
  File "app.py", line 75, in on_submit
    depth = predict_depth(model, image)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "app.py", line 52, in predict_depth
    return model(image)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Apps\AiApps\xdrive\DepthGen\Depth-Anything\depth_anything\dpt.py", line 158, in forward
    features = self.pretrained.get_intermediate_layers(x, 4, return_class_token=True)
  File "D:\Apps\AiApps\xdrive\DepthGen\Depth-Anything\torchhub/facebookresearch_dinov2_main\vision_transformer.py", line 308, in get_intermediate_layers
    outputs = self._get_intermediate_layers_not_chunked(x, n)
  File "D:\Apps\AiApps\xdrive\DepthGen\Depth-Anything\torchhub/facebookresearch_dinov2_main\vision_transformer.py", line 272, in _get_intermediate_layers_not_chunked
    x = self.prepare_tokens_with_masks(x)
  File "D:\Apps\AiApps\xdrive\DepthGen\Depth-Anything\torchhub/facebookresearch_dinov2_main\vision_transformer.py", line 214, in prepare_tokens_with_masks
    x = self.patch_embed(x)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Apps\AiApps\xdrive\DepthGen\Depth-Anything\torchhub/facebookresearch_dinov2_main\dinov2\layers\patch_embed.py", line 76, in forward
    x = self.proj(x)  # B C H W
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "D:\Apps\AiApps\xdrive\DepthGen\miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Windows 10, cuda 11
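
The traceback indicates the input tensor was moved to the GPU while the model weights stayed on the CPU; putting both on the same device is the usual fix. A hedged one-line sketch (the variable names model and image are assumptions about the local code, not the exact names in app.py):

model = model.to('cuda')   # or, alternatively, keep everything on the CPU: image = image.cpu()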

setup.py?

Hello, nice stuff.
Can you add a setup.py or pyproject.toml so it can be used as a package (pip install / poetry add) without cloning the repo?

About pretraining code

Hi, thanks for your work! Do you plan to release the pretraining code and, for example, the training dataset?

MacOS support

Trying to run inference on macOS yields xformers errors about unsupported operations (e.g., smaller, cutlassF, tritonflashattF, etc.).
Can you add support for running on macOS (and preferably on Apple Silicon)?

How to recover point cloud from depth map?

Hi, thanks for your great work.
I ran run.py on my custom images, and I'm having difficulty using the intrinsics and the inferred depth map to recover a point cloud. The pixel values of the 16-bit uint depth map do not seem to be relative depth, because I found that the pixel values at far distances are smaller. Please tell me the correct meaning of the pixel values of the 16-bit depth map (inferred in run.py#L72, shown just below).

Depth-Anything/run.py

Lines 72 to 73 in c3390b8

with torch.no_grad():
depth = depth_anything(image)

Evaluation of relative depth

Hello,

Thanks for your amazing work.

I have a question regarding evaluating relative depth. I see from your paper that the errors are as below:
[screenshot of the relative-depth error table from the paper]
How to reproduce these errors from the pretrained weights that are available? Specifically, is there an eval.py for relative depth evaluation? Or how did you get these numbers?

What I tried doing is:

  1. Convert disparity to depth (by taking the inverse)
  2. Find the median of GT and median of predicted depth and use the ratio to scale the prediction
  3. Compute the errors.

By doing this, my errors are in the range of 30-40%, which is nowhere close to the roughly 7-8% that you report.
It would be really helpful if you let me know how you computed these numbers. Thanks in advance 👍
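
For context, affine-invariant relative depth is usually evaluated by aligning the prediction to the ground truth with a per-image scale and shift via least squares in disparity space (as in the MiDaS protocol) before computing AbsRel and $\delta_1$ in depth space; whether this exactly matches the paper's protocol is an assumption. A minimal sketch:

import numpy as np

def align_scale_shift(pred_disp, gt_depth, mask):
    """Least-squares scale/shift alignment of predicted disparity to GT disparity."""
    gt_disp = 1.0 / gt_depth[mask]
    x = pred_disp[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    s, t = np.linalg.lstsq(A, gt_disp, rcond=None)[0]
    aligned_disp = np.clip(s * pred_disp + t, a_min=1e-6, a_max=None)
    return 1.0 / aligned_disp   # aligned depth

def abs_rel(pred_depth, gt_depth, mask):
    """AbsRel over valid pixels."""
    return np.mean(np.abs(pred_depth[mask] - gt_depth[mask]) / gt_depth[mask])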

true metric depth values

Hi @LiheYoung ,

This is super impressive work. I used the Hugging Face deployment to test out the network. I gave it a sample image from a camera with known intrinsics, and it output a depth map (consider it as disparity, as it says on Hugging Face). I see per-pixel values of the depth/disparity map, but I do not know how to extract per-pixel true metric depth from them. Are the depth maps relative, or are they true metric? If they are true metric, how can I extract per-pixel metric depth?

video input

hi, thank you for your excellent work. How can I quickly use video as input?

Training procedure

Dear @LiheYoung ,

First, thank you for sharing such a great work. The feature alignment technique to preserve semantic info is very interesting. I am very interested in the training techniques, so I would like to ask you some questions.

From the sub-section "4.1. Implementation Details", as I understand the training procedure goes like this:

  1. train a teacher model (pretrained weights from Segformer?) T for 20 epochs on labeled images, only horizontal flipping aug is used.
  2. jointly train labeled and unlabeled images
    2.0. Augmentation techniques: horizontal flipping, two forms of perturbations (strong color distortions and CutMix)
    2.1. re-initialize S (that means pretrained weights are from Segformer?)
    2.2. "train a student model to sweep across all unlabeled images for one time": train just one epoch?
    2.3. continue to train the student model for how many epochs?

Could you confirm my understanding is correct? Thanks for your time.

Adding support for the Depth Anything model in X-AnyLabeling

Hi, @LiheYoung, thank you so much for your outstanding work!

I've successfully integrated Depth Anything into X-AnyLabeling, bringing a significant advancement to our efforts. Depth Anything proves to be a highly practical solution for robust monocular depth estimation, trained on a combination of 1.5 million labeled images and over 62 million unlabeled images. This integration enhances X-AnyLabeling, providing a more comprehensive and industrial-grade solution for image data engineering.

It's important to note that this isn't an issue but rather a celebration of the successful integration. Once again, thanks for your exceptional contribution. Looking forward to more successful collaborations in the future!

Inference time is slower than expected

Hi,

Thanks for sharing the work. When I run the vitl example on an A100 GPU, the inference time settles down to around 120 ms rather than the 13 ms stated in the repo. Is there a reason for this? I have provided the experiment I ran below.

Thanks!

import cv2
import numpy as np
import os
import torch
import torch.nn.functional as F
from torchvision.transforms import Compose

from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

import matplotlib.pyplot as plt

if __name__ == '__main__':

    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    
    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

    gpu_name = torch.cuda.get_device_name(torch.cuda.current_device())
    print(f"GPU being used: {gpu_name}")
    
    encoder = 'vitl'
    depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{}14'.format(encoder)).to(DEVICE).eval()
    
    total_params = sum(param.numel() for param in depth_anything.parameters())
    print('Total parameters: {:.2f}M'.format(total_params / 1e6))
    
    transform = Compose([
        Resize(
            width=518,
            height=518,
            resize_target=False,
            keep_aspect_ratio=True,
            ensure_multiple_of=14,
            resize_method='lower_bound',
            image_interpolation_method=cv2.INTER_CUBIC,
        ),
        NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        PrepareForNet(),
    ])

    filename = "assets/examples/demo1.png"

    raw_image = cv2.imread(filename)
    image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0
    
    h, w = image.shape[:2]
    
    image = transform({'image': image})['image']
    image = torch.from_numpy(image).unsqueeze(0).to(DEVICE)
    
    print(f"image shape: {image.shape}")

    with torch.no_grad():
        import time
        for i in range(1000):
            start = time.perf_counter()
            depth = depth_anything(image)
            print(f"inference time is: {time.perf_counter() - start}s")
GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 784])
inference time is: 3.4120892197825015s
inference time is: 0.014787798281759024s
inference time is: 0.01355740800499916s
inference time is: 0.10093897487968206s
inference time is: 0.12020917888730764s
inference time is: 0.11985550913959742s
inference time is: 0.12007139809429646s
inference time is: 0.1200293991714716s
inference time is: 0.12007084907963872s
inference time is: 0.12004875903949142s
inference time is: 0.12011446803808212s
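
One caveat with timing CUDA work this way is that kernel launches are asynchronous, so time.perf_counter() alone can mix queueing and execution, and the first iterations include one-off costs. A hedged sketch of a more reliable timing loop, reusing depth_anything, image and torch from the script above:

with torch.no_grad():
    # warm-up: exclude one-off costs (cuDNN autotuning, memory allocation, lazy init)
    for _ in range(10):
        depth_anything(image)
    torch.cuda.synchronize()

    import time
    n = 100
    start = time.perf_counter()
    for _ in range(n):
        depth_anything(image)
    torch.cuda.synchronize()   # wait for all queued kernels before stopping the clock
    print(f"mean inference time: {(time.perf_counter() - start) / n * 1000:.2f} ms")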

Depth_Anything inside PRISMA

I added this model to PRISMA, a multi-inference project that lets you make depth inferences alongside MiDaS v3.1, ZoeDepth, PatchFusion and Marigold, plus segmentation of dynamic objects in a scene.

https://github.com/patriciogonzalezvivo/prisma

Depth is encoded in such a way that it is easy to load in Blender or any other rendering engine for visualization.

For videos it is also possible to run RAFT and COLMAP.

Could you release a tiny/nano version of depth anything pth file?

Hello!
Thank you for your amazing job!
I have converted depthanything_vits14 into an onnxruntime (CPU) version, to make it easy to deploy on lite devices (for example, a Raspberry Pi or a laptop CPU).
For now, I have tested depthanything_vits14.onnx on the x64 CPU of my laptop, and fps = 3.5 (for 518x518, PyTorch).
After statically quantizing this ONNX model, fps = 3.7 (for 518x518, quant-static onnxruntime).
I want fps >= 24, but depthanything_vits14 does not achieve that; I think its parameter count is too heavy for a lite CPU.
I really like your work, and I want to deploy it on CPU to make this wonderful thing more powerful.
Could you release a tiny/nano version of the Depth Anything .pth file to make it faster on CPU?
Thank you in advance!

Cannot load depth_anything_vitl14.pth to finetune metric depth

Cannot load depth_anything_vitl14.pth to fine-tune metric depth. Any suggestions? The error is:
RuntimeError: Error(s) in loading state_dict for ZoeDepth:
Missing key(s) in state_dict: "core.core.pretrained.cls_token", "core.core.pretrained.pos_embed", "core.core.pretrained.mask_token", "core.core.pretrained.patch_embed.proj.weight", "core.core.pretrained.patch_embed.proj.bias", "core.core.pretrained.blocks.0.norm1.weight", "core.core.pretrained.blocks.0.norm1.bias", "core.core.pretrained.blocks.0.attn.qkv.weight", "core.core.pretrained.blocks.0.attn.qkv.bias", "core.core.pretrained.blocks.0.attn.proj.weight", "core.core.pretrained.blocks.0.attn.proj.bias", "core.core.pretrained.blocks.0.ls1.gamma", "core.core.pretrained.blocks.0.norm2.weight", "core.core.pretrained.blocks.0.norm2.bias", "core.core.pretrained.blocks.0.mlp.fc1.weight", "core.core.pretrained.blocks.0.mlp.fc1.bias", "core.core.pretrained.blocks.0.mlp.fc2.weight", "core.core.pretrained.blocks.0.mlp.fc2.bias", "core.core.pretrained.blocks.0.ls2.gamma", "core.core.pretrained.blocks.1.norm1.weight", "core.core.pretrained.blocks.1.norm1.bias", "core.core.pretrained.blocks.1.attn.qkv.weight", "core.core.pretrained.blocks.1.attn.qkv.bias", "core.core.pretrained.blocks.1.attn.proj.weight", "core.core.pretrained.blocks.1.attn.proj.bias", "core.core.pretrained.blocks.1.ls1.gamma", "core.core.pretrained.blocks.1.norm2.weight", "core.core.pretrained.blocks.1.norm2.bias", "core.core.pretrained.blocks.1.mlp.fc1.weight", "core.core.pretrained.blocks.1.mlp.fc1.bias", "core.core.pretrained.blocks.1.mlp.fc2.weight", "core.core.pretrained.blocks.1.mlp.fc2.bias", "core.core.pretrained.blocks.1.ls2.gamma", "core.core.pretrained.blocks.2.norm1.weight", "core.core.pretrained.blocks.2.norm1.bias", "core.core.pretrained.blocks.2.attn.qkv.weight", "core.core.pretrained.blocks.2.attn.qkv.bias", "core.core.pretrained.blocks.2.attn.proj.weight", "core.core.pretrained.blocks.2.attn.proj.bias", "core.core.pretrained.blocks.2.ls1.gamma", "core.core.pretrained.blocks.2.norm2.weight", "core.core.pretrained.blocks.2.norm2.bias", "core.core.pretrained.blocks.2.mlp.fc1.weight", "core.core.pretrained.blocks.2.mlp.fc1.bias", "core.core.pretrained.blocks.2.mlp.fc2.weight", "core.core.pretrained.blocks.2.mlp.fc2.bias", "core.core.pretrained.blocks.2.ls2.gamma", "core.core.pretrained.blocks.3.norm1.weight", "core.core.pretrained.blocks.3.norm1.bias", "core.core.pretrained.blocks.3.attn.qkv.weight", "core.core.pretrained.blocks.3.attn.qkv.bias", "core.core.pretrained.blocks.3.attn.proj.weight", "core.core.pretrained.blocks.3.attn.proj.bias", "core.core.pretrained.blocks.3.ls1.gamma", "core.core.pretrained.blocks.3.norm2.weight", "core.core.pretrained.blocks.3.norm2.bias", "core.core.pretrained.blocks.3.mlp.fc1.weight", "core.core.pretrained.blocks.3.mlp.fc1.bias", "core.core.pretrained.blocks.3.mlp.fc2.weight", "core.core.pretrained.blocks.3.mlp.fc2.bias", "core.core.pretrained.blocks.3.ls2.gamma", "core.core.pretrained.blocks.4.norm1.weight", "core.core.pretrained.blocks.4.norm1.bias", "core.core.pretrained.blocks.4.attn.qkv.weight", "core.core.pretrained.blocks.4.attn.qkv.bias", "core.core.pretrained.blocks.4.attn.proj.weight", "core.core.pretrained.blocks.4.attn.proj.bias", "core.core.pretrained.blocks.4.ls1.gamma", "core.core.pretrained.blocks.4.norm2.weight", "core.core.pretrained.blocks.4.norm2.bias", "core.core.pretrained.blocks.4.mlp.fc1.weight", "core.core.pretrained.blocks.4.mlp.fc1.bias", "core.core.pretrained.blocks.4.mlp.fc2.weight", "core.core.pretrained.blocks.4.mlp.fc2.bias", "core.core.pretrained.blocks.4.ls2.gamma", "core.core.pretrained.blocks.5.norm1.weight", 
"core.core.pretrained.blocks.5.norm1.bias", "core.core.pretrained.blocks.5.attn.qkv.weight", "core.core.pretrained.blocks.5.attn.qkv.bias", "core.core.pretrained.blocks.5.attn.proj.weight", "core.core.pretrained.blocks.5.attn.proj.bias", "core.core.pretrained.blocks.5.ls1.gamma", "core.core.pretrained.blocks.5.norm2.weight", "core.core.pretrained.blocks.5.norm2.bias", "core.core.pretrained.blocks.5.mlp.fc1.weight", "core.core.pretrained.blocks.5.mlp.fc1.bias", "core.core.pretrained.blocks.5.mlp.fc2.weight", "core.core.pretrained.blocks.5.mlp.fc2.bias", "core.core.pretrained.blocks.5.ls2.gamma", "core.core.pretrained.blocks.6.norm1.weight", "core.core.pretrained.blocks.6.norm1.bias", "core.core.pretrained.blocks.6.attn.qkv.weight", "core.core.pretrained.blocks.6.attn.qkv.bias", "core.core.pretrained.blocks.6.attn.proj.weight", "core.core.pretrained.blocks.6.attn.proj.bias", "core.core.pretrained.blocks.6.ls1.gamma", "core.core.pretrained.blocks.6.norm2.weight", "core.core.pretrained.blocks.6.norm2.bias", "core.core.pretrained.blocks.6.mlp.fc1.weight", "core.core.pretrained.blocks.6.mlp.fc1.bias", "core.core.pretrained.blocks.6.mlp.fc2.weight", "core.core.pretrained.blocks.6.mlp.fc2.bias", "core.core.pretrained.blocks.6.ls2.gamma", "core.core.pretrained.blocks.7.norm1.weight", "core.core.pretrained.blocks.7.norm1.bias", "core.core.pretrained.blocks.7.attn.qkv.weight", "core.core.pretrained.blocks.7.attn.qkv.bias", "core.core.pretrained.blocks.7.attn.proj.weight", "core.core.pretrained.blocks.7.attn.proj.bias", "core.core.pretrained.blocks.7.ls1.gamma", "core.core.pretrained.blocks.7.norm2.weight", "core.core.pretrained.blocks.7.norm2.bias", "core.core.pretrained.blocks.7.mlp.fc1.weight", "core.core.pretrained.blocks.7.mlp.fc1.bias", "core.core.pretrained.blocks.7.mlp.fc2.weight", "core.core.pretrained.blocks.7.mlp.fc2.bias", "core.core.pretrained.blocks.7.ls2.gamma", "core.core.pretrained.blocks.8.norm1.weight", "core.core.pretrained.blocks.8.norm1.bias", "core.core.pretrained.blocks.8.attn.qkv.weight", "core.core.pretrained.blocks.8.attn.qkv.bias", "core.core.pretrained.blocks.8.attn.proj.weight", "core.core.pretrained.blocks.8.attn.proj.bias", "core.core.pretrained.blocks.8.ls1.gamma", "core.core.pretrained.blocks.8.norm2.weight", "core.core.pretrained.blocks.8.norm2.bias", "core.core.pretrained.blocks.8.mlp.fc1.weight", "core.core.pretrained.blocks.8.mlp.fc1.bias", "core.core.pretrained.blocks.8.mlp.fc2.weight", "core.core.pretrained.blocks.8.mlp.fc2.bias", "core.core.pretrained.blocks.8.ls2.gamma", "core.core.pretrained.blocks.9.norm1.weight", "core.core.pretrained.blocks.9.norm1.bias", "core.core.pretrained.blocks.9.attn.qkv.weight", "core.core.pretrained.blocks.9.attn.qkv.bias", "core.core.pretrained.blocks.9.attn.proj.weight", "core.core.pretrained.blocks.9.attn.proj.bias", "core.core.pretrained.blocks.9.ls1.gamma", "core.core.pretrained.blocks.9.norm2.weight", "core.core.pretrained.blocks.9.norm2.bias", "core.core.pretrained.blocks.9.mlp.fc1.weight", "core.core.pretrained.blocks.9.mlp.fc1.bias", "core.core.pretrained.blocks.9.mlp.fc2.weight", "core.core.pretrained.blocks.9.mlp.fc2.bias", "core.core.pretrained.blocks.9.ls2.gamma", "core.core.pretrained.blocks.10.norm1.weight", "core.core.pretrained.blocks.10.norm1.bias", "core.core.pretrained.blocks.10.attn.qkv.weight", "core.core.pretrained.blocks.10.attn.qkv.bias", "core.core.pretrained.blocks.10.attn.proj.weight", "core.core.pretrained.blocks.10.attn.proj.bias", "core.core.pretrained.blocks.10.ls1.gamma", 
"core.core.pretrained.blocks.10.norm2.weight", "core.core.pretrained.blocks.10.norm2.bias", "core.core.pretrained.blocks.10.mlp.fc1.weight", "core.core.pretrained.blocks.10.mlp.fc1.bias", "core.core.pretrained.blocks.10.mlp.fc2.weight", "core.core.pretrained.blocks.10.mlp.fc2.bias", "core.core.pretrained.blocks.10.ls2.gamma", "core.core.pretrained.blocks.11.norm1.weight", "core.core.pretrained.blocks.11.norm1.bias", "core.core.pretrained.blocks.11.attn.qkv.weight", "core.core.pretrained.blocks.11.attn.qkv.bias", "core.core.pretrained.blocks.11.attn.proj.weight", "core.core.pretrained.blocks.11.attn.proj.bias", "core.core.pretrained.blocks.11.ls1.gamma", "core.core.pretrained.blocks.11.norm2.weight", "core.core.pretrained.blocks.11.norm2.bias", "core.core.pretrained.blocks.11.mlp.fc1.weight", "core.core.pretrained.blocks.11.mlp.fc1.bias", "core.core.pretrained.blocks.11.mlp.fc2.weight", "core.core.pretrained.blocks.11.mlp.fc2.bias", "core.core.pretrained.blocks.11.ls2.gamma", "core.core.pretrained.blocks.12.norm1.weight", "core.core.pretrained.blocks.12.norm1.bias", "core.core.pretrained.blocks.12.attn.qkv.weight", "core.core.pretrained.blocks.12.attn.qkv.bias", "core.core.pretrained.blocks.12.attn.proj.weight", "core.core.pretrained.blocks.12.attn.proj.bias", "core.core.pretrained.blocks.12.ls1.gamma", "core.core.pretrained.blocks.12.norm2.weight", "core.core.pretrained.blocks.12.norm2.bias", "core.core.pretrained.blocks.12.mlp.fc1.weight", "core.core.pretrained.blocks.12.mlp.fc1.bias", "core.core.pretrained.blocks.12.mlp.fc2.weight", "core.core.pretrained.blocks.12.mlp.fc2.bias", "core.core.pretrained.blocks.12.ls2.gamma", "core.core.pretrained.blocks.13.norm1.weight", "core.core.pretrained.blocks.13.norm1.bias", "core.core.pretrained.blocks.13.attn.qkv.weight", "core.core.pretrained.blocks.13.attn.qkv.bias", "core.core.pretrained.blocks.13.attn.proj.weight", "core.core.pretrained.blocks.13.attn.proj.bias", "core.core.pretrained.blocks.13.ls1.gamma", "core.core.pretrained.blocks.13.norm2.weight", "core.core.pretrained.blocks.13.norm2.bias", "core.core.pretrained.blocks.13.mlp.fc1.weight", "core.core.pretrained.blocks.13.mlp.fc1.bias", "core.core.pretrained.blocks.13.mlp.fc2.weight", "core.core.pretrained.blocks.13.mlp.fc2.bias", "core.core.pretrained.blocks.13.ls2.gamma", "core.core.pretrained.blocks.14.norm1.weight", "core.core.pretrained.blocks.14.norm1.bias", "core.core.pretrained.blocks.14.attn.qkv.weight", "core.core.pretrained.blocks.14.attn.qkv.bias", "core.core.pretrained.blocks.14.attn.proj.weight", "core.core.pretrained.blocks.14.attn.proj.bias", "core.core.pretrained.blocks.14.ls1.gamma", "core.core.pretrained.blocks.14.norm2.weight", "core.core.pretrained.blocks.14.norm2.bias", "core.core.pretrained.blocks.14.mlp.fc1.weight", "core.core.pretrained.blocks.14.mlp.fc1.bias", "core.core.pretrained.blocks.14.mlp.fc2.weight", "core.core.pretrained.blocks.14.mlp.fc2.bias", "core.core.pretrained.blocks.14.ls2.gamma", "core.core.pretrained.blocks.15.norm1.weight", "core.core.pretrained.blocks.15.norm1.bias", "core.core.pretrained.blocks.15.attn.qkv.weight", "core.core.pretrained.blocks.15.attn.qkv.bias", "core.core.pretrained.blocks.15.attn.proj.weight", "core.core.pretrained.blocks.15.attn.proj.bias", "core.core.pretrained.blocks.15.ls1.gamma", "core.core.pretrained.blocks.15.norm2.weight", "core.core.pretrained.blocks.15.norm2.bias", "core.core.pretrained.blocks.15.mlp.fc1.weight", "core.core.pretrained.blocks.15.mlp.fc1.bias", "core.core.pretrained.blocks.15.mlp.fc2.weight", 
"core.core.pretrained.blocks.15.mlp.fc2.bias", "core.core.pretrained.blocks.15.ls2.gamma", "core.core.pretrained.blocks.16.norm1.weight", "core.core.pretrained.blocks.16.norm1.bias", "core.core.pretrained.blocks.16.attn.qkv.weight", "core.core.pretrained.blocks.16.attn.qkv.bias", "core.core.pretrained.blocks.16.attn.proj.weight", "core.core.pretrained.blocks.16.attn.proj.bias", "core.core.pretrained.blocks.16.ls1.gamma", "core.core.pretrained.blocks.16.norm2.weight", "core.core.pretrained.blocks.16.norm2.bias", "core.core.pretrained.blocks.16.mlp.fc1.weight", "core.core.pretrained.blocks.16.mlp.fc1.bias", "core.core.pretrained.blocks.16.mlp.fc2.weight", "core.core.pretrained.blocks.16.mlp.fc2.bias", "core.core.pretrained.blocks.16.ls2.gamma", "core.core.pretrained.blocks.17.norm1.weight", "core.core.pretrained.blocks.17.norm1.bias", "core.core.pretrained.blocks.17.attn.qkv.weight", "core.core.pretrained.blocks.17.attn.qkv.bias", "core.core.pretrained.blocks.17.attn.proj.weight", "core.core.pretrained.blocks.17.attn.proj.bias", "core.core.pretrained.blocks.17.ls1.gamma", "core.core.pretrained.blocks.17.norm2.weight", "core.core.pretrained.blocks.17.norm2.bias", "core.core.pretrained.blocks.17.mlp.fc1.weight", "core.core.pretrained.blocks.17.mlp.fc1.bias", "core.core.pretrained.blocks.17.mlp.fc2.weight", "core.core.pretrained.blocks.17.mlp.fc2.bias", "core.core.pretrained.blocks.17.ls2.gamma", "core.core.pretrained.blocks.18.norm1.weight", "core.core.pretrained.blocks.18.norm1.bias", "core.core.pretrained.blocks.18.attn.qkv.weight", "core.core.pretrained.blocks.18.attn.qkv.bias", "core.core.pretrained.blocks.18.attn.proj.weight", "core.core.pretrained.blocks.18.attn.proj.bias", "core.core.pretrained.blocks.18.ls1.gamma", "core.core.pretrained.blocks.18.norm2.weight", "core.core.pretrained.blocks.18.norm2.bias", "core.core.pretrained.blocks.18.mlp.fc1.weight", "core.core.pretrained.blocks.18.mlp.fc1.bias", "core.core.pretrained.blocks.18.mlp.fc2.weight", "core.core.pretrained.blocks.18.mlp.fc2.bias", "core.core.pretrained.blocks.18.ls2.gamma", "core.core.pretrained.blocks.19.norm1.weight", "core.core.pretrained.blocks.19.norm1.bias", "core.core.pretrained.blocks.19.attn.qkv.weight", "core.core.pretrained.blocks.19.attn.qkv.bias", "core.core.pretrained.blocks.19.attn.proj.weight", "core.core.pretrained.blocks.19.attn.proj.bias", "core.core.pretrained.blocks.19.ls1.gamma", "core.core.pretrained.blocks.19.norm2.weight", "core.core.pretrained.blocks.19.norm2.bias", "core.core.pretrained.blocks.19.mlp.fc1.weight", "core.core.pretrained.blocks.19.mlp.fc1.bias", "core.core.pretrained.blocks.19.mlp.fc2.weight", "core.core.pretrained.blocks.19.mlp.fc2.bias", "core.core.pretrained.blocks.19.ls2.gamma", "core.core.pretrained.blocks.20.norm1.weight", "core.core.pretrained.blocks.20.norm1.bias", "core.core.pretrained.blocks.20.attn.qkv.weight", "core.core.pretrained.blocks.20.attn.qkv.bias", "core.core.pretrained.blocks.20.attn.proj.weight", "core.core.pretrained.blocks.20.attn.proj.bias", "core.core.pretrained.blocks.20.ls1.gamma", "core.core.pretrained.blocks.20.norm2.weight", "core.core.pretrained.blocks.20.norm2.bias", "core.core.pretrained.blocks.20.mlp.fc1.weight", "core.core.pretrained.blocks.20.mlp.fc1.bias", "core.core.pretrained.blocks.20.mlp.fc2.weight", "core.core.pretrained.blocks.20.mlp.fc2.bias", "core.core.pretrained.blocks.20.ls2.gamma", "core.core.pretrained.blocks.21.norm1.weight", "core.core.pretrained.blocks.21.norm1.bias", "core.core.pretrained.blocks.21.attn.qkv.weight", 
"core.core.pretrained.blocks.21.attn.qkv.bias", "core.core.pretrained.blocks.21.attn.proj.weight", "core.core.pretrained.blocks.21.attn.proj.bias", "core.core.pretrained.blocks.21.ls1.gamma", "core.core.pretrained.blocks.21.norm2.weight", "core.core.pretrained.blocks.21.norm2.bias", "core.core.pretrained.blocks.21.mlp.fc1.weight", "core.core.pretrained.blocks.21.mlp.fc1.bias", "core.core.pretrained.blocks.21.mlp.fc2.weight", "core.core.pretrained.blocks.21.mlp.fc2.bias", "core.core.pretrained.blocks.21.ls2.gamma", "core.core.pretrained.blocks.22.norm1.weight", "core.core.pretrained.blocks.22.norm1.bias", "core.core.pretrained.blocks.22.attn.qkv.weight", "core.core.pretrained.blocks.22.attn.qkv.bias", "core.core.pretrained.blocks.22.attn.proj.weight", "core.core.pretrained.blocks.22.attn.proj.bias", "core.core.pretrained.blocks.22.ls1.gamma", "core.core.pretrained.blocks.22.norm2.weight", "core.core.pretrained.blocks.22.norm2.bias", "core.core.pretrained.blocks.22.mlp.fc1.weight", "core.core.pretrained.blocks.22.mlp.fc1.bias", "core.core.pretrained.blocks.22.mlp.fc2.weight", "core.core.pretrained.blocks.22.mlp.fc2.bias", "core.core.pretrained.blocks.22.ls2.gamma", "core.core.pretrained.blocks.23.norm1.weight", "core.core.pretrained.blocks.23.norm1.bias", "core.core.pretrained.blocks.23.attn.qkv.weight", "core.core.pretrained.blocks.23.attn.qkv.bias", "core.core.pretrained.blocks.23.attn.proj.weight", "core.core.pretrained.blocks.23.attn.proj.bias", "core.core.pretrained.blocks.23.ls1.gamma", "core.core.pretrained.blocks.23.norm2.weight", "core.core.pretrained.blocks.23.norm2.bias", "core.core.pretrained.blocks.23.mlp.fc1.weight", "core.core.pretrained.blocks.23.mlp.fc1.bias", "core.core.pretrained.blocks.23.mlp.fc2.weight", "core.core.pretrained.blocks.23.mlp.fc2.bias", "core.core.pretrained.blocks.23.ls2.gamma", "core.core.pretrained.norm.weight", "core.core.pretrained.norm.bias", "core.core.depth_head.projects.0.weight", "core.core.depth_head.projects.0.bias", "core.core.depth_head.projects.1.weight", "core.core.depth_head.projects.1.bias", "core.core.depth_head.projects.2.weight", "core.core.depth_head.projects.2.bias", "core.core.depth_head.projects.3.weight", "core.core.depth_head.projects.3.bias", "core.core.depth_head.resize_layers.0.weight", "core.core.depth_head.resize_layers.0.bias", "core.core.depth_head.resize_layers.1.weight", "core.core.depth_head.resize_layers.1.bias", "core.core.depth_head.resize_layers.3.weight", "core.core.depth_head.resize_layers.3.bias", "core.core.depth_head.scratch.layer1_rn.weight", "core.core.depth_head.scratch.layer2_rn.weight", "core.core.depth_head.scratch.layer3_rn.weight", "core.core.depth_head.scratch.layer4_rn.weight", "core.core.depth_head.scratch.refinenet1.out_conv.weight", "core.core.depth_head.scratch.refinenet1.out_conv.bias", "core.core.depth_head.scratch.refinenet1.resConfUnit1.conv1.weight", "core.core.depth_head.scratch.refinenet1.resConfUnit1.conv1.bias", "core.core.depth_head.scratch.refinenet1.resConfUnit1.conv2.weight", "core.core.depth_head.scratch.refinenet1.resConfUnit1.conv2.bias", "core.core.depth_head.scratch.refinenet1.resConfUnit2.conv1.weight", "core.core.depth_head.scratch.refinenet1.resConfUnit2.conv1.bias", "core.core.depth_head.scratch.refinenet1.resConfUnit2.conv2.weight", "core.core.depth_head.scratch.refinenet1.resConfUnit2.conv2.bias", "core.core.depth_head.scratch.refinenet2.out_conv.weight", "core.core.depth_head.scratch.refinenet2.out_conv.bias", 
"core.core.depth_head.scratch.refinenet2.resConfUnit1.conv1.weight", "core.core.depth_head.scratch.refinenet2.resConfUnit1.conv1.bias", "core.core.depth_head.scratch.refinenet2.resConfUnit1.conv2.weight", "core.core.depth_head.scratch.refinenet2.resConfUnit1.conv2.bias", "core.core.depth_head.scratch.refinenet2.resConfUnit2.conv1.weight", "core.core.depth_head.scratch.refinenet2.resConfUnit2.conv1.bias", "core.core.depth_head.scratch.refinenet2.resConfUnit2.conv2.weight", "core.core.depth_head.scratch.refinenet2.resConfUnit2.conv2.bias", "core.core.depth_head.scratch.refinenet3.out_conv.weight", "core.core.depth_head.scratch.refinenet3.out_conv.bias", "core.core.depth_head.scratch.refinenet3.resConfUnit1.conv1.weight", "core.core.depth_head.scratch.refinenet3.resConfUnit1.conv1.bias", "core.core.depth_head.scratch.refinenet3.resConfUnit1.conv2.weight", "core.core.depth_head.scratch.refinenet3.resConfUnit1.conv2.bias", "core.core.depth_head.scratch.refinenet3.resConfUnit2.conv1.weight", "core.core.depth_head.scratch.refinenet3.resConfUnit2.conv1.bias", "core.core.depth_head.scratch.refinenet3.resConfUnit2.conv2.weight", "core.core.depth_head.scratch.refinenet3.resConfUnit2.conv2.bias", "core.core.depth_head.scratch.refinenet4.out_conv.weight", "core.core.depth_head.scratch.refinenet4.out_conv.bias", "core.core.depth_head.scratch.refinenet4.resConfUnit1.conv1.weight", "core.core.depth_head.scratch.refinenet4.resConfUnit1.conv1.bias", "core.core.depth_head.scratch.refinenet4.resConfUnit1.conv2.weight", "core.core.depth_head.scratch.refinenet4.resConfUnit1.conv2.bias", "core.core.depth_head.scratch.refinenet4.resConfUnit2.conv1.weight", "core.core.depth_head.scratch.refinenet4.resConfUnit2.conv1.bias", "core.core.depth_head.scratch.refinenet4.resConfUnit2.conv2.weight", "core.core.depth_head.scratch.refinenet4.resConfUnit2.conv2.bias", "core.core.depth_head.scratch.output_conv1.weight", "core.core.depth_head.scratch.output_conv1.bias", "core.core.depth_head.scratch.output_conv2.0.weight", "core.core.depth_head.scratch.output_conv2.0.bias", "core.core.depth_head.scratch.output_conv2.2.weight", "core.core.depth_head.scratch.output_conv2.2.bias", "conv2.weight", "conv2.bias", "seed_bin_regressor._net.0.weight", "seed_bin_regressor._net.0.bias", "seed_bin_regressor._net.2.weight", "seed_bin_regressor._net.2.bias", "seed_projector._net.0.weight", "seed_projector._net.0.bias", "seed_projector._net.2.weight", "seed_projector._net.2.bias", "projectors.0._net.0.weight", "projectors.0._net.0.bias", "projectors.0._net.2.weight", "projectors.0._net.2.bias", "projectors.1._net.0.weight", "projectors.1._net.0.bias", "projectors.1._net.2.weight", "projectors.1._net.2.bias", "projectors.2._net.0.weight", "projectors.2._net.0.bias", "projectors.2._net.2.weight", "projectors.2._net.2.bias", "projectors.3._net.0.weight", "projectors.3._net.0.bias", "projectors.3._net.2.weight", "projectors.3._net.2.bias", "attractors.0._net.0.weight", "attractors.0._net.0.bias", "attractors.0._net.2.weight", "attractors.0._net.2.bias", "attractors.1._net.0.weight", "attractors.1._net.0.bias", "attractors.1._net.2.weight", "attractors.1._net.2.bias", "attractors.2._net.0.weight", "attractors.2._net.0.bias", "attractors.2._net.2.weight", "attractors.2._net.2.bias", "attractors.3._net.0.weight", "attractors.3._net.0.bias", "attractors.3._net.2.weight", "attractors.3._net.2.bias", "conditional_log_binomial.log_binomial_transform.k_idx", "conditional_log_binomial.log_binomial_transform.K_minus_1", 
"conditional_log_binomial.mlp.0.weight", "conditional_log_binomial.mlp.0.bias", "conditional_log_binomial.mlp.2.weight", "conditional_log_binomial.mlp.2.bias".
Unexpected key(s) in state_dict: "pretrained.cls_token", "pretrained.pos_embed", "pretrained.mask_token", "pretrained.patch_embed.proj.weight", "pretrained.patch_embed.proj.bias", "pretrained.blocks.0.norm1.weight", "pretrained.blocks.0.norm1.bias", "pretrained.blocks.0.attn.qkv.weight", "pretrained.blocks.0.attn.qkv.bias", "pretrained.blocks.0.attn.proj.weight", "pretrained.blocks.0.attn.proj.bias", "pretrained.blocks.0.ls1.gamma", "pretrained.blocks.0.norm2.weight", "pretrained.blocks.0.norm2.bias", "pretrained.blocks.0.mlp.fc1.weight", "pretrained.blocks.0.mlp.fc1.bias", "pretrained.blocks.0.mlp.fc2.weight", "pretrained.blocks.0.mlp.fc2.bias", "pretrained.blocks.0.ls2.gamma", "pretrained.blocks.1.norm1.weight", "pretrained.blocks.1.norm1.bias", "pretrained.blocks.1.attn.qkv.weight", "pretrained.blocks.1.attn.qkv.bias", "pretrained.blocks.1.attn.proj.weight", "pretrained.blocks.1.attn.proj.bias", "pretrained.blocks.1.ls1.gamma", "pretrained.blocks.1.norm2.weight", "pretrained.blocks.1.norm2.bias", "pretrained.blocks.1.mlp.fc1.weight", "pretrained.blocks.1.mlp.fc1.bias", "pretrained.blocks.1.mlp.fc2.weight", "pretrained.blocks.1.mlp.fc2.bias", "pretrained.blocks.1.ls2.gamma", "pretrained.blocks.2.norm1.weight", "pretrained.blocks.2.norm1.bias", "pretrained.blocks.2.attn.qkv.weight", "pretrained.blocks.2.attn.qkv.bias", "pretrained.blocks.2.attn.proj.weight", "pretrained.blocks.2.attn.proj.bias", "pretrained.blocks.2.ls1.gamma", "pretrained.blocks.2.norm2.weight", "pretrained.blocks.2.norm2.bias", "pretrained.blocks.2.mlp.fc1.weight", "pretrained.blocks.2.mlp.fc1.bias", "pretrained.blocks.2.mlp.fc2.weight", "pretrained.blocks.2.mlp.fc2.bias", "pretrained.blocks.2.ls2.gamma", "pretrained.blocks.3.norm1.weight", "pretrained.blocks.3.norm1.bias", "pretrained.blocks.3.attn.qkv.weight", "pretrained.blocks.3.attn.qkv.bias", "pretrained.blocks.3.attn.proj.weight", "pretrained.blocks.3.attn.proj.bias", "pretrained.blocks.3.ls1.gamma", "pretrained.blocks.3.norm2.weight", "pretrained.blocks.3.norm2.bias", "pretrained.blocks.3.mlp.fc1.weight", "pretrained.blocks.3.mlp.fc1.bias", "pretrained.blocks.3.mlp.fc2.weight", "pretrained.blocks.3.mlp.fc2.bias", "pretrained.blocks.3.ls2.gamma", "pretrained.blocks.4.norm1.weight", "pretrained.blocks.4.norm1.bias", "pretrained.blocks.4.attn.qkv.weight", "pretrained.blocks.4.attn.qkv.bias", "pretrained.blocks.4.attn.proj.weight", "pretrained.blocks.4.attn.proj.bias", "pretrained.blocks.4.ls1.gamma", "pretrained.blocks.4.norm2.weight", "pretrained.blocks.4.norm2.bias", "pretrained.blocks.4.mlp.fc1.weight", "pretrained.blocks.4.mlp.fc1.bias", "pretrained.blocks.4.mlp.fc2.weight", "pretrained.blocks.4.mlp.fc2.bias", "pretrained.blocks.4.ls2.gamma", "pretrained.blocks.5.norm1.weight", "pretrained.blocks.5.norm1.bias", "pretrained.blocks.5.attn.qkv.weight", "pretrained.blocks.5.attn.qkv.bias", "pretrained.blocks.5.attn.proj.weight", "pretrained.blocks.5.attn.proj.bias", "pretrained.blocks.5.ls1.gamma", "pretrained.blocks.5.norm2.weight", "pretrained.blocks.5.norm2.bias", "pretrained.blocks.5.mlp.fc1.weight", "pretrained.blocks.5.mlp.fc1.bias", "pretrained.blocks.5.mlp.fc2.weight", "pretrained.blocks.5.mlp.fc2.bias", "pretrained.blocks.5.ls2.gamma", "pretrained.blocks.6.norm1.weight", "pretrained.blocks.6.norm1.bias", "pretrained.blocks.6.attn.qkv.weight", "pretrained.blocks.6.attn.qkv.bias", "pretrained.blocks.6.attn.proj.weight", "pretrained.blocks.6.attn.proj.bias", "pretrained.blocks.6.ls1.gamma", "pretrained.blocks.6.norm2.weight", "pretrained.blocks.6.norm2.bias", 
"pretrained.blocks.6.mlp.fc1.weight", "pretrained.blocks.6.mlp.fc1.bias", "pretrained.blocks.6.mlp.fc2.weight", "pretrained.blocks.6.mlp.fc2.bias", "pretrained.blocks.6.ls2.gamma", "pretrained.blocks.7.norm1.weight", "pretrained.blocks.7.norm1.bias", "pretrained.blocks.7.attn.qkv.weight", "pretrained.blocks.7.attn.qkv.bias", "pretrained.blocks.7.attn.proj.weight", "pretrained.blocks.7.attn.proj.bias", "pretrained.blocks.7.ls1.gamma", "pretrained.blocks.7.norm2.weight", "pretrained.blocks.7.norm2.bias", "pretrained.blocks.7.mlp.fc1.weight", "pretrained.blocks.7.mlp.fc1.bias", "pretrained.blocks.7.mlp.fc2.weight", "pretrained.blocks.7.mlp.fc2.bias", "pretrained.blocks.7.ls2.gamma", "pretrained.blocks.8.norm1.weight", "pretrained.blocks.8.norm1.bias", "pretrained.blocks.8.attn.qkv.weight", "pretrained.blocks.8.attn.qkv.bias", "pretrained.blocks.8.attn.proj.weight", "pretrained.blocks.8.attn.proj.bias", "pretrained.blocks.8.ls1.gamma", "pretrained.blocks.8.norm2.weight", "pretrained.blocks.8.norm2.bias", "pretrained.blocks.8.mlp.fc1.weight", "pretrained.blocks.8.mlp.fc1.bias", "pretrained.blocks.8.mlp.fc2.weight", "pretrained.blocks.8.mlp.fc2.bias", "pretrained.blocks.8.ls2.gamma", "pretrained.blocks.9.norm1.weight", "pretrained.blocks.9.norm1.bias", "pretrained.blocks.9.attn.qkv.weight", "pretrained.blocks.9.attn.qkv.bias", "pretrained.blocks.9.attn.proj.weight", "pretrained.blocks.9.attn.proj.bias", "pretrained.blocks.9.ls1.gamma", "pretrained.blocks.9.norm2.weight", "pretrained.blocks.9.norm2.bias", "pretrained.blocks.9.mlp.fc1.weight", "pretrained.blocks.9.mlp.fc1.bias", "pretrained.blocks.9.mlp.fc2.weight", "pretrained.blocks.9.mlp.fc2.bias", "pretrained.blocks.9.ls2.gamma", "pretrained.blocks.10.norm1.weight", "pretrained.blocks.10.norm1.bias", "pretrained.blocks.10.attn.qkv.weight", "pretrained.blocks.10.attn.qkv.bias", "pretrained.blocks.10.attn.proj.weight", "pretrained.blocks.10.attn.proj.bias", "pretrained.blocks.10.ls1.gamma", "pretrained.blocks.10.norm2.weight", "pretrained.blocks.10.norm2.bias", "pretrained.blocks.10.mlp.fc1.weight", "pretrained.blocks.10.mlp.fc1.bias", "pretrained.blocks.10.mlp.fc2.weight", "pretrained.blocks.10.mlp.fc2.bias", "pretrained.blocks.10.ls2.gamma", "pretrained.blocks.11.norm1.weight", "pretrained.blocks.11.norm1.bias", "pretrained.blocks.11.attn.qkv.weight", "pretrained.blocks.11.attn.qkv.bias", "pretrained.blocks.11.attn.proj.weight", "pretrained.blocks.11.attn.proj.bias", "pretrained.blocks.11.ls1.gamma", "pretrained.blocks.11.norm2.weight", "pretrained.blocks.11.norm2.bias", "pretrained.blocks.11.mlp.fc1.weight", "pretrained.blocks.11.mlp.fc1.bias", "pretrained.blocks.11.mlp.fc2.weight", "pretrained.blocks.11.mlp.fc2.bias", "pretrained.blocks.11.ls2.gamma", "pretrained.blocks.12.norm1.weight", "pretrained.blocks.12.norm1.bias", "pretrained.blocks.12.attn.qkv.weight", "pretrained.blocks.12.attn.qkv.bias", "pretrained.blocks.12.attn.proj.weight", "pretrained.blocks.12.attn.proj.bias", "pretrained.blocks.12.ls1.gamma", "pretrained.blocks.12.norm2.weight", "pretrained.blocks.12.norm2.bias", "pretrained.blocks.12.mlp.fc1.weight", "pretrained.blocks.12.mlp.fc1.bias", "pretrained.blocks.12.mlp.fc2.weight", "pretrained.blocks.12.mlp.fc2.bias", "pretrained.blocks.12.ls2.gamma", "pretrained.blocks.13.norm1.weight", "pretrained.blocks.13.norm1.bias", "pretrained.blocks.13.attn.qkv.weight", "pretrained.blocks.13.attn.qkv.bias", "pretrained.blocks.13.attn.proj.weight", "pretrained.blocks.13.attn.proj.bias", "pretrained.blocks.13.ls1.gamma", 
"pretrained.blocks.13.norm2.weight", "pretrained.blocks.13.norm2.bias", "pretrained.blocks.13.mlp.fc1.weight", "pretrained.blocks.13.mlp.fc1.bias", "pretrained.blocks.13.mlp.fc2.weight", "pretrained.blocks.13.mlp.fc2.bias", "pretrained.blocks.13.ls2.gamma", "pretrained.blocks.14.norm1.weight", "pretrained.blocks.14.norm1.bias", "pretrained.blocks.14.attn.qkv.weight", "pretrained.blocks.14.attn.qkv.bias", "pretrained.blocks.14.attn.proj.weight", "pretrained.blocks.14.attn.proj.bias", "pretrained.blocks.14.ls1.gamma", "pretrained.blocks.14.norm2.weight", "pretrained.blocks.14.norm2.bias", "pretrained.blocks.14.mlp.fc1.weight", "pretrained.blocks.14.mlp.fc1.bias", "pretrained.blocks.14.mlp.fc2.weight", "pretrained.blocks.14.mlp.fc2.bias", "pretrained.blocks.14.ls2.gamma", "pretrained.blocks.15.norm1.weight", "pretrained.blocks.15.norm1.bias", "pretrained.blocks.15.attn.qkv.weight", "pretrained.blocks.15.attn.qkv.bias", "pretrained.blocks.15.attn.proj.weight", "pretrained.blocks.15.attn.proj.bias", "pretrained.blocks.15.ls1.gamma", "pretrained.blocks.15.norm2.weight", "pretrained.blocks.15.norm2.bias", "pretrained.blocks.15.mlp.fc1.weight", "pretrained.blocks.15.mlp.fc1.bias", "pretrained.blocks.15.mlp.fc2.weight", "pretrained.blocks.15.mlp.fc2.bias", "pretrained.blocks.15.ls2.gamma", "pretrained.blocks.16.norm1.weight", "pretrained.blocks.16.norm1.bias", "pretrained.blocks.16.attn.qkv.weight", "pretrained.blocks.16.attn.qkv.bias", "pretrained.blocks.16.attn.proj.weight", "pretrained.blocks.16.attn.proj.bias", "pretrained.blocks.16.ls1.gamma", "pretrained.blocks.16.norm2.weight", "pretrained.blocks.16.norm2.bias", "pretrained.blocks.16.mlp.fc1.weight", "pretrained.blocks.16.mlp.fc1.bias", "pretrained.blocks.16.mlp.fc2.weight", "pretrained.blocks.16.mlp.fc2.bias", "pretrained.blocks.16.ls2.gamma", "pretrained.blocks.17.norm1.weight", "pretrained.blocks.17.norm1.bias", "pretrained.blocks.17.attn.qkv.weight", "pretrained.blocks.17.attn.qkv.bias", "pretrained.blocks.17.attn.proj.weight", "pretrained.blocks.17.attn.proj.bias", "pretrained.blocks.17.ls1.gamma", "pretrained.blocks.17.norm2.weight", "pretrained.blocks.17.norm2.bias", "pretrained.blocks.17.mlp.fc1.weight", "pretrained.blocks.17.mlp.fc1.bias", "pretrained.blocks.17.mlp.fc2.weight", "pretrained.blocks.17.mlp.fc2.bias", "pretrained.blocks.17.ls2.gamma", "pretrained.blocks.18.norm1.weight", "pretrained.blocks.18.norm1.bias", "pretrained.blocks.18.attn.qkv.weight", "pretrained.blocks.18.attn.qkv.bias", "pretrained.blocks.18.attn.proj.weight", "pretrained.blocks.18.attn.proj.bias", "pretrained.blocks.18.ls1.gamma", "pretrained.blocks.18.norm2.weight", "pretrained.blocks.18.norm2.bias", "pretrained.blocks.18.mlp.fc1.weight", "pretrained.blocks.18.mlp.fc1.bias", "pretrained.blocks.18.mlp.fc2.weight", "pretrained.blocks.18.mlp.fc2.bias", "pretrained.blocks.18.ls2.gamma", "pretrained.blocks.19.norm1.weight", "pretrained.blocks.19.norm1.bias", "pretrained.blocks.19.attn.qkv.weight", "pretrained.blocks.19.attn.qkv.bias", "pretrained.blocks.19.attn.proj.weight", "pretrained.blocks.19.attn.proj.bias", "pretrained.blocks.19.ls1.gamma", "pretrained.blocks.19.norm2.weight", "pretrained.blocks.19.norm2.bias", "pretrained.blocks.19.mlp.fc1.weight", "pretrained.blocks.19.mlp.fc1.bias", "pretrained.blocks.19.mlp.fc2.weight", "pretrained.blocks.19.mlp.fc2.bias", "pretrained.blocks.19.ls2.gamma", "pretrained.blocks.20.norm1.weight", "pretrained.blocks.20.norm1.bias", "pretrained.blocks.20.attn.qkv.weight", "pretrained.blocks.20.attn.qkv.bias", 
"pretrained.blocks.20.attn.proj.weight", "pretrained.blocks.20.attn.proj.bias", "pretrained.blocks.20.ls1.gamma", "pretrained.blocks.20.norm2.weight", "pretrained.blocks.20.norm2.bias", "pretrained.blocks.20.mlp.fc1.weight", "pretrained.blocks.20.mlp.fc1.bias", "pretrained.blocks.20.mlp.fc2.weight", "pretrained.blocks.20.mlp.fc2.bias", "pretrained.blocks.20.ls2.gamma", "pretrained.blocks.21.norm1.weight", "pretrained.blocks.21.norm1.bias", "pretrained.blocks.21.attn.qkv.weight", "pretrained.blocks.21.attn.qkv.bias", "pretrained.blocks.21.attn.proj.weight", "pretrained.blocks.21.attn.proj.bias", "pretrained.blocks.21.ls1.gamma", "pretrained.blocks.21.norm2.weight", "pretrained.blocks.21.norm2.bias", "pretrained.blocks.21.mlp.fc1.weight", "pretrained.blocks.21.mlp.fc1.bias", "pretrained.blocks.21.mlp.fc2.weight", "pretrained.blocks.21.mlp.fc2.bias", "pretrained.blocks.21.ls2.gamma", "pretrained.blocks.22.norm1.weight", "pretrained.blocks.22.norm1.bias", "pretrained.blocks.22.attn.qkv.weight", "pretrained.blocks.22.attn.qkv.bias", "pretrained.blocks.22.attn.proj.weight", "pretrained.blocks.22.attn.proj.bias", "pretrained.blocks.22.ls1.gamma", "pretrained.blocks.22.norm2.weight", "pretrained.blocks.22.norm2.bias", "pretrained.blocks.22.mlp.fc1.weight", "pretrained.blocks.22.mlp.fc1.bias", "pretrained.blocks.22.mlp.fc2.weight", "pretrained.blocks.22.mlp.fc2.bias", "pretrained.blocks.22.ls2.gamma", "pretrained.blocks.23.norm1.weight", "pretrained.blocks.23.norm1.bias", "pretrained.blocks.23.attn.qkv.weight", "pretrained.blocks.23.attn.qkv.bias", "pretrained.blocks.23.attn.proj.weight", "pretrained.blocks.23.attn.proj.bias", "pretrained.blocks.23.ls1.gamma", "pretrained.blocks.23.norm2.weight", "pretrained.blocks.23.norm2.bias", "pretrained.blocks.23.mlp.fc1.weight", "pretrained.blocks.23.mlp.fc1.bias", "pretrained.blocks.23.mlp.fc2.weight", "pretrained.blocks.23.mlp.fc2.bias", "pretrained.blocks.23.ls2.gamma", "pretrained.norm.weight", "pretrained.norm.bias", "depth_head.projects.0.weight", "depth_head.projects.0.bias", "depth_head.projects.1.weight", "depth_head.projects.1.bias", "depth_head.projects.2.weight", "depth_head.projects.2.bias", "depth_head.projects.3.weight", "depth_head.projects.3.bias", "depth_head.resize_layers.0.weight", "depth_head.resize_layers.0.bias", "depth_head.resize_layers.1.weight", "depth_head.resize_layers.1.bias", "depth_head.resize_layers.3.weight", "depth_head.resize_layers.3.bias", "depth_head.scratch.layer1_rn.weight", "depth_head.scratch.layer2_rn.weight", "depth_head.scratch.layer3_rn.weight", "depth_head.scratch.layer4_rn.weight", "depth_head.scratch.refinenet1.out_conv.weight", "depth_head.scratch.refinenet1.out_conv.bias", "depth_head.scratch.refinenet1.resConfUnit1.conv1.weight", "depth_head.scratch.refinenet1.resConfUnit1.conv1.bias", "depth_head.scratch.refinenet1.resConfUnit1.conv2.weight", "depth_head.scratch.refinenet1.resConfUnit1.conv2.bias", "depth_head.scratch.refinenet1.resConfUnit2.conv1.weight", "depth_head.scratch.refinenet1.resConfUnit2.conv1.bias", "depth_head.scratch.refinenet1.resConfUnit2.conv2.weight", "depth_head.scratch.refinenet1.resConfUnit2.conv2.bias", "depth_head.scratch.refinenet2.out_conv.weight", "depth_head.scratch.refinenet2.out_conv.bias", "depth_head.scratch.refinenet2.resConfUnit1.conv1.weight", "depth_head.scratch.refinenet2.resConfUnit1.conv1.bias", "depth_head.scratch.refinenet2.resConfUnit1.conv2.weight", "depth_head.scratch.refinenet2.resConfUnit1.conv2.bias", "depth_head.scratch.refinenet2.resConfUnit2.conv1.weight", 
"depth_head.scratch.refinenet2.resConfUnit2.conv1.bias", "depth_head.scratch.refinenet2.resConfUnit2.conv2.weight", "depth_head.scratch.refinenet2.resConfUnit2.conv2.bias", "depth_head.scratch.refinenet3.out_conv.weight", "depth_head.scratch.refinenet3.out_conv.bias", "depth_head.scratch.refinenet3.resConfUnit1.conv1.weight", "depth_head.scratch.refinenet3.resConfUnit1.conv1.bias", "depth_head.scratch.refinenet3.resConfUnit1.conv2.weight", "depth_head.scratch.refinenet3.resConfUnit1.conv2.bias", "depth_head.scratch.refinenet3.resConfUnit2.conv1.weight", "depth_head.scratch.refinenet3.resConfUnit2.conv1.bias", "depth_head.scratch.refinenet3.resConfUnit2.conv2.weight", "depth_head.scratch.refinenet3.resConfUnit2.conv2.bias", "depth_head.scratch.refinenet4.out_conv.weight", "depth_head.scratch.refinenet4.out_conv.bias", "depth_head.scratch.refinenet4.resConfUnit1.conv1.weight", "depth_head.scratch.refinenet4.resConfUnit1.conv1.bias", "depth_head.scratch.refinenet4.resConfUnit1.conv2.weight", "depth_head.scratch.refinenet4.resConfUnit1.conv2.bias", "depth_head.scratch.refinenet4.resConfUnit2.conv1.weight", "depth_head.scratch.refinenet4.resConfUnit2.conv1.bias", "depth_head.scratch.refinenet4.resConfUnit2.conv2.weight", "depth_head.scratch.refinenet4.resConfUnit2.conv2.bias", "depth_head.scratch.output_conv1.weight", "depth_head.scratch.output_conv1.bias", "depth_head.scratch.output_conv2.0.weight", "depth_head.scratch.output_conv2.0.bias", "depth_head.scratch.output_conv2.2.weight", "depth_head.scratch.output_conv2.2.bias".

Color Normalization

I tried both the foundation model and the fine-tuned metric depth model. I noticed a difference: the foundation models are trained and run with color normalization applied to the input images, while the fine-tuned metric depth models are tuned and run without color normalization (i.e., pixel values are only divided by 255.0). Why is there such a difference?

Thanks for the help!
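To make the difference concrete, here is a minimal sketch of the two preprocessing paths as described in the question (the mean/std values are the ImageNet statistics used by NormalizeImage elsewhere in this repo; whether the metric pipeline really skips normalization is exactly what the question asks the authors to confirm):

import numpy as np

# ImageNet statistics used by the relative-depth (foundation) pipeline.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_relative(img_bgr_uint8):
    # Scale to [0, 1], convert BGR -> RGB, then apply ImageNet mean/std normalization.
    img = img_bgr_uint8[..., ::-1].astype(np.float32) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD

def preprocess_metric_observed(img_bgr_uint8):
    # Only scale to [0, 1]; this mirrors what the questioner observed for the metric models.
    return img_bgr_uint8[..., ::-1].astype(np.float32) / 255.0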

Running the pretrained model on a directory

Thanks for releasing such excellent work! Your results look smooth and clean.
However, there might be a small bug in the released code: when I try to run the pretrained model on a directory of images, OpenCV complains about missing image files.
The command:

python run.py --encoder vitl --load-from checkpoints/depth_anything_vitl14.pth --img-path <input_directory> --outdir <another_directory> --localhub

The output:

Total parameters: 335.32M
  0%|          | 0/150 [00:00<?, ?it/s]000000.jpg
[ WARN:[email protected]] global loadsave.cpp:248 findDecoder imread_('000000.jpg'): can't open/read file: check file path/integrity
  0%|          | 0/150 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/xuzhen/local/depth-anything/run.py", line 74, in <module>
    image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cv2.error: OpenCV(4.8.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

From the look of it, the input directory's path is not prepended to the filenames after gathering the image files in run.py.
Maybe I'm mistaken, but shouldn't there be a line like this after line 69 of run.py:

filenames = [os.path.join(args.img_path, f) for f in filenames]
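For context, a minimal sketch of how the directory handling could look with that join applied (the surrounding variable names from run.py are assumed here, not quoted):

import os

if os.path.isfile(args.img_path):
    filenames = [args.img_path]
else:
    # Prepend the directory path so cv2.imread receives full paths, not bare filenames.
    filenames = sorted(os.listdir(args.img_path))
    filenames = [os.path.join(args.img_path, f) for f in filenames]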

About Table 6 in the Original Paper

Thanks for your excellent and outstanding efforts in this work! Really love the work.

As for the transfer-performance experiments in Table 6, I have some questions about the detailed training settings:

  • Which of ViT-S, ViT-B, or ViT-L is used for this experiment?
  • Is the model in Table 6 trained from scratch (starting from the original DINOv2 weights and fine-tuning on a specific single dataset), or is it trained in a self-supervised manner on the whole 1.5M labeled and 62M unlabeled dataset? In other words, is any self-supervision involved in this experimental setting?

Thanks a lot for your kind reply and explanation.

RGB-depth pairs for fine-tuning

Hi @LiheYoung, I want to fine-tune the metric depth estimation code on my own dataset. How many RGB-depth pairs are required to get good metric depth results? The NYUv2 dataset has ~36,000 RGB-depth pairs in its training set. Do you think a few hundred or a few thousand pairs may be enough for accurate metric depth estimation?

Unstable results

Hi! Great work, very useful.
One question: is it possible to get different results on the same image across several runs? Could this be caused by the interpolation or resize method in the transform?

depth map

How can we retrieve the scale factor along with the depth map? For example, how can I obtain relative distances either in real-world units (cm, mm) or relative to pixels (e.g., the depth difference between points A and B equals a width of x pixels)?

Training code

Thanks for the amazing work. Will we get training code?

Fine-tune with custom dataset

This is really a fantastic project, thank you for your effort!

Will you provide a way to fine-tune this model with a custom dataset?

Optimal image size and aspect ratio for metric depth estimation

Thank you all for the really awesome code and research! I am interested in metric depth estimation on KITTI. The metric_depth/evaluate.py script drastically changes the aspect ratio of the KITTI input images, so that the output depth map is 518x392 pixels while the input images are 1216x352; this is enabled by passing mode="eval" to the get_config() call. Evaluating the model with the changed aspect ratio gives higher accuracy:

{'a1': 0.987, 'a2': 0.998, 'a3': 0.999, 'abs_rel': 0.04, 'rmse': 1.824, 'log_10': 0.017, 'rmse_log': 0.062, 'silog': 5.806, 'sq_rel': 0.107}

compared to running depth estimation at full resolution with the original aspect ratio (mode="infer"):

{'a1': 0.982, 'a2': 0.997, 'a3': 0.999, 'abs_rel': 0.056, 'rmse': 2.153, 'log_10': 0.025, 'rmse_log': 0.078, 'silog': 6.881, 'sq_rel': 0.152}
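For reference, a minimal sketch of the two configurations being compared, assuming the ZoeDepth-style get_config interface that metric_depth/evaluate.py builds on (illustrative only, not the exact evaluation script):

from zoedepth.utils.config import get_config

# mode="eval" resizes KITTI frames to roughly 518x392, distorting the 1216x352 aspect ratio.
config_eval = get_config("zoedepth", "eval", "kitti")

# mode="infer" keeps the original resolution and aspect ratio.
config_infer = get_config("zoedepth", "infer", "kitti")

print(config_eval.get("img_size"), config_infer.get("img_size"))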

I wanted to confirm: what is the optimal image size and aspect ratio for metric depth estimation?

Thanks,
Eldar.

How to get depth (inverse disparity)

Hi there, I loved your work. I want to ask how to obtain depth (the relative distance from the camera to the object). As far as I can tell, Depth Anything produces disparity (not depth). However, I tried to convert disparity to depth with depth = 1/disparity, and the result does not look correct. Looking forward to your answers!
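A common workaround (not an official recipe from this repository) is to treat the output as affine-invariant disparity: it is only defined up to an unknown scale and shift, so a plain 1/disparity will not be correct. If a few sparse metric depth measurements are available, the scale and shift can be fitted in disparity space before inverting; a minimal sketch, with all names assumed for illustration:

import numpy as np

def disparity_to_depth(pred_disparity, sample_disparity, sample_depth):
    # Fit a, b such that a * disparity + b ~= 1 / depth on the sparse samples,
    # then invert the aligned disparity to get depth in the samples' units.
    A = np.stack([sample_disparity, np.ones_like(sample_disparity)], axis=1)
    target = 1.0 / sample_depth
    (a, b), *_ = np.linalg.lstsq(A, target, rcond=None)
    aligned = a * pred_disparity + b
    return 1.0 / np.clip(aligned, 1e-6, None)

Here sample_disparity and sample_depth would come from, e.g., a handful of LiDAR points or objects at known distances.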

Segmentation is not working

Thank you for this wonderful work.
I tried to run segmentation as described here: https://github.com/LiheYoung/Depth-Anything/tree/main/semseg

But I could not get it working properly. There are some obvious bugs: in dinov2.py it should be something like
self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
and not as here provided: https://github.com/LiheYoung/Depth-Anything/blob/main/semseg/dinov2.py#L18

Here is another crash I had to fix after following the steps you describe:
https://github.com/open-mmlab/mmsegmentation/blob/main/mmseg/models/segmentors/encoder_decoder.py#L284

Finally, I managed to make it run without crashing, but the output is very wrong. Any idea what the issue might be, given that you also report SOTA results on segmentation?

[Attached screenshot: result_demo]

Using Metric depth

Hi,

Can I use the metric depth checkpoints with the provided code snippet

https://github.com/LiheYoung/Depth-Anything/tree/main#import-depth-anything-to-your-project

or should I only use them through the ZoeDepth pipeline?

from depth_anything.dpt import DPT_DINOv2
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet
from torchvision.transforms import Compose  # used for the transform below

import cv2
import torch

depth_anything = DPT_DINOv2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024], localhub=True)
depth_anything.load_state_dict(torch.load('checkpoints/depth_anything_vitl14.pth'))

transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

image = cv2.cvtColor(cv2.imread('your image path'), cv2.COLOR_BGR2RGB) / 255.0
image = transform({'image': image})['image']
image = torch.from_numpy(image).unsqueeze(0)

# depth shape: 1xHxW
depth = depth_anything(image)

KITTI online test result

Depth Anything is impressive work!

Could you provide the result on KITTI online test split? Thanks!

`RuntimeError: Error(s) in loading state_dict for ZoeDepth` when running `metric_depth/evaluate.py`

Thanks for your great work. There is an issue when running python metric_depth/evaluate.py -m zoedepth --pretrained_resource "local::../checkpoints_metric_depth/depth_anything_metric_depth_outdoor.pt" -d kitti. Since the server is not connected to the Internet, I downloaded the pretrained resource url::https://github.com/isl-org/ZoeDepth/releases/download/v1.0/ZoeD_M12_N.pt and put it in /root/.cache/torch/hub/checkpoints. However, errors are reported:

Missing key(s) in state_dict: "core.core.pretrained.cls_token", "core.core.pretrained.pos_embed", "core.core.pretrained.mask_token", "core.core.pretrained.patch_embed.proj.weight", "core.core.pretrained.patch_embed.proj.bias", "core.core.pretrained.blocks.0.norm1.weight", "core.core.pretrained.blocks.0.norm1.bias",...

Unexpected key(s) in state_dict: "core.core.scratch.layer1_rn.weight", "core.core.scratch.layer2_rn.weight", "core.core.scratch.layer3_rn.weight", "core.core.scratch.layer4_rn.weight", "core.core.scratch.refinenet1.out_conv.weight", "core.core.scratch.refinenet1.out_conv.bias", "core.core.scratch.refinenet1.resConfUnit1.conv1.weight", "core.core.scratch.refinenet1.resConfUnit1.conv1.bias", "core.core.scratch.refinenet1.resConfUnit1.conv2.weight", "core.core.scratch.refinenet1.resConfUnit1.conv2.bias", "core.core.scratch.refinenet1.resConfUnit2.conv1.weight",..."

I wonder whether there is a mismatch between the downloaded weights and the fine-tuned model. Hoping for your kind suggestions!

Metric depth missing the .pth file

Thanks for such a great job.
When I run evaluate.py, it raises an error, No such file or directory: '../checkpoints/depth_anything_vitl14.pth', in metric_depth/zoedepth/models/base_models/depth_anything.py:

depth_anything = DPT_DINOv2(out_channels=[256, 512, 1024, 1024], use_clstoken=False)

state_dict = torch.load('../checkpoints/depth_anything_vitl14.pth', map_location='cpu')

depth_anything.load_state_dict(state_dict)

Where can I find the file depth_anything_vitl14.pth?
On https://huggingface.co/LiheYoung/depth_anything_vitl14/tree/main I can only find depth_anything_vitl14.bin.

Better resolution of depthmap?

Hi, this work is great! I have modified it a bit to read and write EXR files in float and to put the depth in the proper Z-buffer channel with OpenImageIO. I have been experimenting with getting better resolution by changing the resize values. It works to a degree when I run at 777 pixels, but beyond that, up to 2072 (at which point I fill up my 24 GB of VRAM), the result looks much worse.

transform = Compose([
    Resize(
        width=2072,
        height=2072,

[Screenshot: 777 on the left and 2072 on the right]

Can I do anything else to get the same nice definition but with crisper edges? Do I need to retrain?
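For completeness, a sketch of the full transform implied by the truncated snippet above, reusing the argument values quoted in the "Using Metric depth" issue earlier on this page and only changing the target size (larger inputs are not guaranteed to improve edge quality):

import cv2
from torchvision.transforms import Compose
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

transform = Compose([
    Resize(
        width=2072,
        height=2072,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,  # ViT-L/14 patch size: input sides must be multiples of 14
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])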
