dahiyaaneesh / peclr

This is the pretraining code for PeCLR, an equivariant contrastive learning framework for 3D hand pose estimation. The paper was presented at ICCV 2021.

Home Page: https://ait.ethz.ch/projects/2021/PeCLR/

License: MIT License

Language: Python (100.00%)

peclr's Introduction

PeCLR: Self-Supervised 3D Hand Pose Estimation from monocular RGB via Equivariant Contrastive Learning

Paper | Project Page | Blog Post


This is the official repository containing the code for the paper PeCLR: Self-Supervised 3D Hand Pose Estimation from monocular RGB via Equivariant Contrastive Learning.

Installation

The code has been tested on Ubuntu 18.04.5 with Python 3.8.10.

  1. Set up the Python environment.
cd path_to_peclr_repo
python3 -m venv ~/peclr_env
source ~/peclr_env/bin/activate
  2. Install PyTorch (1.7.0) and the other requirements. More information on installing PyTorch 1.7.0 can be found here.
pip install torch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 
pip install -r requirements.txt
  3. Define the environment variables.
export BASE_PATH='<path_to_repo>'
export COMET_API_KEY=''
export COMET_PROJECT=''
export COMET_WORKSPACE=''
export PYTHONPATH="$BASE_PATH"
export DATA_PATH="$BASE_PATH/data/raw/"
export SAVED_MODELS_BASE_PATH="$BASE_PATH/data/models/peclr"
export SAVED_META_INFO_PATH="$BASE_PATH/data/models" 
  4. Download FreiHAND and YouTube 3D Hands, and extract the datasets into data/raw/freihand_dataset and data/raw/youtube_3d_hands of the main PeCLR directory, respectively. A sanity check of the resulting layout is sketched below.
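A quick way to verify the layout after extraction (a minimal sketch using the environment variables defined above; this check is not part of the repository):

import os

# Verify that the extracted datasets sit where the loaders expect them.
# Directory names follow step 4 above.
data_path = os.environ["DATA_PATH"]
for name in ("freihand_dataset", "youtube_3d_hands"):
    path = os.path.join(data_path, name)
    print(path, "OK" if os.path.isdir(path) else "MISSING")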

Training

Note: Comet is the logging service used to monitor the training of the models. Setting up Comet is optional and does not affect model training.

The following commands can be used to train the best-performing PeCLR models from the main paper.

ResNet-50

python src/experiments/peclr_training.py --color_jitter --random_crop --rotate --crop -resnet_size 50  -sources freihand -sources youtube  --resize   -epochs 100 -batch_size 128  -accumulate_grad_batches 16 -save_top_k 1  -save_period 1   -num_workers 8

ResNet-152

python src/experiments/peclr_training.py --color_jitter --random_crop --rotate --crop -resnet_size 152  -sources freihand -sources youtube  --resize   -epochs 100 -batch_size 128  -accumulate_grad_batches 16 -save_top_k 1  -save_period 1   -num_workers 8

Loading PeCLR weights into a Torchvision ResNet model

The pre-trained PeCLR weights acquired from training can be easily loaded into a ResNet model from torchvision.models. The pre-trained weights can then be used for fine-tuning on labeled datasets.

from src.models.port_model import peclr_to_torchvision
import torchvision

resnet152 = torchvision.models.resnet152(pretrained=True)
peclr_to_torchvision(resnet152, "path_to_peclr_with_resnet_152_base")
# Note: the last 'fc' layer of the ResNet model is not updated.
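As an illustration, the ported backbone can then be fine-tuned on a labeled dataset. The regression head and the hyperparameters below are placeholders for illustration only, not the setup used in the paper:

import torch
import torch.nn as nn
import torchvision

# Continuing from the snippet above: port the PeCLR weights, then attach a
# hypothetical regression head for 21 hand keypoints (illustrative only).
resnet152 = torchvision.models.resnet152(pretrained=True)
# peclr_to_torchvision(resnet152, "path_to_peclr_with_resnet_152_base")
resnet152.fc = nn.Linear(resnet152.fc.in_features, 21 * 2)

optimizer = torch.optim.Adam(resnet152.parameters(), lr=1e-4)
criterion = nn.MSELoss()

# One illustrative training step on a dummy batch of images and 2D keypoints.
images = torch.randn(8, 3, 224, 224)
targets = torch.randn(8, 21 * 2)
optimizer.zero_grad()
loss = criterion(resnet152(images), targets)
loss.backward()
optimizer.step()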

Pre-trained PeCLR models

We offer ResNet-50 and ResNet-152 pre-trained on FreiHAND and YT3DH using PeCLR. The models can be downloaded here and unpacked via tar:

# Download pre-trained ResNet-50
wget https://dataset.ait.ethz.ch/downloads/guSEovHBpR/peclr_rn50.tar.gz
tar -xvzf peclr_rn50.tar.gz
# Download pre-trained ResNet-152
wget https://dataset.ait.ethz.ch/downloads/guSEovHBpR/peclr_rn152.tar.gz
tar -xvzf peclr_rn152.tar.gz

The models have been converted to torchvision's model description and can be loaded directly:

import torch
import torchvision.models as models
# For ResNet-50
rn50 = models.resnet50()
peclr_weights = torch.load('peclr_rn50_yt3dh_fh.pth')
rn50.load_state_dict(peclr_weights['state_dict'])
# For ResNet-152
rn152 = models.resnet152()
peclr_weights = torch.load('peclr_rn152_yt3dh_fh.pth')
rn152.load_state_dict(peclr_weights['state_dict'])
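Once loaded, the backbone can also be used directly as a feature extractor, e.g. to inspect the learned representation (an illustrative sketch, not repository code):

import torch
import torch.nn as nn
import torchvision.models as models

# Use the PeCLR-pre-trained ResNet-50 as a frozen feature extractor (illustrative).
rn50 = models.resnet50()
rn50.load_state_dict(torch.load('peclr_rn50_yt3dh_fh.pth')['state_dict'])
rn50.fc = nn.Identity()  # drop the classification head, keep 2048-d features
rn50.eval()
with torch.no_grad():
    features = rn50(torch.randn(1, 3, 224, 224))  # shape: (1, 2048)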

Fine-tuned PeCLR models

We offer ResNet-50 and ResNet-152 fine-tuned on FreiHAND from the above PeCLR pre-trained weights. The models can be downloaded here and unpacked via tar:

# Download fine-tuned ResNet-50
wget https://dataset.ait.ethz.ch/downloads/guSEovHBpR/rn50_peclr_yt3d-fh_pt_fh_ft.tar.gz
tar -xvzf rn50_peclr_yt3d-fh_pt_fh_ft.tar.gz
# Download fine-tuned ResNet-152
wget https://dataset.ait.ethz.ch/downloads/guSEovHBpR/rn152_peclr_yt3d-fh_pt_fh_ft.tar.gz
tar -xvzf rn152_peclr_yt3d-fh_pt_fh_ft.tar.gz

The model weights follow the model description of src/models/rn_25D_wMLPref.py. Thus, one can load them in the following manner:

import torch
from src.models.rn_25D_wMLPref import RN_25D_wMLPref
# For RN50
model_type = 'rn50'
# For RN152
model_type = 'rn152'
model = RN_25D_wMLPref(backend_model=model_type)
model_path = f'{model_type}_peclr_yt3d-fh_pt_fh_ft.pth'
checkpoint = torch.load(model_path)
model.load_state_dict(checkpoint['state_dict'])

These model weights achieve the following performance on the FreiHAND leaderboard:

ResNet-50 + PeCLR
Evaluation 3D KP results:
auc=0.357, mean_kp3d_avg=4.71 cm
Evaluation 3D KP ALIGNED results:
auc=0.860, mean_kp3d_avg=0.71 cm
ResNet-152 + PeCLR
Evaluation 3D KP results:
auc=0.360, mean_kp3d_avg=4.56 cm
Evaluation 3D KP ALIGNED results:
auc=0.868, mean_kp3d_avg=0.66 cm

To reproduce these numbers, execute the following command:

python testing/pred_fh.py --model_path {PATH TO {rn50,rn152}_peclr_yt3d-fh_pt_fh_ft.pth}

This will create a file called pred_{rn50,rn152}.zip, which can be uploaded to CodaLab to reproduce the results above.

Citation

If this repository has been useful for your project, please cite the following work:

@inproceedings{spurr2021self,
  title={Self-Supervised 3D Hand Pose Estimation from monocular RGB via Contrastive Learning},
  author={Spurr, Adrian and Dahiya, Aneesh and Wang, Xi and Zhang, Xucong and Hilliges, Otmar},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={11230--11239},
  year={2021}
}

If the RN_25D_wMLPref model description was useful for your project, please cite the following works:

@inproceedings{iqbal2018hand,
  title={Hand pose estimation via latent 2.5D heatmap regression},
  author={Iqbal, Umar and Molchanov, Pavlo and Breuel, Thomas and Gall, Juergen and Kautz, Jan},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={118--134},
  year={2018}
}
@inproceedings{spurr2020weakly,
  title={Weakly supervised 3d hand pose estimation via biomechanical constraints},
  author={Spurr, Adrian and Iqbal, Umar and Molchanov, Pavlo and Hilliges, Otmar and Kautz, Jan},
  booktitle={Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XVII 16},
  pages={211--228},
  year={2020},
  organization={Springer}
}

peclr's People

Contributors

dahiyaaneesh, spurra


peclr's Issues

about effective batch size

Thanks for your excellent work and for releasing the code.

  1. I have a question about the effective batch size, which is batch_size 128 * accumulate_grad_batches 16 = 2048.
    Does this mean the model sees 128 samples at a time, calculates the gradient, and then sums the gradients over 16 such mini-batches (see the sketch after this list)? This kind of implementation differs from the usual notion of a batch size of 2048, where the model sees 2048 samples at once and the InfoNCE loss is computed over all 2048 samples rather than over 128 samples.
  2. Besides, I see that the precision is set to 16 bit. I wonder why it is necessary not to use 32 bit.
  3. In src/models/base_model.py, I find that warmup_epochs and max_epochs are rescaled by a factor of self.train_iters_per_epoch // self.config.num_of_mini_batch. Why is this rescaling necessary? If this factor is not equal to 1, the max_epochs in the learning rate scheduler does not match the max_epochs in the PL trainer, which I think is not quite reasonable.
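For reference, a minimal sketch of what gradient accumulation does (with a generic placeholder loss; this is not the repository's actual training loop):

import torch

# With batch_size = 128 and accumulate_grad_batches = 16, gradients from 16
# mini-batches are summed before a single optimizer step, so the update uses
# an effective batch of 2048 samples. Any loss computed per mini-batch (such
# as InfoNCE) still only sees 128 samples at a time.
accumulate_grad_batches = 16
model = torch.nn.Linear(10, 2)                 # placeholder model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()                   # placeholder for the contrastive loss

optimizer.zero_grad()
for _ in range(accumulate_grad_batches):
    x, y = torch.randn(128, 10), torch.randn(128, 2)      # one mini-batch
    loss = loss_fn(model(x), y) / accumulate_grad_batches
    loss.backward()                            # gradients accumulate in .grad
optimizer.step()                               # one update per effective batch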

Generate .pth file

I ran the command python src/experiments/peclr_training.py --color_jitter --random_crop --rotate --crop -resnet_size 50 -sources freihand -sources youtube --resize -epochs 100 -batch_size 128 -accumulate_grad_batches 16 -save_top_k 1 -save_period 1 -num_workers 8 for 100 epochs and got a checkpoint file saved at data/models/74c1b59eb8234ad4b9055d4756092e20/checkpoints/epoch=97.ckpt.

How do I convert it to a .pth file to reproduce the numbers in the paper?

Thank you for your help!
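One possible starting point (a hedged sketch, not an official answer from the maintainers): a Lightning .ckpt is itself a torch checkpoint, so its state_dict can either be ported with the peclr_to_torchvision helper shown above or re-saved manually; the exact key prefix to strip is an assumption here and may differ in practice:

import torch

# Sketch: convert the Lightning checkpoint mentioned above into a plain .pth file.
ckpt = torch.load(
    "data/models/74c1b59eb8234ad4b9055d4756092e20/checkpoints/epoch=97.ckpt",
    map_location="cpu",
)
state_dict = ckpt["state_dict"]
# Keys inside a LightningModule usually carry a module prefix (assumed here),
# which must be stripped before they match torchvision's ResNet naming.
state_dict = {k.split(".", 1)[-1]: v for k, v in state_dict.items()}
torch.save({"state_dict": state_dict}, "peclr_rn50_converted.pth")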

Questions about the dim of encodings

Hi, your work is very interesting! However, I'm not entirely clear about the inverse transformations in your paper and code. I noticed that in Eq. 3 of your ICCV paper, the dimension of z_i is m x 2; may I ask why the last dimension is 2? Is the encoding z_i not a feature vector but rather the xy coordinates of some important keypoints? Also, what 'm' means is not very clear in the paper. I hope you can reply!

How to finetune on FreiHAND?

Hi, sorry to bother you. You've really done a fantastic job. I think the idea of translating the image transformation to the latent space is cool. I'm working on my graduation project now and want to include your experiments in the paper. I wonder whether the fine-tuning code for the FreiHAND dataset is available? Thanks!
I'm looking forward to hearing from you soon :D

Reproducing Numbers

Hello,

Thanks a lot for releasing the code. I'm having trouble reproducing the numbers reported in the paper. I'm using your pre-trained models; however, I can't get the same numbers that you report:

ResNet-50 + PeCLR
Evaluation 3D KP results:
auc=0.357, mean_kp3d_avg=4.71 cm
Evaluation 3D KP ALIGNED results:
auc=0.860, mean_kp3d_avg=0.71 cm

As you describe, I'm loading your model as follows:


import torch
import torchvision.models as models
# For ResNet-50
rn50 = models.resnet50()
peclr_weights = torch.load('peclr_rn50_yt3dh_fh.pth')
rn50.load_state_dict(peclr_weights['state_dict'])
# For ResNet-152
rn152 = models.resnet152()
peclr_weights = torch.load('peclr_rn152_yt3dh_fh.pth')
rn152.load_state_dict(peclr_weights['state_dict'])

I'm then calling the "evaluate" function in the evaluation_utils.py file, evaluating on the FreiHAND "test" set. Do you have any other code snippet for evaluation besides evaluation_utils.py? There seem to be some bugs in this file.

Will the finetune code be released?

Hi Adrian,

Thanks for sharing your great work!

Will the code for fine-tuning the model on FreiHAND be released? (Maybe I missed it somewhere.)

Erroneous package in requirements.txt

I was trying to follow the instructions for installing the model and encountered this error:
197.0 ERROR: No matching distribution found for pkg-resources==0.0.0
A quick search surfaced that this seems to be a bug in older versions of virtualenv, and that this package should be removed from the file. A possible workaround is sketched below.
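One possible workaround (a sketch, not an official fix) is to drop the bogus pin before installing:

# Remove the pkg-resources==0.0.0 line from requirements.txt before pip install.
with open("requirements.txt") as f:
    lines = [line for line in f if not line.startswith("pkg-resources==0.0.0")]
with open("requirements.txt", "w") as f:
    f.writelines(lines)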

question of inverse transformation

Thank you for your work! But I don't quite understand the inverse transformation in your model. The code has the inverse transform of the projection; I wanted to apply the inverse transform to the encoded feature, but the dimensions did not correspond. How can I apply the inverse transform to the encoded feature?

Is my trained loss value reasonable?

Hi,

Could you please help me confirm that my unsupervised training loss is in a reasonable range?
(Since the supervised training code is not released, I couldn't verify it on FreiHAND.)

The following is the loss curve from the unsupervised training stage. The backbone is ResNet-152. After 100 epochs, the loss is around 3.60.

[screenshot of the training loss curve]

Thanks so much!

Question about training.

Hi.

Thank you for sharing the code.

I have a question about training the model in the fine-tuning stage.

I understand that the model uses the 2D joint and relative depth losses to train the hand pose estimator (pretrained encoder + fc layer), but there is another module that estimates the root depth, and it also seems to need training.

But it is not possible to update that module with the 2D joint and relative depth losses.

Is it correct to use another loss function in addition to the existing ones (e.g., a 3D joint loss) to update that module?

Thank you!


Scale is not equivariantly handled

Hi,

Thank you for your great work!

The paper says the proposed approach is equivariant to geometric transformations.
However, checking your code here, only rotation and translation are considered.

Where is the scale handled?

Can this run on Windows?

Hi, thanks for the great work. Can this run on Windows? Should I modify the environment variables or something? Sorry, I am not familiar with Linux.

Loss Weight to Train the RN_25D_wMLPref

Hi, thanks for the released code. I have some questions related to training the 2.5D hand representation for hand pose estimation.

From the related papers, I understand that the loss for the hand pose estimator has three terms: 1) the 2D pixel coordinates on the image plane, 2) the scale-normalized relative z, and 3) the scale-normalized root z (after refinement). I wonder whether my understanding is correct, and what values of the weight parameters are used to balance these three loss terms (a sketch of such a weighted sum follows below).

Thanks for your help!
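For reference, the weighted combination described above would look like the following sketch; the loss type and weight values are placeholders, not the ones used to train RN_25D_wMLPref:

import torch.nn.functional as F

# Illustrative weighted sum of the three loss terms described in this issue.
# w_2d, w_zrel and w_zroot are placeholder weights, not the actual values.
w_2d, w_zrel, w_zroot = 1.0, 1.0, 1.0

def total_loss(pred_2d, gt_2d, pred_zrel, gt_zrel, pred_zroot, gt_zroot):
    l_2d = F.l1_loss(pred_2d, gt_2d)           # 2D pixel coordinates
    l_zrel = F.l1_loss(pred_zrel, gt_zrel)     # scale-normalized relative z
    l_zroot = F.l1_loss(pred_zroot, gt_zroot)  # scale-normalized root z (refined)
    return w_2d * l_2d + w_zrel * l_zrel + w_zroot * l_zroot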

visualize some images

Hello! Your work is very interesting! I want to visualize some images, so I used your trained model "rn152_peclr_yt3d-fh_pt_fh_ft.pth" and the function plot_hand in src/visualization/visualize.py, but I get some terrible results (ignore the axis problem, just look at the keypoint coordinates). How can I get correct results like the ones shown in the paper? Thank you for your reply!
The following are my results:
[screenshots of the incorrect visualizations]

end-to-end fine-tuning? linear probing?

Hi, thank you for your great work.

I read the paper and tried to analyze the code, but I wasn't able to figure out whether PeCLR adopts end-to-end fine-tuning or linear probing for evaluating the latent representation.

In the ablation section, the paper says you freeze the encoder, but in other parts of the paper you use the term "fine-tuning".

augmentation order

Hi,

According to the comments you wrote here, it seems like there has to be a specific order for different augmentations.

Is there any reason for this?
If I change the order, are there negative effects on accuracy?

Test on Wild Image

Hi, thanks for your nice work and for sharing the pretrained model. There are some questions I'm not clear about, and I hope for your reply.
Should I run the supervised training to get the final layer for testing,
and how can I get the scale data for in-the-wild images?
Another question regards the augmentation: would it be beneficial to try an extra segmentation augmentation?
Thanks.

why only use RIGHT hand data?

This is fantastic work, thank you for your contribution.
I have a question: why do you only use RIGHT hand data? I haven't found any clues in the paper. This is the code snippet I found in peclr/src/data_loader/youtube_loader.py:
[screenshot of the relevant code in youtube_loader.py]
If I want to add another dataset, should I flip the samples to RIGHT hands (or LEFT hands)?
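If it helps, flipping a left-hand sample into a right-hand one is typically a horizontal mirror of both the image and the 2D keypoints; a minimal sketch (not code from this repository):

import torch

# Mirror a left-hand sample into a right-hand one.
# image: (3, H, W) tensor; keypoints_2d: (21, 2) tensor of (x, y) pixel coords.
def flip_to_right_hand(image, keypoints_2d):
    width = image.shape[2]
    flipped_image = torch.flip(image, dims=[2])        # mirror along the x-axis
    flipped_kp = keypoints_2d.clone()
    flipped_kp[:, 0] = width - 1 - flipped_kp[:, 0]    # mirror x coordinates
    return flipped_image, flipped_kp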
