cavalleria / cavaface

Face recognition training project (PyTorch)

License: MIT License

Python 99.85% Shell 0.15%
face-recognition pytorch network loss arcnegface arcface resnest attention-irse apex randaugment

cavaface's Introduction

cavaface: A Pytorch Training Framework for Deep Face Recognition


By Yaobin Li and Liying Chi

Introduction

This repo provides a high-performance distributed parallel training framework for face recognition with PyTorch, including various backbones (e.g., ResNet, IR, IR-SE, ResNeXt, AttentionNet-IR-SE, ResNeSt, HRNet), various losses (e.g., Softmax, Focal, SphereFace, CosFace, AmSoftmax, ArcFace, ArcNegFace, CurricularFace, Li-Arcface, QAMFace), various data augmentations (e.g., RandomErasing, Mixup, RandAugment, Cutout, CutMix), and a bag of tricks for improving performance (e.g., FP16 training with apex, label smoothing, LR warmup).
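As a concrete reference for the margin-based losses named above, here is a minimal ArcFace-style head. It is an illustrative sketch only, not this repo's implementation; the class and argument names are made up:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ArcMarginHead(nn.Module):
        """Minimal additive-angular-margin (ArcFace-style) head, for illustration only."""
        def __init__(self, embedding_size, num_classes, s=64.0, m=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(num_classes, embedding_size))
            nn.init.xavier_uniform_(self.weight)
            self.s, self.m = s, m

        def forward(self, embeddings, labels):
            # cosine similarity between L2-normalized embeddings and class centers
            cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
            theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
            # add the angular margin m only to the target-class logit
            target = F.one_hot(labels, cosine.size(1)).bool()
            logits = torch.where(target, torch.cos(theta + self.m), cosine)
            return logits * self.s  # scaled logits, fed to cross-entropy

Cross-entropy over these scaled logits gives the ArcFace loss; the other heads in the feature list (CosFace, CurricularFace, ArcNegFace, ...) differ mainly in how the margin is applied to the target logit.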

Features

  • Backbone
    • ResNet(IR-SE)
    • ResNeXt
    • DenseNet
    • MobileFaceNet
    • MobileNetV3
    • EfficientNet
    • ProxylessNas
    • GhostNet
    • AttentionNet-IRSE
    • ResNeSt
    • ReXNet
    • MobileNetV2
    • MobileNeXt
  • Attention Module
    • SE
    • CBAM
    • ECA
    • GCT
  • Loss
    • Softmax
    • SphereFace
    • AMSoftmax
    • CosFace
    • ArcFace
    • Combined Loss
    • AdaCos
    • SV-X-Softmax
    • CurricularFace
    • ArcNegFace
    • Li-Arcface
    • QAMFace
    • Circle Loss
  • Parallel Training
    • DDP
    • Model Parallel
  • Automatic Mixed Precision
    • AMP (see the training-step sketch after this list)
  • Optimizer
    • LRScheduler (fairseq, rwightman)
    • Optim (SGD, Adam, RAdam, LookAhead, Ranger, AdamP, SGDP)
    • ZeRO
  • Data Augmentation
    • RandomErasing
    • Mixup
    • RandAugment
    • Cutout
    • CutMix
    • Colorjitter
  • Distillation
    • KnowledgeDistillation
    • Multi Feature KD
  • Bag of Tricks
    • Label smooth
    • LR warmup
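As referenced in the AMP item above, mixed precision, the listed optimizers and LR schedulers come together in a training step roughly like the following. This is a generic torch.cuda.amp sketch with placeholder names (backbone, head, criterion), not the repo's train.py:

    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()

    def train_step(backbone, head, criterion, optimizer, images, labels):
        optimizer.zero_grad()
        with autocast():                      # forward pass runs in mixed precision
            features = backbone(images)
            logits = head(features, labels)   # margin-based heads also take the labels
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)                # unscales gradients, then optimizer.step()
        scaler.update()
        return loss.item()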

Installation

See INSTALL.md.

Quick start

See GETTING_STARTED.md.

Model Zoo and Benchmark

See MODEL_ZOO.md.

License

cavaface is released under the MIT license.

Acknowledgement

Contact

cavaface's People

Contributors

cavalleria, charrin, dependabot[bot], xsacha


cavaface's Issues

SV-X-Softmax usage error

When training with SV-X-Softmax, the following error occurs:

 File "/home/imagus/dev/cavaface.pytorch/head/metrics.py", line 452, in forward
    if self.xtype == 'MV-AM':
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'SVXSoftmax' object has no attribute 'xtype'
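The traceback indicates that the xtype value is never stored on the module, so forward() cannot read it. A minimal sketch of the kind of fix, assuming the constructor already receives an xtype argument (the signature below is hypothetical):

    import torch.nn as nn

    class SVXSoftmax(nn.Module):
        # Hypothetical signature; the point is only the missing assignment.
        def __init__(self, in_features, out_features, xtype="MV-AM", s=32.0, t=1.2, m=0.35):
            super().__init__()
            self.xtype = xtype   # storing it resolves the AttributeError in forward()
            self.in_features = in_features
            self.out_features = out_features
            self.s, self.t, self.m = s, t, m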

EfficientNet_b1 ACC is always 0

Hi, do you use a pretrained model when training EfficientNet_b1? When I train it without a pretrained model, the top-1 and top-5 accuracy are always 0.

Pre-trained model performance

Hi,
I tested the released GhostNet_x1.3 on LFW, CFP-FP and AgeDB-30. It's weird that the model gets much better results on CFP-FP and AgeDB-30 than on LFW. Have you tried these test sets? What explains the gap?

Thanks.

LFW: Acc 0.850 @ Threshold 0.461
CFP-FP: Acc 0.943 @ Threshold 0.179
AgeDB-30: Acc 0.973 @ Threshold 0.231

ReXnet based results

Hi, big thanks for your great work!
Have you tried ReXNet-based models? If yes, could you share the results? Thanks!

some problem about megaface evaluation

When I evaluate the MobileFaceNet model, the MegaFace evaluation works well. But when I evaluate GhostNet, it reports an error like:

Total Tensors: 4096752 Used Memory: 15.70M
The allocated memory on cuda:0: 77.55M
Memory differs due to the matrix alignment or invisible gradient buffer tensors

min gpu free mem: 8000000000.0 B
min gpu free mem: 8000000000.0 B
min gpu free mem: 102000000 B
min gpu free mem: 162000000 B
Finish loading model /home/vision_rd/face_Recognition/models/GhostNet_Arcface/model/Epoch_24_Time_2020-07-21-13-27_checkpoint.pth, infer with shape: (198, 3, 112, 112)
Loading model time cost: 43.907608 seconds.

Extract on megaface...
Noisy faces of scrub: 605
Noisy faces of gallery: 707
Begin to extract embedding of scrub faces...
Finish Load path of faces: 0/3530
begin thread

Segmentation fault: 11

Stack trace:
[bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7f6276fd8280]
[bt] (1) /lib64/libc.so.6(+0x363b0) [0x7f635a6483b0]
[bt] (2) /lib64/libc.so.6(cfree+0x1c) [0x7f635a697ecc]
[bt] (3) /usr/lib64/python3.6/site-packages/cv2/cv2.cpython-36m-x86_64-linux-gnu.so(+0x4eda39) [0x7f62b277da39]
[bt] (4) /usr/lib64/python3.6/site-packages/cv2/cv2.cpython-36m-x86_64-linux-gnu.so(+0x168cd5) [0x7f62b23f8cd5]
[bt] (5) /lib64/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x147) [0x7f635b3ea167]
[bt] (6) /lib64/libpython3.6m.so.1.0(+0x1507df) [0x7f635b4557df]
[bt] (7) /lib64/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x3a7) [0x7f635b44a0f7]
[bt] (8) /lib64/libpython3.6m.so.1.0(+0x14f987) [0x7f635b454987]
Eval model: /home/vision_rd/yangwenbo/face_Recognition/models/GhostNet_Arcface/model/Epoch_24_Time_2020-07-21-13-27_checkpoint.pth,24, done!
I don't understand why it reports errors when the only thing I changed is the network model (which has already been trained).

some problem about citrus_base_infer.py

When I read the function run() (line 94) in the class readThread(threading.Thread) (line 85) of citrus_base_infer.py, I wonder why there is no normalization before sending the image to the network. As far as I know, there is such an operation in training:
    def run(self):
        global signal_stop
        unfinished = True
        while not signal_stop and unfinished:
            try:
                image_path, outpath = self.in_q.get(timeout=1)
                out_img = []
                img = self.read_func(image_path, self.shape[0]==1)
                if len(img.shape) == 3:
                    img = img[:,:,::-1]  # to rgb
                    img = np.transpose(img, (2,0,1))
                else:
                    img = img[np.newaxis,:]
                attempts = [0,1] if self.is_flip else [0]
Hoping for your reply. Thanks.
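For reference, face-recognition pipelines typically map pixels to roughly [-1, 1] before feeding the network; the exact mean and scale used in this repo's training are an assumption here, but a typical normalization looks like:

    import numpy as np

    def normalize(img_chw: np.ndarray) -> np.ndarray:
        # Map a uint8 CHW image to float32 in roughly [-1, 1] (illustrative constants).
        return (img_chw.astype(np.float32) - 127.5) * 0.0078125  # 0.0078125 == 1/128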

About transfer learning

Hi, I followed your instructions to train a model on the MS1M training set, and I ended up achieving 99+@lfw, 95+@cfp, 95+@agedb. This pretrained model also achieved 97+ on my own test set.

Then I used this pretrained model of mine to do transfer learning on my own training set, which is quite different from MS1M/LFW/CFP/AgeDB.... When I finished the transfer learning, the final model achieved 99+ on my own test set, but only 80+@lfw, 70+@cfp, 70+@agedb.

I wonder if I made any mistakes. Is there any way to improve the performance on my own test set while retaining reasonably good performance on the other test sets?

Thank you for your time.

RGB Image or Gray Scale Image

Thanks for your great work!

Did you train/test all of these on RGB images? I have a task that requires training/testing on grayscale images. Do you think I will suffer performance degradation if I train from scratch with all your default settings? Any suggestions? Thank you!
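If the released backbones expect 3-channel input, one low-effort option is to replicate the grayscale channel instead of changing the first conv layer; whether accuracy suffers is an empirical question. A minimal sketch:

    import numpy as np

    def gray_to_rgb(img_hw: np.ndarray) -> np.ndarray:
        # Stack a single-channel HxW image into HxWx3 so 3-channel models accept it.
        return np.repeat(img_hw[:, :, np.newaxis], 3, axis=2)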

MS1M-RetinaFace Clean List

Hi, you mentioned earlier in issue #40 that you slightly cleaned the MS1M-RetinaFace dataset.
I'm wondering if you could share your clean list? It would be a great help, thank you!

About MegaFace testing

When running

r = requests.post("http://127.0.0.1:%d/eval"%(eval_info["args"].port), data=eval_info).json()

the server returns Response [404], so .json() raises

simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

How can I solve this?
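A 404 means the evaluation service is not answering /eval at that port, so the response body is not JSON at all. A minimal guard that surfaces the HTTP error instead of the confusing JSONDecodeError (illustrative, not the repo's code):

    import requests

    def post_eval(port: int, eval_info: dict) -> dict:
        r = requests.post("http://127.0.0.1:%d/eval" % port, data=eval_info)
        r.raise_for_status()   # raises an HTTPError on 404 instead of failing in .json()
        return r.json()

The underlying cause is usually that the evaluation service is not running or is listening on a different port.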

Error when attempting augmentation

I tried to use RandAugment as well as the RandomErasing augmentation, but whenever I start training I get the error: "EOFError: Ran out of input"

Full output follows:

============================================================
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 113, in _main
    preparation_data = reduction.pickle.load(from_parent)
EOFError: Ran out of input
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "train.py", line 333, in <module>
    main()
  File "train.py", line 58, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, cfg))
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGSEGV

one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 5]], which is output 0 of ClampBackward, is at version 2;

Running 'train.sh' with torch 1.2, when loss.backward() is called, it raises an error like this:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 5]], which is output 0 of ClampBackward, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)

I used torch.autograd.set_detect_anomaly(True) to locate the problem and found that ''target_logit = cos_theta_1[torch.arange(0, embbedings.size(0)), label].view(-1, 1)'' raises the error.

Is this a torch version problem? Thanks.
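One common way around this class of error is to use the out-of-place clamp (or clone the tensor) before indexing, so autograd's saved tensors are not modified. A minimal illustration under the assumption that the head clamps cos_theta in place (names are taken from the error message, not the actual code):

    import torch

    def target_logit_from(cos_theta: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # clamp() returns a new tensor; clamp_() would modify cos_theta in place and can
        # trigger "modified by an inplace operation" during backward.
        cos_theta_1 = cos_theta.clamp(-1.0, 1.0)
        return cos_theta_1[torch.arange(0, cos_theta.size(0)), label].view(-1, 1)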

AttentionNet returns two tensors in inference

When running inference on AttentionNet, I get two tensors back.
The first one is the tensor I am interested in (1 x 512) and the second one is one I am not interested in (7 x 7 x ?).
It should improve performance if this second tensor is not returned in inference mode.

        return out, conv_out

to:

        if train:
            return out, conv_out
        else:
            return out

By the way, I'm very impressed with the performance of this AttentionNet-56!
CFP_FP Acc: 0.9822857142857142, AgeDB Acc: 0.9819999999999999, VGG2_FP Acc: 0.9554
CFP_FP Acc: 0.9831428571428571, AgeDB Acc: 0.9814999999999999, VGG2_FP Acc: 0.9538

About CircleLoss

@cavalleria Hello, I have observed that 'Circle Loss' is listed in the introduction, but not in the code. Is it because of poor performance? Thanks!

about the results reported on model zoo

Thanks for sharing such great work! I have some questions about the results and training details:

  1. Are all the models trained on the same data? Is it ms1m-retina? If not, have you removed duplicates from MegaFace?
  2. I have made an effort to reach 99%+ on MegaFace, but I only get 98.5% for now (training data is ms1m-retina). Is it easy to reach 99% using your scripts with the default settings? Did you use the horizontally flipped image's feature when testing on MegaFace? What are the training batch size and final epoch?
  3. Are all the listed IR-SE-100 models trained from scratch, or fine-tuned from one model? If so, which one?

Looking forward to your reply. Thank you.

Error when using alternative head functions

When changing the head from ArcFace to ArcNegFace or CurricularFace (I didn't try others), I get this:

  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 932, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1535, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

PyTorch: 1.5.0 (could this be the issue?)
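The traceback shows a tuple being handed straight to cross-entropy, which suggests these heads return more than one tensor. A minimal illustration of unpacking before the loss, assuming the first element is the scaled logits:

    def compute_loss(head, criterion, features, labels):
        outputs = head(features, labels)
        # Some heads may return (logits, extra); cross-entropy needs just the logits tensor.
        logits = outputs[0] if isinstance(outputs, tuple) else outputs
        return criterion(logits, labels)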

Can't generate result file when evaluate on megaface

I prepared the data directory according to eval_megaface.py. After the evaluation process finished, I got an error:

Traceback (most recent call last):
  File "/home/ici/.conda/envs/cavaface/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "evaluate_service.py", line 109, in run
    evaluator.parse_results_into_file()
  File "/home/ici/cavaface/cavaface.pytorch/evaluation/eval_megaface.py", line 406, in parse_results_into_file
    result_dict = _load_json_result_file(result_json_file)
  File "/home/ici/cavaface/cavaface.pytorch/evaluation/utils/io.py", line 86, in _load_json_result_file
    with open(json_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'model/megaface_50/cmc_facescrub_megaface_retina_1000000_1.json'

The "cmc_facescrub_megaface_retina_1000000_1.json" file was not generated after evaluation, and there is no result file.

PolyNet does not exist

It appears in the list of options, so Python tries to load it and complains that the class doesn't exist. I can't see a file for it, so I think you forgot to add it.

some problem about the flask version

Hi, could you tell me which Flask version you use? When I run evaluate_server.py it reports:
Traceback (most recent call last):
  File "evaluate_service.py", line 12, in <module>
    import flask
  File "/usr/local/lib64/python3.6/site-packages/flask/__init__.py", line 21, in <module>
    from .app import Flask
  File "/usr/local/lib64/python3.6/site-packages/flask/app.py", line 69, in <module>
    from .wrappers import Request
  File "/usr/local/lib64/python3.6/site-packages/flask/wrappers.py", line 14, in <module>
    from werkzeug.wrappers.json import JSONMixin as _JSONMixin
ModuleNotFoundError: No module named 'werkzeug.wrappers.json'; 'werkzeug.wrappers' is not a package

GPU problems

I have 4 GPUs. When I write GPU = [0,1,2,3] in config.py, it reports errors. But when I write GPU = [0,1], it works well. That's strange.

About the pre-trained model

Thank you for this work. I want to use the pre-trained model of AttentionNet-IRSE-56/92 from the MODEL_ZOO.md for fine-tuning. Where can I get the pre-trained model?

A question about the augmentation method

Hi, the repo is nice work, thanks for sharing.
I want to know whether these augmentation methods are effective,
e.g. RandomErasing/Mixup/RandAugment/Cutout/CutMix?

Circle loss report errors

When I run Circle loss, it reports errors like below. Did you run it successfully?
Traceback (most recent call last):
  File "train.py", line 388, in <module>
    main()
  File "train.py", line 60, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, cfg))
  File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/vision_rd/face_Recognition/cavaface.pytorch_bake/train.py", line 291, in main_worker
    outputs = head(features, labels)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vision_rd/face_Recognition/cavaface.pytorch_bake/head/metrics.py", line 590, in forward
    output = torch.logsumexp(logit_n, dim=1) + torch.logsumexp(logit_p, dim=1)
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Resume network uses more memory than from scratch

I haven't seen this issue before in similar pytorch training scenarios.
I can normally do batch size of 256, but when resuming, I must do 224.

It seems like some memory from loading the resumed model is never freed.

Edit: I resolved the issue by adding in: del(checkpoint)
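For reference, the workaround amounts to dropping the checkpoint dict once its weights have been copied into the model; a minimal sketch with illustrative key names:

    import torch

    def resume(model, optimizer, path):
        checkpoint = torch.load(path, map_location="cpu")   # load to CPU to avoid an extra GPU copy
        model.load_state_dict(checkpoint["backbone"])        # key names are assumptions
        optimizer.load_state_dict(checkpoint["optimizer"])
        del checkpoint                                       # release the reference so memory can be freed
        torch.cuda.empty_cache()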

How to load the pretrained model?

I downloaded the checkpoint from MODEL_ZOO but it fails to load the model.
I load the checkpoint as follows, but it shows the error: ModuleNotFoundError: No module named '__torch__'

BACKBONE.load_state_dict(torch.load('IR_SE_100_Combined_Epoch_24.pth'))
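A "No module named '__torch__'" error typically means the file is a TorchScript archive rather than a plain state dict; whether that applies to the MODEL_ZOO checkpoints is an assumption, but the two loading paths look like this:

    import torch

    # If the file is a TorchScript archive (a scripted/traced model):
    model = torch.jit.load("IR_SE_100_Combined_Epoch_24.pth", map_location="cpu")

    # If it is a plain state dict, build the backbone first, then load the weights:
    # backbone = IR_SE_100(input_size=[112, 112])   # hypothetical constructor
    # backbone.load_state_dict(torch.load("IR_SE_100_Combined_Epoch_24.pth", map_location="cpu"))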

About the Knowledge Distillation

This is great work. Will the code for knowledge distillation and model compression be provided later? Will this project continue to be updated with the latest face recognition techniques? If so, that would be great. Thanks to the author.

About training data

Thanks for your work! I downloaded the ms1m-retinaface training data from the link you shared. I saw that your datasets.py expects separate image files to open, but there are no such files in ms1m-retinaface-t1. I'm confused: how can I use the .rec, .lst and .idx files with your training code?
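The .rec/.idx pair is an MXNet RecordIO dataset, so the images can be decoded on the fly rather than unpacked into separate files. A minimal reading sketch, assuming the insightface packing convention:

    import mxnet as mx

    imgrec = mx.recordio.MXIndexedRecordIO("train.idx", "train.rec", "r")

    record = imgrec.read_idx(1)                   # per-image records typically start at index 1
    header, img_bytes = mx.recordio.unpack(record)
    label = header.label                          # identity label (scalar or small array)
    img = mx.image.imdecode(img_bytes).asnumpy()  # decoded HWC RGB uint8 image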

where is "ms1m-retinaface-t1-clean.txt" file

I downloaded the training dataset as instructed ("For training data, please download the ms1m-retinaface in https://github.com/deepinsight/insightface/tree/master/iccv19-challenge."),
but I could not find ms1m-retinaface-t1-clean.txt. These are the files I got:
-rw-r--r-- 1 shengyang root 73M Apr 9 2019 agedb_30.bin
-rw-r--r-- 1 shengyang root 73M Apr 1 2019 cfp_fp.bin
-rw-r--r-- 1 shengyang root 63M Apr 1 2019 lfw.bin
-rw-r--r-- 1 shengyang root 14 Apr 1 2019 property
-rw-r--r-- 1 shengyang root 98M Apr 1 2019 train.idx
-rw-r--r-- 1 shengyang root 412M Apr 1 2019 train.lst
-rw-r--r-- 1 shengyang root 28G Apr 1 2019 train.rec

How do I prepare the training dataset? Thank you!
