cavalleria / cavaface

Face recognition training project (PyTorch)

License: MIT License

Python 99.85% Shell 0.15%
face-recognition pytorch network loss arcnegface arcface resnest attention-irse apex randaugment

cavaface's Introduction

cavaface: A Pytorch Training Framework for Deep Face Recognition


By Yaobin Li and Liying Chi

Introduction

This repo provides a high-performance distributed parallel training framework for face recognition with PyTorch, including various backbones (e.g., ResNet, IR, IR-SE, ResNeXt, AttentionNet-IR-SE, ResNeSt, HRNet), various losses (e.g., Softmax, Focal, SphereFace, CosFace, AmSoftmax, ArcFace, ArcNegFace, CurricularFace, Li-Arcface, QAMFace), various data augmentations (e.g., RandomErasing, Mixup, RandAugment, Cutout, CutMix), and a bag of tricks for improving performance (e.g., FP16 training with apex, label smoothing, LR warmup).
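As a concrete reference for the margin-based losses named above, here is a minimal ArcFace-style head. It is an illustrative sketch only, not this repo's implementation; the class and argument names are made up:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ArcMarginHead(nn.Module):
        """Minimal additive-angular-margin (ArcFace-style) head, for illustration only."""
        def __init__(self, embedding_size, num_classes, s=64.0, m=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(num_classes, embedding_size))
            nn.init.xavier_uniform_(self.weight)
            self.s, self.m = s, m

        def forward(self, embeddings, labels):
            # cosine similarity between L2-normalized embeddings and class centers
            cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
            theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
            # add the angular margin m only to the target-class logit
            target = F.one_hot(labels, cosine.size(1)).bool()
            logits = torch.where(target, torch.cos(theta + self.m), cosine)
            return logits * self.s  # scaled logits, fed to cross-entropy

Cross-entropy over these scaled logits gives the ArcFace loss; the other heads in the feature list (CosFace, CurricularFace, ArcNegFace, ...) differ mainly in how the margin is applied to the target logit.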

Features

  • Backbone
    • ResNet(IR-SE)
    • ResNeXt
    • DenseNet
    • MobileFaceNet
    • MobileNetV3
    • EfficientNet
    • ProxylessNas
    • GhostNet
    • AttentionNet-IRSE
    • ResNeSt
    • ReXNet
    • MobileNetV2
    • MobileNeXt
  • Attention Module
    • SE
    • CBAM
    • ECA
    • GCT
  • Loss
    • Softmax
    • SphereFace
    • AMSoftmax
    • CosFace
    • ArcFace
    • Combined Loss
    • AdaCos
    • SV-X-Softmax
    • CurricularFace
    • ArcNegFace
    • Li-Arcface
    • QAMFace
    • Circle Loss
  • Parallel Training
    • DDP
    • Model Parallel
  • Automatic Mixed Precision
    • AMP (see the training-step sketch after this list)
  • Optimizer
    • LRScheduler (fairseq, rwightman)
    • Optim (SGD, Adam, RAdam, LookAhead, Ranger, AdamP, SGDP)
    • ZeRO
  • Data Augmentation
    • RandomErasing
    • Mixup
    • RandAugment
    • Cutout
    • CutMix
    • Colorjitter
  • Distillation
    • KnowledgeDistillation
    • Multi Feature KD
  • Bag of Tricks
    • Label smooth
    • LR warmup
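As referenced in the AMP item above, mixed precision, the listed optimizers and LR schedulers come together in a training step roughly like the following. This is a generic torch.cuda.amp sketch with placeholder names (backbone, head, criterion), not the repo's train.py:

    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()

    def train_step(backbone, head, criterion, optimizer, images, labels):
        optimizer.zero_grad()
        with autocast():                      # forward pass runs in mixed precision
            features = backbone(images)
            logits = head(features, labels)   # margin-based heads also take the labels
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)                # unscales gradients, then optimizer.step()
        scaler.update()
        return loss.item()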

Installation

See INSTALL.md.

Quick start

See GETTING_STARTED.md.

Model Zoo and Benchmark

See MODEL_ZOO.md.

License

cavaface is released under the MIT license.

Acknowledgement

Contact

cavaface's People

Contributors

cavalleria, charrin, dependabot[bot], xsacha


cavaface's Issues

SV-X-Softmax usage error

When training with SV-X-Softmax, the following error occurs:

 File "/home/imagus/dev/cavaface.pytorch/head/metrics.py", line 452, in forward
    if self.xtype == 'MV-AM':
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'SVXSoftmax' object has no attribute 'xtype'
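The traceback indicates that the xtype value is never stored on the module, so forward() cannot read it. A minimal sketch of the kind of fix, assuming the constructor already receives an xtype argument (the signature below is hypothetical):

    import torch.nn as nn

    class SVXSoftmax(nn.Module):
        # Hypothetical signature; the point is only the missing assignment.
        def __init__(self, in_features, out_features, xtype="MV-AM", s=32.0, t=1.2, m=0.35):
            super().__init__()
            self.xtype = xtype   # storing it resolves the AttributeError in forward()
            self.in_features = in_features
            self.out_features = out_features
            self.s, self.t, self.m = s, t, m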

EfficientNet_b1 ACC is always 0

Hi, do you use a pretrained model when training EfficientNet_b1? When I train it without a pretrained model, the top-1 and top-5 accuracy are always 0.

Pre-trained model performance

Hi,
I tested the released GhostNet_x1.3 on LFW, CFP-FP and AgeDB-30. It's weird that the model gets much better results on CFP-FP and AgeDB-30 than on LFW. Have you tried these test sets? What explains the gap?

Thanks.

LFW: Acc 0.850 @ Threshold 0.461
CFP-FP: Acc 0.943 @ Threshold 0.179
AgeDB-30: Acc 0.973 @ Threshold 0.231

ReXnet based results

Hi, big thanks for your great work!
Have you tried ReXNet-based models? If yes, could you share the results? Thanks!

some problem about megaface evaluation

When I evaluate the MobileFaceNet model, the MegaFace evaluation works well. But when I evaluate GhostNet, it reports an error like:

Total Tensors: 4096752 Used Memory: 15.70M
The allocated memory on cuda:0: 77.55M
Memory differs due to the matrix alignment or invisible gradient buffer tensors

min gpu free mem: 8000000000.0 B
min gpu free mem: 8000000000.0 B
min gpu free mem: 102000000 B
min gpu free mem: 162000000 B
Finish loading model /home/vision_rd/face_Recognition/models/GhostNet_Arcface/model/Epoch_24_Time_2020-07-21-13-27_checkpoint.pth, infer with shape: (198, 3, 112, 112)
Loading model time cost: 43.907608 seconds.

Extract on megaface...
Noisy faces of scrub: 605
Noisy faces of gallery: 707
Begin to extract embedding of scrub faces...
Finish Load path of faces: 0/3530
begin thread

Segmentation fault: 11

Stack trace:
[bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7f6276fd8280]
[bt] (1) /lib64/libc.so.6(+0x363b0) [0x7f635a6483b0]
[bt] (2) /lib64/libc.so.6(cfree+0x1c) [0x7f635a697ecc]
[bt] (3) /usr/lib64/python3.6/site-packages/cv2/cv2.cpython-36m-x86_64-linux-gnu.so(+0x4eda39) [0x7f62b277da39]
[bt] (4) /usr/lib64/python3.6/site-packages/cv2/cv2.cpython-36m-x86_64-linux-gnu.so(+0x168cd5) [0x7f62b23f8cd5]
[bt] (5) /lib64/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x147) [0x7f635b3ea167]
[bt] (6) /lib64/libpython3.6m.so.1.0(+0x1507df) [0x7f635b4557df]
[bt] (7) /lib64/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x3a7) [0x7f635b44a0f7]
[bt] (8) /lib64/libpython3.6m.so.1.0(+0x14f987) [0x7f635b454987]
Eval model: /home/vision_rd/yangwenbo/face_Recognition/models/GhostNet_Arcface/model/Epoch_24_Time_2020-07-21-13-27_checkpoint.pth,24, done!
I don't understand why it reports errors when the only thing I changed is the network model (which has already been trained).

some problem about citrus_base_infer.py

When I read the function run() (line 94) in the class readThread(threading.Thread) (line 85) of citrus_base_infer.py, I wonder why there is no normalization before sending the image to the network. As far as I know, there is such an operation in training:
    def run(self):
        global signal_stop
        unfinished = True
        while not signal_stop and unfinished:
            try:
                image_path, outpath = self.in_q.get(timeout=1)
                out_img = []
                img = self.read_func(image_path, self.shape[0]==1)
                if len(img.shape) == 3:
                    img = img[:,:,::-1]  # to rgb
                    img = np.transpose(img, (2,0,1))
                else:
                    img = img[np.newaxis,:]
                attempts = [0,1] if self.is_flip else [0]
Hoping for your reply. Thanks.
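For reference, face-recognition pipelines typically map pixels to roughly [-1, 1] before feeding the network; the exact mean and scale used in this repo's training are an assumption here, but a typical normalization looks like:

    import numpy as np

    def normalize(img_chw: np.ndarray) -> np.ndarray:
        # Map a uint8 CHW image to float32 in roughly [-1, 1] (illustrative constants).
        return (img_chw.astype(np.float32) - 127.5) * 0.0078125  # 0.0078125 == 1/128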

About transfer learning

Hi, I followed your instructions to train a model on the MS1M training set, and I ended up achieving 99+@lfw, 95+@cfp, 95+@agedb. This pretrained model also achieved 97+ on my own test set.

Then I used this pretrained model of mine to do transfer learning on my own training set, which is quite different from MS1M/LFW/CFP/AgeDB.... When I finished the transfer learning, the final model achieved 99+ on my own test set, but only 80+@lfw, 70+@cfp, 70+@agedb.

I wonder if I made any mistakes. Is there any way to improve the performance on my own test set while retaining reasonably good performance on the other test sets?

Thank you for your time.

RGB Image or Gray Scale Image

Thanks for your great work!

Did you train/test all of these on RGB images? I have a task that requires training/testing on grayscale images. Do you think I will suffer performance degradation if I train from scratch with all your default settings? Any suggestions? Thank you!
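If the released backbones expect 3-channel input, one low-effort option is to replicate the grayscale channel instead of changing the first conv layer; whether accuracy suffers is an empirical question. A minimal sketch:

    import numpy as np

    def gray_to_rgb(img_hw: np.ndarray) -> np.ndarray:
        # Stack a single-channel HxW image into HxWx3 so 3-channel models accept it.
        return np.repeat(img_hw[:, :, np.newaxis], 3, axis=2)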

MS1M-RetinaFace Clean List

Hi, you mentioned earlier in issue #40 that you slightly cleaned the MS1M-RetinaFace dataset.
I'm wondering if you could share your clean list? It would be a great help, thank you!

About MegaFace testing

When running

r = requests.post("http://127.0.0.1:%d/eval"%(eval_info["args"].port), data=eval_info).json()

the server returns Response [404], so .json() raises

simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

How can I solve this?
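A 404 means the evaluation service is not answering /eval at that port, so the response body is not JSON at all. A minimal guard that surfaces the HTTP error instead of the confusing JSONDecodeError (illustrative, not the repo's code):

    import requests

    def post_eval(port: int, eval_info: dict) -> dict:
        r = requests.post("http://127.0.0.1:%d/eval" % port, data=eval_info)
        r.raise_for_status()   # raises an HTTPError on 404 instead of failing in .json()
        return r.json()

The underlying cause is usually that the evaluation service is not running or is listening on a different port.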

Error when attempting augmentation

I tried to use RandAugment as well as the RandomErasing augmentation, but whenever I start training I get the error: "EOFError: Ran out of input"

Full output follows:

============================================================
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 113, in _main
    preparation_data = reduction.pickle.load(from_parent)
EOFError: Ran out of input
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "train.py", line 333, in <module>
    main()
  File "train.py", line 58, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, cfg))
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGSEGV

one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 5]], which is output 0 of ClampBackward, is at version 2;

Running 'train.sh' with torch 1.2, when loss.backward() is called, it raises an error like this:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 5]], which is output 0 of ClampBackward, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)

I used torch.autograd.set_detect_anomaly(True) to locate the problem and found that ''target_logit = cos_theta_1[torch.arange(0, embbedings.size(0)), label].view(-1, 1)'' raises the error.

Is this a torch version problem? Thanks.
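One common way around this class of error is to use the out-of-place clamp (or clone the tensor) before indexing, so autograd's saved tensors are not modified. A minimal illustration under the assumption that the head clamps cos_theta in place (names are taken from the error message, not the actual code):

    import torch

    def target_logit_from(cos_theta: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # clamp() returns a new tensor; clamp_() would modify cos_theta in place and can
        # trigger "modified by an inplace operation" during backward.
        cos_theta_1 = cos_theta.clamp(-1.0, 1.0)
        return cos_theta_1[torch.arange(0, cos_theta.size(0)), label].view(-1, 1)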

AttentionNet returns two tensors in inference

When running inference on AttentionNet, I get two tensors back.
The first one is the tensor I am interested in (1 x 512) and the second one is one I am not interested in (7 x 7 x ?).
It should improve performance if this second tensor is not returned in inference mode.

        return out, conv_out

to:

        if train:
            return out, conv_out
        else:
            return out

By the way, I'm very impressed with the performance of this AttentionNet-56!
CFP_FP Acc: 0.9822857142857142, AgeDB Acc: 0.9819999999999999, VGG2_FP Acc: 0.9554
CFP_FP Acc: 0.9831428571428571, AgeDB Acc: 0.9814999999999999, VGG2_FP Acc: 0.9538

About CircleLoss

@cavalleria Hello, I have observed that 'Circle Loss' is listed in the introduction, but not in the code. Is it because of poor performance? Thanks!

about the results reported on model zoo

Thanks for sharing such great work! I have some questions about the results and training details:

  1. Are all the models trained on the same data? Is it ms1m-retina? If not, have you removed duplicates from MegaFace?
  2. I have made an effort to reach 99%+ on MegaFace, but I only get 98.5% for now (training data is ms1m-retina). Is it easy to reach 99% using your scripts with the default settings? Did you use the horizontally flipped image's feature when testing on MegaFace? What are the training batch size and final epoch?
  3. Are all the listed IR-SE-100 models trained from scratch, or fine-tuned from one model? If so, which one?

Looking forward to your reply. Thank you.

Error when using alternative head functions

When changing the head from ArcFace to ArcNegFace or CurricularFace (I didn't try others), I get this:

  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 932, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/imagus/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1535, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

PyTorch: 1.5.0 (could this be the issue?)
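The traceback shows a tuple being handed straight to cross-entropy, which suggests these heads return more than one tensor. A minimal illustration of unpacking before the loss, assuming the first element is the scaled logits:

    def compute_loss(head, criterion, features, labels):
        outputs = head(features, labels)
        # Some heads may return (logits, extra); cross-entropy needs just the logits tensor.
        logits = outputs[0] if isinstance(outputs, tuple) else outputs
        return criterion(logits, labels)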

Can't generate result file when evaluate on megaface

I prepared the data directory according to eval_megaface.py. After the evaluation process finished, I got an error:

Traceback (most recent call last):
  File "/home/ici/.conda/envs/cavaface/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "evaluate_service.py", line 109, in run
    evaluator.parse_results_into_file()
  File "/home/ici/cavaface/cavaface.pytorch/evaluation/eval_megaface.py", line 406, in parse_results_into_file
    result_dict = _load_json_result_file(result_json_file)
  File "/home/ici/cavaface/cavaface.pytorch/evaluation/utils/io.py", line 86, in _load_json_result_file
    with open(json_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'model/megaface_50/cmc_facescrub_megaface_retina_1000000_1.json'

The "cmc_facescrub_megaface_retina_1000000_1.json" file was not generated after evaluation, and there is no result file.

PolyNet does not exist

It appears in the list of options, so Python tries to load it and complains that the class doesn't exist. I can't see a file for it, so I think you forgot to add it.

some problem about the flask version

Hi, could you tell me which Flask version you use? When I run evaluate_server.py it reports:
Traceback (most recent call last):
  File "evaluate_service.py", line 12, in <module>
    import flask
  File "/usr/local/lib64/python3.6/site-packages/flask/__init__.py", line 21, in <module>
    from .app import Flask
  File "/usr/local/lib64/python3.6/site-packages/flask/app.py", line 69, in <module>
    from .wrappers import Request
  File "/usr/local/lib64/python3.6/site-packages/flask/wrappers.py", line 14, in <module>
    from werkzeug.wrappers.json import JSONMixin as _JSONMixin
ModuleNotFoundError: No module named 'werkzeug.wrappers.json'; 'werkzeug.wrappers' is not a package

GPU problems

I have 4 GPUs. When I write GPU = [0,1,2,3] in config.py, it reports errors. But when I write GPU = [0,1], it works well. That's strange.

About the pre-trained model

Thank you for this work. I want to use the pre-trained model of AttentionNet-IRSE-56/92 from the MODEL_ZOO.md for fine-tuning. Where can I get the pre-trained model?

A question about the augmentation method

Hi, the repo is nice work, thanks for sharing.
I want to know whether these augmentation methods are effective,
e.g. RandomErasing/Mixup/RandAugment/Cutout/CutMix?

Circle loss report errors

When I run Circle loss, it reports errors like below. Did you run it successfully?
Traceback (most recent call last):
  File "train.py", line 388, in <module>
    main()
  File "train.py", line 60, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, cfg))
  File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/vision_rd/face_Recognition/cavaface.pytorch_bake/train.py", line 291, in main_worker
    outputs = head(features, labels)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vision_rd/face_Recognition/cavaface.pytorch_bake/head/metrics.py", line 590, in forward
    output = torch.logsumexp(logit_n, dim=1) + torch.logsumexp(logit_p, dim=1)
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Resume network uses more memory than from scratch

I haven't seen this issue before in similar pytorch training scenarios.
I can normally do batch size of 256, but when resuming, I must do 224.

It seems like some memory from loading the resumed model is never freed.

Edit: I resolved the issue by adding in: del(checkpoint)
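For reference, the workaround amounts to dropping the checkpoint dict once its weights have been copied into the model; a minimal sketch with illustrative key names:

    import torch

    def resume(model, optimizer, path):
        checkpoint = torch.load(path, map_location="cpu")   # load to CPU to avoid an extra GPU copy
        model.load_state_dict(checkpoint["backbone"])        # key names are assumptions
        optimizer.load_state_dict(checkpoint["optimizer"])
        del checkpoint                                       # release the reference so memory can be freed
        torch.cuda.empty_cache()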

How to load the pretrained model?

I downloaded the checkpoint from MODEL_ZOO but it fails to load the model.
I load the checkpoint as follows, but it shows the error: ModuleNotFoundError: No module named '__torch__'

BACKBONE.load_state_dict(torch.load('IR_SE_100_Combined_Epoch_24.pth'))
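A "No module named '__torch__'" error typically means the file is a TorchScript archive rather than a plain state dict; whether that applies to the MODEL_ZOO checkpoints is an assumption, but the two loading paths look like this:

    import torch

    # If the file is a TorchScript archive (a scripted/traced model):
    model = torch.jit.load("IR_SE_100_Combined_Epoch_24.pth", map_location="cpu")

    # If it is a plain state dict, build the backbone first, then load the weights:
    # backbone = IR_SE_100(input_size=[112, 112])   # hypothetical constructor
    # backbone.load_state_dict(torch.load("IR_SE_100_Combined_Epoch_24.pth", map_location="cpu"))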

About the Knowledge Distillation

This is great work. Will the code for knowledge distillation and model compression be provided later? Will this project continue to be updated with the latest face recognition techniques? If so, that would be great. Thanks to the author.

About training data

Thanks for your work! I downloaded the ms1m-retinaface training data from the link you shared. I saw that your datasets.py expects separate image files to open, but there are no such files in ms1m-retinaface-t1. I'm confused: how can I use the .rec, .lst and .idx files with your training code?
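The .rec/.idx pair is an MXNet RecordIO dataset, so the images can be decoded on the fly rather than unpacked into separate files. A minimal reading sketch, assuming the insightface packing convention:

    import mxnet as mx

    imgrec = mx.recordio.MXIndexedRecordIO("train.idx", "train.rec", "r")

    record = imgrec.read_idx(1)                   # per-image records typically start at index 1
    header, img_bytes = mx.recordio.unpack(record)
    label = header.label                          # identity label (scalar or small array)
    img = mx.image.imdecode(img_bytes).asnumpy()  # decoded HWC RGB uint8 image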

where is "ms1m-retinaface-t1-clean.txt" file

I downloaded the training dataset as instructed ("For training data, please download the ms1m-retinaface in https://github.com/deepinsight/insightface/tree/master/iccv19-challenge."),
but I could not find ms1m-retinaface-t1-clean.txt. These are the files I got:
-rw-r--r-- 1 shengyang root 73M Apr 9 2019 agedb_30.bin
-rw-r--r-- 1 shengyang root 73M Apr 1 2019 cfp_fp.bin
-rw-r--r-- 1 shengyang root 63M Apr 1 2019 lfw.bin
-rw-r--r-- 1 shengyang root 14 Apr 1 2019 property
-rw-r--r-- 1 shengyang root 98M Apr 1 2019 train.idx
-rw-r--r-- 1 shengyang root 412M Apr 1 2019 train.lst
-rw-r--r-- 1 shengyang root 28G Apr 1 2019 train.rec

How do I prepare the training dataset? Thank you!
