
kakaobrain / fast-autoaugment

1.6K stars · 41 watchers · 196 forks · 1.49 MB

Official Implementation of 'Fast AutoAugment' in PyTorch.

License: MIT License

Python 100.00%
deep-learning convolutional-neural-networks pytorch augmentation image-classification computer-vision distributed cnn automl automated-machine-learning

fast-autoaugment's Introduction

Fast AutoAugment (Accepted at NeurIPS 2019)

Official Fast AutoAugment implementation in PyTorch.

  • Fast AutoAugment learns augmentation policies using a more efficient search strategy based on density matching.
  • Fast AutoAugment speeds up the search time by orders of magnitude while maintaining comparable performance.
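To make the density-matching idea concrete, here is a minimal sketch (not the repository's actual search loop, which distributes trials over Ray with HyperOpt): a model trained on one split D_M scores each candidate policy by the loss it assigns to the other split D_A after augmentation; low loss means the augmented data matches the distribution the model learned, so no per-candidate retraining is needed.

import random

def evaluate_policy(loss_fn, d_a, policy, n_samples=256):
    # Score one candidate policy: average per-example loss of the trained
    # model on policy-augmented D_A (lower = better density match).
    batch = random.sample(d_a, min(n_samples, len(d_a)))
    total = 0.0
    for x, y in batch:
        for op, prob, magnitude in policy:   # a policy is a list of (op, prob, magnitude)
            if random.random() < prob:
                x = op(x, magnitude)
        total += loss_fn(x, y)               # loss from the model trained on D_M
    return total / len(batch)

def search_policies(loss_fn, d_a, candidates, top_n=10):
    # Keep the top-N candidates by lowest augmented-validation loss.
    return sorted(candidates, key=lambda p: evaluate_policy(loss_fn, d_a, p))[:top_n]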

Results

CIFAR-10 / 100

Search : 3.5 GPU Hours (1428x faster than AutoAugment), WResNet-40x2 on Reduced CIFAR-10

Model (CIFAR-10, error %)  | Baseline | Cutout | AutoAugment | Fast AutoAugment (transfer/direct)
Wide-ResNet-40-2           | 5.3  | 4.1  | 3.7  | 3.6 / 3.7   | Download
Wide-ResNet-28-10          | 3.9  | 3.1  | 2.6  | 2.7 / 2.7   | Download
Shake-Shake(26 2x32d)      | 3.6  | 3.0  | 2.5  | 2.7 / 2.5   | Download
Shake-Shake(26 2x96d)      | 2.9  | 2.6  | 2.0  | 2.0 / 2.0   | Download
Shake-Shake(26 2x112d)     | 2.8  | 2.6  | 1.9  | 2.0 / 1.9   | Download
PyramidNet+ShakeDrop       | 2.7  | 2.3  | 1.5  | 1.8 / 1.7   | Download

Model (CIFAR-100, error %) | Baseline | Cutout | AutoAugment | Fast AutoAugment (transfer/direct)
Wide-ResNet-40-2           | 26.0 | 25.2 | 20.7 | 20.7 / 20.6 | Download
Wide-ResNet-28-10          | 18.8 | 18.4 | 17.1 | 17.3 / 17.3 | Download
Shake-Shake(26 2x96d)      | 17.1 | 16.0 | 14.3 | 14.9 / 14.6 | Download
PyramidNet+ShakeDrop       | 14.0 | 12.2 | 10.7 | 11.9 / 11.7 | Download

ImageNet

Search : 450 GPU Hours (33x faster than AutoAugment), ResNet-50 on Reduced ImageNet

Model      | Baseline (Top1/Top5 error %) | AutoAugment | Fast AutoAugment
ResNet-50  | 23.7 / 6.9 | 22.4 / 6.2 | 22.4 / 6.3 | Download
ResNet-200 | 21.5 / 5.8 | 20.0 / 5.0 | 19.4 / 4.7 | Download

Notes

  • We evaluated ResNet-50 and ResNet-200 at resolutions of 224 and 320, respectively. According to the original ResNet paper, ResNet-200 was tested at a resolution of 320, and our ResNet-200 baseline performed similarly when we used that resolution.
  • However, after our recent code clean-up and bug fixes, we found that the baseline performs similarly even at 224x224.
  • At 224x224, ResNet-200 reaches 20.0 / 5.2. A download link for the trained model is here.

We have conducted additional experiments with EfficientNet.

Model (top-1 error %) | Baseline | AutoAugment | Our Baseline (Batch) | +Fast AA
B0                    | 23.2     | 22.7        | 22.96                | 22.68

SVHN Test

Search : 1.5 GPU Hours

Model (error %)    | Baseline | AutoAug / Our | Fast AutoAugment
Wide-ResNet-28-10  | 1.5      | 1.1           | 1.1

Run

We conducted experiments under

  • python 3.6.9
  • pytorch 1.2.0, torchvision 0.4.0, cuda10

Search an augmentation policy

Please read Ray's documentation to construct a proper Ray cluster (https://github.com/ray-project/ray), then run search.py with the master's Redis address.

$ python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot ... --redis ...

Train a model with found policies

You can train network architectures on CIFAR-10 / 100 and ImageNet with our searched policies.

  • fa_reduced_cifar10 : reduced CIFAR-10 (4k images), WResNet-40x2
  • fa_reduced_imagenet : reduced ImageNet (50k images, 120 classes), ResNet-50
$ export PYTHONPATH=$PYTHONPATH:$PWD
$ python FastAutoAugment/train.py -c confs/wresnet40x2_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar10
$ python FastAutoAugment/train.py -c confs/wresnet40x2_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar100
$ python FastAutoAugment/train.py -c confs/wresnet28x10_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar10
$ python FastAutoAugment/train.py -c confs/wresnet28x10_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar100
...
$ python FastAutoAugment/train.py -c confs/resnet50_b512.yaml --aug fa_reduced_imagenet
$ python FastAutoAugment/train.py -c confs/resnet200_b512.yaml --aug fa_reduced_imagenet

By adding the --only-eval and --save arguments, you can evaluate trained models without training.
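For example, evaluating a downloaded CIFAR-10 checkpoint might look like this (the checkpoint path is illustrative):

$ python FastAutoAugment/train.py -c confs/wresnet40x2_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar10 --only-eval --save ./models/cifar10_wresnet40x2_top1.pth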

If you want to train with multiple GPUs/nodes, use torch.distributed.launch, for example:

$ python -m torch.distributed.launch --nproc_per_node={num_gpu_per_node} --nnodes={num_node} --master_addr={master} --master_port={master_port} --node_rank={0,1,2,...,num_node} FastAutoAugment/train.py -c confs/efficientnet_b4.yaml --aug fa_reduced_imagenet
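For instance, a single-node run on 8 GPUs might look like this (all values are illustrative):

$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --master_addr=127.0.0.1 --master_port=23456 --node_rank=0 FastAutoAugment/train.py -c confs/efficientnet_b4.yaml --aug fa_reduced_imagenet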

Citation

If you use this code in your research, please cite our paper.

@inproceedings{lim2019fast,
  title={Fast AutoAugment},
  author={Lim, Sungbin and Kim, Ildoo and Kim, Taesup and Kim, Chiheon and Kim, Sungwoong},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2019}
}

Contact for Issues

References & Opensources

We increase the batch size and adapt the learning rate accordingly to speed up training. Otherwise, we set the hyperparameters equal to AutoAugment's where possible. For unknown hyperparameters, we follow values from the original references or tune them to match baseline performances.
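As a sketch of that scaling convention (assuming the common linear scaling rule; the actual base values live in the yaml configs):

base_lr, base_batch = 0.1, 256        # assumed reference values, not taken from the configs
batch = 512
lr = base_lr * batch / base_batch     # -> 0.2: the learning rate grows linearly with batch size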

fast-autoaugment's People

Contributors

ildoonet, sublee, zsef123


fast-autoaugment's Issues

Package versions

Having prepared environment.yml with:

name: fast-aa
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.6.9
  - pytorch=1.2.0
  - torchvision=0.4.0
  - cudatoolkit=10
  - pip
  - pip:
      - git+https://github.com/wbaek/theconf@de32022f8c0651a043dc812d17194cdfd62066e8
      - git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git@08f7d5e
      - git+https://github.com/ildoonet/pystopwatch2.git
      - git+https://github.com/hyperopt/hyperopt.git
      - pretrainedmodels
      - gorilla
      - tabulate
      - pandas
      - tqdm
      - tensorboardx
      - sklearn
      - ray
      - psutil
      - setproctitle
      - requests

and using search.py with:

python FastAutoAugment/search.py -c confs/wresnet40x2_cifar.yaml

I keep receiving warnings or errors about wrong or missing packages, e.g.

HyperOptSearch  DeprecationWarning: This class has been moved.

Could you share validated package versions?

Some questions about the paper and code

Hi~ I have some questions about the paper and code:

Questions about the code

  1. What does "tta" mean in the function eval_tta() (in search.py)?
  2. Why is for _ in range(1): used in some places, like search.py and class Augmentation in data.py?

Questions about the algorithm

  1. It seems that the CIFAR-10 dataset does not have an official validation set, so cross-validation is often used. If a dataset already has its own validation set, can we just use the training set as $D_M$ (defined in the paper) and the validation set as $D_A$ directly? If so, can we search policies without cross-validation? (A sketch of such a split appears after this list.)
  2. Section 3.2.1 of the paper says:

our goal is to improve the generalization ability by searching the augmentation policies that match the density of $D_{train}$ with density of augmented $D_{valid}$

However, the FAA algorithm just seems to fit the trained model. The algorithm may only pick augmented data on which the model easily scores well, and this 'easy' augmented data may not match the training data. Is there any theoretical guarantee that this algorithm works?
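For reference, a minimal sketch of how such stratified D_M / D_A splits are typically produced (assumed, not the repository's exact code; the 0.4 test ratio matches the cv_ratio seen in logs elsewhere on this page):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

labels = np.random.randint(0, 10, size=50000)    # toy CIFAR-10-like labels
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=0)
for fold, (d_m_idx, d_a_idx) in enumerate(sss.split(np.zeros(len(labels)), labels)):
    # D_M trains the child model; D_A is augmented and scored by that model.
    print('fold %d: |D_M|=%d |D_A|=%d' % (fold, len(d_m_idx), len(d_a_idx)))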

Inconsistency (maybe) between code and paper

  1. Section 3.1 of the paper says:

$\mathcal{T}$ indicates a set of augmented images of dataset D transformed by every sub-policies $\tau \in \mathcal{T}$

However, in class Augmentation in data.py, policy = random.choice(self.policies) is used, so only one of the five policies is applied while searching test-time augmentation policies. policy in the code is the same as a sub-policy, right? But this method is actually used in AutoAugment, not FAA?
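For context, here is a minimal sketch (assumed, simplified) of the sampling being asked about:

import random

class Augmentation:
    def __init__(self, policies):
        self.policies = policies                  # list of sub-policies

    def __call__(self, img):
        policy = random.choice(self.policies)     # one sub-policy sampled per image
        for op, prob, magnitude in policy:        # each op applied with its own probability
            if random.random() < prob:
                img = op(img, magnitude)
        return img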

  2. It seems that you choose the best top-N policies from every fold, according to figure 2 and the code:
    image
    So I think lines 7 and 8 should be in the first loop, not the second?
    image

Thanks very much if you can offer some help!

NameError: name 'args' is not defined

Hi, I've tried to run the code for searching policies, but there is no way I can make it run on a single machine with several GPUs; the problem seems to be with Ray. I do initialize the Ray server correctly, but apparently the trouble is with the train_and_eval function.

Any suggestions about what I could be doing wrong / how to work around this?

2020-08-15 10:18:09,387 ERROR worker.py:1717 -- Possible unhandled error from worker: ray_worker (pid=13478, host=macaron)
  File "FastAutoAugment/search.py", line 66, in train_model
    result = train_and_eval(None, dataroot, cv_ratio_test, cv_fold, save_path=save_path, only_eval=skip_exist)
  File "/home/gim282/data_augmentation/good/FastAutoAugment/train.py", line 123, in train_and_eval
    add_filehandler(logger, args.save + '.log')
NameError: name 'args' is not defined

requirement 'ascii' codec can't decode byte 0xec

Hello!
I use Python 3.6.5.
I install with pip install git+https://github.com/wbaek/theconf.git
I get:
Collecting git+https://github.com/wbaek/theconf.git
  Cloning https://github.com/wbaek/theconf.git to /tmp/pip-req-build-pun46x94
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "", line 1, in
    File "/tmp/pip-req-build-pun46x94/setup.py", line 19, in
      long_description = fp.read()
    File "/opt/conda/lib/python3.6/encodings/ascii.py", line 26, in decode
      return codecs.ascii_decode(input, self.errors)[0]
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 449: ordinal not in range(128)

Parameter settings differ from those reported in the paper

@ildoonet , I refer to the paper.

  1. At the search stage, #epochs = 90 in the paper, while in your code it is 270. If we set #epochs = 90, how should we set the lr policy?
  2. At the search stage, #training samples = 6000 in the paper, while in your code, for reduced_imagenet, the #training samples seems to be almost 50000.

Can you explain this? And which one should I follow?

Many thanks.

Backpropagate through PILLOW operations

Dear authors! I have a question about back-propagation through operations of the PILLOW library. How can the probability parameter p be passed through PIL operations and learned?

Questions about search and retrain.

@ildoonet @sublee Hi, it is so nice of you guys to release the search code; it is much appreciated. While running your search code and retraining with the found policies, I still hit some problems, and I hope you can help me figure them out.

  1. There seem to be some bugs in the search code after the policy search; it looks like the policy cannot be fed into the final retrain stage?
  2. About your search process: it looks like during each trial you select 5 sub-policies containing 2 operations each, while in eval_tta you sample 5 val_loaders by randomly selecting 1 sub-policy from those 5 and report the average accuracy. I'm wondering what the intuition behind this is?
  3. After running the search with your code, we get a policy list with around 750 sub-policies. I'm wondering how many sub-policies you used to get the results reported in the paper.
  4. I've retrained the model with randomly selected policies, probabilities, and magnitudes many times, and I consistently get error rates comparable to those trained with the best policies. This confuses me, so I'd very much like to hear your opinions.

Looking forward to your reply.
Thanks again.

IMAGENET url not found

In imagenet.py, ImageNet files are fetched from a pre-defined URL. But when I try to run imagenet.py, the pre-defined URL is not found.
Any suggestions for a working ImageNet URL?
Thank you!

Question about Sample Pairing operation

I did not find the Sample Pairing operation listed in the policies found on CIFAR. From FastAutoAugment/augmentations.py, I also noticed that the related code for the Sample Pairing operation is commented out. Does this mean that this operation was not included in the search space in the experiments? I would be grateful for any suggestions.

Questions about the search process

Hi, @ildoonet

Thanks for your great work; it has inspired me a lot. Recently, I have been trying to reproduce the search results. When I use the ray.tune HyperOptSearch method as the search method, I cannot get higher accuracy after augmenting the validation data compared with no augmentation. However, as mentioned in your paper, the results should be better.
image

Is this phenomenon normal? If not, how do you use the ray package to implement the search process?

Looking forward to your reply.

Thanks.

I want to repeat your algorithm to search policies but don't understand some points in the paper.

I am building this algorithm for a voice anti-spoofing problem.

Please answer the following questions, because I could not work them out from your paper (Fast AutoAugment):

  1. Do you choose only 120 classes from ImageNet (120 * 1000 = 120,000 pictures) and from these randomly choose 6,000 samples? Do you train the whole algorithm on only 6,000 pictures? And what is the ratio D_M:D_A in StratifiedShuffleSplit (the test_size parameter)?
  2. Is training continuous at each step of the algorithm, or does each iteration start from a randomly initialized model? Are the 90 epochs (in your paper, for ResNet on ImageNet) per training run on each D_M set?
  3. From each fold you get the best 10 policies (each containing 5 sub-policies), so after each iteration of the algorithm you have 50 policies? That is a lot! How do you choose the best of these 50, and how do you apply this many policies in training? Do you choose among them at random?
  4. And how do you choose the best policies at the end of training?

I hope I described it clearly.
With love, Makarov Rostislav.

Using your code I couldn't achieve the accuracy you report.

I trained ImageNet using 32 GPUs via Horovod (8 V100 * 4) but got 77.1% accuracy, which is much less than the 78.6% reported in your paper, by running:
python train.py -c confs/resnet50_imagenet_b4096.yaml --aug fa_reduced_imagenet --horovod
Moreover, per your yaml config, the lr type should be multistep (adjust_learning_rate_resnet), as can be seen in train.py, but I saw cosine lr decay adopted during my test of your code.
Waiting for your reply, thanks.

Problems running search.py

Hi,thank you for your work.

Here are my problems:

(1) from watch import PyStopwatch
ImportError: cannot import name 'PyStopwatch' from 'watch' (D:\Anaconda3\lib\site-packages\watch\__init__.py)

(2)

[2021-07-24 14:43:40,018] [Fast AutoAugment] [INFO] initialize ray...
2021-07-24 14:43:43,768 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
[2021-07-24 14:43:55,301] [Fast AutoAugment] [INFO] search augmentation policies, dataset=cifar10 model=wresnet40_2
[2021-07-24 14:43:55,301] [Fast AutoAugment] [INFO] ----- Train without Augmentations cv=5 ratio(test)=0.4 -----
['C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold0.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold1.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold2.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold3.model', 'C:\Users\djr83\Desktop\fast-autoaugment-master\FastAutoAugment\models/cifar10_wresnet40_2_ratio0.4_fold4.model']
0%| | 0/2 [00:00<?, ?it/s]2021-07-24 14:43:55,344 WARNING worker.py:1123 -- The actor or task with ID a67dc375e60ddd1affffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {GPU: 4.000000}, {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
0%| | 0/2 [18:40<?, ?it/s]

The version of Ray I use is 1.4, running under Windows 10.
Can you describe the required environment in detail?

Thank you.

New dataset example

Could you provide an example of how to use your technique to find the best policy for another dataset, or maybe for CIFAR.

Thanks for your work!

GPUs for searching

Hello, I see that the search code sets num-search to 200 and resources_per_trial to 1. Does this mean that Fast AutoAugment needs 200 GPUs for searching?

Why not updating batch norm parameters?

According to the following line, your code doesn't update batch norm parameters. What is the real reason for that?

params_without_bn = [params for name, params in model.named_parameters() if not ('_bn' in name or '.bn' in name)]

ValueError in search.py

When I run search.py, I get an error from register_trainable:

ValueError: Unknown argument found in the Trainable function. The function args must include a 'config' positional parameter. Any other args must be 'checkpoint_dir'. Found: ['augs', 'rpt']

Any ideas how to fix this?
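Judging from the error message alone, newer Ray Tune versions validate that a function trainable takes only (config, checkpoint_dir), so extra arguments such as augs and rpt would have to be bound beforehand. A hypothetical sketch (train_model, my_augs, and my_rpt are placeholders):

from ray import tune

my_augs, my_rpt = None, None   # placeholders for whatever objects search.py passes

def train_model(config, checkpoint_dir=None, augs=None, rpt=None):
    # config is the hyperparameter sample supplied by Tune; augs/rpt are the
    # extras the old positional signature used to receive.
    pass

# tune.with_parameters binds the extras, leaving the (config, checkpoint_dir)
# signature that newer Tune versions require.
tune.register_trainable('train_model', tune.with_parameters(train_model, augs=my_augs, rpt=my_rpt))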

kornia integration

hi guys! nice paper.

I was wondering if you have considered using kornia.augmentation for your project? I believe it can help to differentiate over the whole set of augmentation operators.

This would also help us a lot in testing the robustness of our API in real use cases.

Thanks in advance,
Edgar

How to run search.py, and why does search.py load checkpoints before training?

When I use
python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot ... --redis ...
I get the following result:

[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold0.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold1.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold2.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold3.model'
[Errno 2] No such file or directory: '/home/kaijie.tang/code/fast-autoaugment/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold4.model'

The code errors in search.py around line 186:

for cv_idx in range(cv_num):
    try:
        latest_ckpt = torch.load(paths[cv_idx])
        if 'epoch' not in latest_ckpt:
            epochs_per_cv['cv%d' % (cv_idx + 1)] = C.get()['epoch']
            continue
        epochs_per_cv['cv%d' % (cv_idx + 1)] = latest_ckpt['epoch']
    except Exception as e:
        continue

Why does the code load the checkpoints before training?

Where can I find or generate those model checkpoints?

Can't run search.py

First of all, thank you very much for generously sharing your code publicly.

My problem happens when I try to run the search.py file; it returns the error shown in the image below. I don't know how to obtain the models folder in the FastAutoAugment folder.
I hope you can answer this question soon. Thank you very much!

P.S.: I run your code in Google Colaboratory.

image

The result on ResNet18 is frustrating

Hi Ildoo, thank you for sharing great code.

I tried ResNet18 on ImageNet, but the result is not good. Have you ever experimented with ResNet18, or do you have any suggestions?

Hyperparameters: SGD with linear lr=0.4, batch=1024, weight decay=1e-4, epochs=120.

method   | top1
baseline | 70.68
fast aa  | 70.22

The learning curves:
image

A question regarding the `RandomCrop`

Hi,

Thank you for the work. Just wanted to share a tiny issue that confuses me.

if width == original_width and height == original_height:
    return self._fallback(img)  # https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/preprocessing.py#L102

The function seems to fall back to a central crop as soon as the condition is satisfied; however, in the original EfficientNet code, the function only falls back after max_attempts tries while the condition holds.
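For illustration, the retry-then-fall-back behavior described above might be sketched as follows (hypothetical helper names, not this repository's code):

def distorted_crop(img, sample_crop, central_crop, max_attempts=10):
    # Try a random distorted crop up to max_attempts times; only fall back
    # to the central crop if no distinct crop was produced in any attempt.
    for _ in range(max_attempts):
        crop = sample_crop(img)
        if crop.size != img.size:   # a real crop was found
            return crop
    return central_crop(img)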

It would be great if you could kindly take a look. Thank you.

Crashes in torch

Hello,
while running the code I encounter this message from the torch implementation:

what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr:
I used the suggested versions. Do you have any advice?
Thank you.

how to gather the trained models on different workers?

Hi @ildoonet, thanks for the work. While running python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot ... --redis ... on a Ray cluster, I find that the head node can't gather the models trained on the worker nodes for the subsequent policy-search stage; e.g., the main process on the head node throws the exception: No such file or directory: '/FastAutoAugment/models/cifar10_wresnet40_2_ratio0.4_fold0.model'. Can you tell me how you trained end-to-end with that command on a Ray cluster with multiple nodes working in parallel?

Reproduce on CIFAR-10 with WRN40x2

I tried the following script with the found policy on CIFAR10 as described in README.

python FastAutoAugment/train.py -c confs/wresnet40x2_cifar10_b512.yaml --aug fa_reduced_cifar10 --dataset cifar10

With the script, I achieved 92.89% accuracy (i.e., a 7.11% error rate) on the test set. However, a 3.7% error rate is reported in the README. This gap seems too large, so it looks like a bug.

How to fix it?

(My PyTorch version is 1.1.0 and machine has 4 Titan Xp GPUs.)

Why can the algorithm work?

Intuitively, the optimizer could choose no augmentation to achieve higher validation accuracy. Why does the search still work? Looking forward to your answer.

loss metric in search.py is off, augmentation policies for CIFAR10/SVHN seem random.

Hello,

I have been trying FAA in my application.

I noticed that in search.py, line 116, you're basically taking the minimum loss over all the losses for each image.

Since your loss function defined in line 95 has no reduction, losses ends up being a vector of shape (num_policies*batch_size,). Therefore you just get the minimum loss over single images as the minimum loss for your metric. Hence, if you are truly minimizing loss (as the paper says) using this code, then your reward_attr is mostly random noise, since there will almost always be at least one very good prediction giving a low loss.
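A small illustration of the point (shapes assumed from the description above):

import torch
import torch.nn.functional as F

num_policies, batch_size, num_classes = 5, 8, 10
logits = torch.randn(num_policies * batch_size, num_classes)
targets = torch.randint(0, num_classes, (num_policies * batch_size,))

losses = F.cross_entropy(logits, targets, reduction='none')  # shape: (num_policies*batch_size,)
print(losses.min())                                          # loss of the single best-predicted image
print(losses.view(num_policies, batch_size).mean(dim=1))     # per-policy mean, arguably what's intended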

This may help explain why your CIFAR-10 and SVHN policies are basically random. (I am using the policies from archive.py here.)

Each augmentation appears roughly the same number of times, with almost uniform distributions of strength and probability for each. What's plotted is the normalized probability of each augmentation ((number of times it appears / total augmentations) * mean probability of the augmentation) on the y axis, vs. the average strength of each augmentation on the x axis. The same graph holds for SVHN.

image

image

On ImageNet, you can see the distribution does seem a little less random. Perhaps this is because the loss is a little more meaningful when there are 1000 classes, so the minimum loss is a slightly less noisy reward signal.

image

By contrast, the augmentations from AutoAug:

image

This makes more sense to me: there should be some terrible operations that don't get used much, and some that are valuable and get used more. The fact that they are roughly equal for CIFAR-10 is surprising.

So the question: did you use top_1_valid to get the policies in archive.py, or minus_loss? If it's the latter, was the published code the code that was actually used?

Thanks!
Sean

Pyramidnet Issue

Hi,

I am currently trying to utilize the PyramidNet + ShakeDrop. However I am getting the following error:

RuntimeError: Output 0 of ShakeDropFunctionBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can remove this warning by cloning the output of the custom Function.

If I try to fix the error by changing some lines, the memory usage seems to increase a lot. So I was wondering whether you have also encountered this error.

Thank you!

An error occurs while executing search.py. Could anybody help me?

1. cp search.py ../search.py

2. In the directory ..../fast-autoaugment, execute the following command:
python search.py -c confs/wresnet40x2_cifar10_b512.yaml

I got this error:
46%|████████████████████████████████████████████████████████████████████████▎ | 91/200 30:06<00:30, 3.54it/s, cv1=200, cv2=200, cv3=90, cv4=200, cv5=200
(pid=31751) 0200]: 80%|████████ | 16/20 [00:04<00:01, 3.63it/s, loss=0.299, top1=0.905, top5=0.997]
(pid=31751) 0200]: 90%|█████████ | 18/20 [00:04<00:00, 4.60it/s, loss=0.298, top1=0.905, top5=0.997]
46%|████████████████████████████████████████████████████████████████████████▎ | 91/200 30:07<00:30, 3.54it/s, cv1=200, cv2=200, cv3=90, cv4=200, cv5=200
[*test 0000/0200]: 100%|██████████| 20/20 [00:06<00:00, 3.31it/s, loss=0.298, top1=0.904, top5=0.997]
(pid=31751) 2019-12-26 15:58:24,892 ERROR worker.py:433 -- SystemExit was raised from the worker
(pid=31751) Traceback (most recent call last):
(pid=31751) File "python/ray/_raylet.pyx", line 711, in ray._raylet.task_execution_handler
(pid=31751) File "python/ray/_raylet.pyx", line 694, in ray._raylet.execute_task
(pid=31751) SystemExit: 0
170500096it [30:00, 94677.34it/s]
46%|████████████████████████████████████████████████████████████████████████▎

Code doesn't run

This code seems to be broken. There were many minor bugs that I fixed, but now I see this error when running search.py, even though I have disk space left:

OSError: [Errno 28] No space left on device
2020-03-13 03:06:59,252 ERROR trial_runner.py:345 -- Trial Runner checkpointing failed.
Traceback (most recent call last):
  File "/home/zhasan/anaconda3/envs/cs234/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 343, in step
    self.checkpoint()
  File "/home/zhasan/anaconda3/envs/cs234/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 272, in checkpoint
    json.dump(runner_state, f, indent=2, cls=_TuneFunctionEncoder)
  File "/home/zhasan/anaconda3/envs/cs234/lib/python3.6/json/__init__.py", line 180, in dump
    fp.write(chunk)
OSError: [Errno 28] No space left on device

One code issue to confirm

In search.py, around line 259, the code is final_policy_set.extend(final_policy); I think it should be final_policy_set.append(final_policy).
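For reference, the difference between the two calls:

final_policy_set = []
final_policy = [('ShearX', 0.5, 0.3), ('Invert', 0.2, 0.6)]   # illustrative ops only
final_policy_set.extend(final_policy)   # splices the individual ops into the list
final_policy_set.append(final_policy)   # keeps the whole policy grouped as one entry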

cutout on ImageNet

@ildoonet From the config files, it seems there is no Cutout in the search stage on reduced_imagenet or in the eval stage on ImageNet. Could you confirm/share detailed information about Cutout on ImageNet?

the search hangs

Hi, thank you for the work.
But when I start a search experiment on a Ray cluster using the command python search.py -c confs/wresnet40x2_cifar10_b512.yaml --dataroot ... --redis ... without modifying the code much, it hangs. Is anything wrong?
hang

A question about search.py

Hi, I want to execute search.py on CIFAR-10. I notice that fa_reduced_cifar10 is the result you got, but in wresnet40x2_cifar10_b512.yaml there already exists the key aug: fa_reduced_cifar10. Should I delete it and leave it empty?

Stuck after iteration

After the iterative search in the parameter space completes, the run gets stuck with no error message (399 is the last iteration).
iter 397 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 198), ('PENDING', 1), ('PAUSED', 0), ('ERROR', 0)]
2021-05-07 16:49:31,787 WARNING logger.py:126 -- Couldn't import TensorFlow - disabling TensorBoard logging.
2021-05-07 16:49:31,787 WARNING logger.py:220 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
iter 398 ma=0.509 OrderedDict([('RUNNING', 2), ('TERMINATED', 198), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)]
2021-05-07 16:49:48,651 INFO ray_trial_executor.py:178 -- Destroying actor for trial search_par_resnet50_fold1_ratio0.4_200_cv_fold=1,cv_ratio_test=0.4,dataroot=_home_ccf_project_SB_PAR_data_rapv2_,level_0_0=0.77372,level_0_1=0.45162,level_1_0=0.00049368,level_1_1=0.39083,level_2_0=0.46218,level_2_1=0.69141,level_3_0=0.0028208,level_3_1=0.27047,level_4_0=0.65674,level_4_1=0.84919,num_op=2,num_policy=5,policy_0_0=3,policy_0_1=7,policy_1_0=0,policy_1_1=10,policy_2_0=13,policy_2_1=7,policy_3_0=5,policy_3_1=12,policy_4_0=11,policy_4_1=1,prob_0_0=0.40213,prob_0_1=0.39349,prob_1_0=0.47788,prob_1_1=0.63856,prob_2_0=0.6497,prob_2_1=0.50779,prob_3_0=0.58183,prob_3_1=0.30122,prob_4_0=0.62576,prob_4_1=0.92233,save_path=_home_ccf_project_fastautoaugment_models_par_resnet50_ratio0.4_fold1.model. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
iter 399 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 199), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)]

I found that if the following errors are reported before the iterations complete, the run gets stuck. If there are no errors, it continues to the next stage.
iter 364 ma=0.509 OrderedDict([('RUNNING', 2), ('TERMINATED', 181), ('PENDING', 17), ('PAUSED', 0), ('ERROR', 0)
(pid=45772) WARNING: Logging before InitGoogleLogging() is written to STDERR
(pid=45772) E0507 16:44:34.685359 45846 raylet_client.cc:345] IOError: [RayletClient] Connection closed unexpectedly. [RayletClient] Failed to push profile events.
ray==0.6.5
python==3.6.9
tensorflow not installed
CentOS 7

D_M and D_A portion

Could you please tell me the proportions of D_M and D_A in each fold? Are they split evenly?

Questions about the initialization of the Ray server.

Hi, I've tried to run the code for searching policies, but I have trouble initializing the Ray server.
It seems that there is something wrong with the call ray.init(redis_address=args.redis) in search.py at line 164:

Traceback (most recent call last):
  File "search.py", line 166, in
    ray.init(redis_address=args.redis)
  File "/home/xcq/anaconda3/envs/pytorch-video/lib/python3.6/site-packages/ray/worker.py", line 1425, in init
    redis_address = services.address_to_ip(redis_address)
  File "/home/xcq/anaconda3/envs/pytorch-video/lib/python3.6/site-packages/ray/services.py", line 145, in address_to_ip
    ip_address = socket.gethostbyname(address_parts[0])
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)

Any suggestions about what I could be doing wrong / how to work around this?
ray version==0.6.5

Using search.py to find AutoAugment policies

[2020-08-03 21:23:54,603] [Fast AutoAugment] [INFO] processed in 76.2692 secs
[2020-08-03 21:23:54,603] [Fast AutoAugment] [INFO] ----- Search Test-Time Augmentation Policies -----
search_cifar10_wresnet40_2_fold0_ratio0.1
Traceback (most recent call last):
  File "search.py", line 230, in <module>
    algo = HyperOptSearch(space, max_concurrent=4*20, reward_attr=reward_attr)
TypeError: __init__() got an unexpected keyword argument 'reward_attr'
[*test 0000/0010]: 100%|██████████| 79/79 [00:01<00:00, 50.82it/s, loss=0.459, top1=0.848, top5=0.994, loss_ema=0.423]

When I use python search.py -c confs/wresnet40x2_cifar.yaml --aug default I get these errors, and I want to know where I can see the AutoAugment policy I searched for.

Two questions about fast-autoaugment (k-fold data, and training)

Hello, I have some questions.

In the paper, it is written that the train dataset (D_train) is divided into k folds, each split into D_M and D_A by a given ratio, and the policy-search process is run for each fold. However, in the data.py code, it seems the data is not actually split so that each fold holds 1/k of the dataset.

Also, when finally training on D_train with all the found policies merged, the implementation does not apply every policy to D_train; instead it randomly selects one of the sub-policies from the merged set, applies that transform, and trains.

I am wondering if I am misunderstanding these two things.

Thank you.

Which ResNet50 did you use?

Hi, I want to run your code for ImageNet, but it seems the ResNet-50 implementation is missing.

In fast-autoaugment/FastAutoAugment/networks/__init__.py,

from pretrainedmodels import models
...
model = models.resnet50(num_classes=num_class, pretrained=None)

but pretrainedmodels is not uploaded yet. Is this ResNet50 from torchvision or your original implementation?

Inaccurate "accuracy" when testing the uploaded models.

I ran the testing code using your provided models.
CIFAR-10:
[Wide-ResNet-28-10 | 3.9 | 3.1 | 2.6 | 2.7 / 2.7 | Download]
CIFAR-100:
[Wide-ResNet-28-10 | 18.8 | 18.4 | 17.1 | 17.3 / 17.3 | Download]
But I can't get the paper's results. Is something wrong?
Looking forward to your reply, thank you~
The CIFAR-10 results are below:
[2020-11-12 06:10:48,729] [Fast AutoAugment] [WARNING] tag not provided, no tensorboard log.
[2020-11-12 06:10:48,730] [Fast AutoAugment] [INFO] ./FAA_Paper_models/cifar10_wresnet28x10_top1.pth file found. loading...
[2020-11-12 06:10:48,934] [Fast AutoAugment] [INFO] checkpoint epoch@10
[2020-11-12 06:10:48,941] [Fast AutoAugment] [INFO] optimizer.load_state_dict+
[2020-11-12 06:10:48,950] [Fast AutoAugment] [INFO] evaluation only+
[2020-11-12 06:11:57,150] [Fast AutoAugment] [INFO] done.
[2020-11-12 06:11:57,151] [Fast AutoAugment] [INFO] model: {'type': 'wresnet28_10'}
[2020-11-12 06:11:57,151] [Fast AutoAugment] [INFO] augmentation: fa_reduced_cifar10
[2020-11-12 06:11:57,151] [Fast AutoAugment] [INFO]
{
"loss_train": NaN,
"loss_valid": 0.0,
"loss_test": NaN,
"top1_train": 0.09475160256410256,
"top1_valid": 0.0,
"top1_test": 0.0909,
"top5_train": 0.4834735576923077,
"top5_valid": 0.0,
"top5_test": 0.4729,
"epoch": 0
}
[2020-11-12 06:11:57,152] [Fast AutoAugment] [INFO] elapsed time: 0.021 Hours
[2020-11-12 06:11:57,152] [Fast AutoAugment] [INFO] top1 error in testset: 0.9091
[2020-11-12 06:11:57,152] [Fast AutoAugment] [INFO] ./FAA_Paper_models/cifar10_wresnet28x10_top1.pth

Search policy on my custom dataset

Hi, the code seems good for research purposes. There is not much documentation, but that's OK.
I think I've managed to understand what it does and how it works.

Now I need to use this package on my own custom datasets (none of the default torchvision.datasets) for production purposes.

Any idea how to run search.py with a custom dataset?

Thanks
