
commefficient's People

Contributors

dhroth, kiddyboots216, sunahhlee


commefficient's Issues

Help with the PersonaChat experiment command

Hello authors, thank you for your great paper -- I'll be citing it in my upcoming work.

I am trying to run your code, specifically the PersonaChat experiments. Would you be able to provide the exact set of commands used to generate the results in Figure 5 (i.e., ~15 ppl)? I have read the paper and all of your code (you did a lot of work on this!), and I want to confirm the command for the run. It would be extremely helpful if you could share your commands.

At the moment I am simply trying to replicate the uncompressed results. My current command is:

 python gpt2_train.py \
--dataset_name PERSONA \
--local_momentum 0 \
--dataset_dir ./dataset/personachat \
--mode uncompressed \
--seed 42 \
--local_batch_size -1 \
--num_results_train 1 --num_results_val 2 \
--num_epochs 1 \
--valid_batch_size 4 \
--port 6239 \
--num_workers 4 --num_devices 4

Is this right? Have I missed a hyperparameter that might be important?

I tried changing the learning rate to 0.16 with --lr_scale 0.16, but I quickly get NaNs. Should I set lm_coef or mc_coef to something other than the default? The results I'm getting are quite a bit worse than the paper's, so I'm trying to track down the discrepancy.

Thank you so much for your help! I really do appreciate it.

ResNet9 Pooling

Hi authors,

While training on CIFAR10 using ResNet9 (from models/resnet9.py) with the default settings (i.e., the default channel sizes), I got an error at the last pooling layer, out = self.pool(out).view(out.size()[0], -1), in BasicNet.forward.

After printing the size of the tensor entering the pooling layer, I found that the input was of size [batch_size, 512, 3, 3]. I fixed the error by changing the last pooling layer to self.pool = nn.MaxPool2d(2), which makes sense since the last linear layer expects a tensor of size [batch_size, 512].

Has this error come up before, or could something else have gone wrong? I see that the last pooling layer was hardcoded to nn.MaxPool2d(4); was the expected input of size [batch_size, 512, 5, 5], and did I do something wrong upstream?
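
For reference, here is a minimal standalone shape check (my own snippet, independent of the repo's code) showing why nn.MaxPool2d(2) works on the input I observed while nn.MaxPool2d(4) cannot:

import torch
import torch.nn as nn

x = torch.randn(8, 512, 3, 3)  # the size I observed entering the final pool

# nn.MaxPool2d(4) needs at least a 4x4 spatial input (kernel 4, stride 4),
# so on a 3x3 input it raises "RuntimeError: ... Output size is too small".
# nn.MaxPool2d(4)(x)

out = nn.MaxPool2d(2)(x)          # kernel 2, stride 2 -> [8, 512, 1, 1]
flat = out.view(out.size(0), -1)  # -> [8, 512], what the final linear layer expects
print(out.shape, flat.shape)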

Thanks,
Howard

Question about the final test_acc in the CIFAR10 experiment

Hi, I tried to reproduce the experiment results in the paper. I am using the following command:

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0 --local_batch_size -1 --share_ps_gpu

The accuracy does not look right; could you help me figure out the problem? The logs are:

MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch lr train_time train_loss train_acc test_loss test_acc total_time
1 0.0800 25.2243 2.3028 0.1038 2.3012 0.1405 30.4418
2 0.1600 23.8377 2.3025 0.1017 2.2936 0.1460 57.5426
3 0.2400 24.0562 2.2886 0.1157 2.2449 0.1507 84.8176
4 0.3200 23.0985 2.2479 0.1461 2.1887 0.1535 111.1938
5 0.4000 22.0071 2.2901 0.0944 2.2941 0.0930 136.4487
6 0.3789 21.9321 2.3150 0.1301 3.3015 0.0997 161.6546
7 0.3579 21.9460 2.3782 0.1078 2.2771 0.1324 186.8818
8 0.3368 21.8156 2.2793 0.1264 2.2281 0.1360 211.9292
9 0.3158 21.6892 2.2410 0.1775 2.2307 0.1417 236.9210
10 0.2947 21.9432 2.2989 0.1024 2.2831 0.1175 262.0983
11 0.2737 21.9095 2.2511 0.1332 2.1657 0.1901 287.2876
12 0.2526 27.3621 2.1729 0.1771 2.1231 0.1734 321.2075
13 0.2316 37.6449 2.1274 0.1580 2.1067 0.2008 365.1934
14 0.2105 32.6825 2.3116 0.1308 2.0721 0.2026 401.1535
15 0.1895 22.3018 2.1435 0.1707 2.0014 0.2332 426.7760
16 0.1684 30.7159 2.0729 0.1982 2.1173 0.2312 460.7642
17 0.1474 22.4368 2.1110 0.2006 2.0027 0.2580 489.7420
18 0.1263 39.1600 2.0538 0.1897 2.0412 0.2377 535.3520
19 0.1053 38.9138 2.0614 0.2156 2.0193 0.2655 580.7346
20 0.0842 21.9821 1.9763 0.2441 2.0301 0.2679 605.9769
21 0.0632 32.8850 1.9892 0.2655 2.0524 0.2711 645.6084
22 0.0421 38.2427 1.9478 0.2627 1.8612 0.3094 690.2626
23 0.0211 38.3396 1.9010 0.2778 1.8869 0.2993 735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24 0.0000 33.1543 1.9016 0.2929 1.8394 0.3032 771.5566
done training
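
For reference, the lr column above matches a piecewise-linear warmup/decay schedule (my own reconstruction; I am assuming knots at [0, pivot_epoch, num_epochs] with values [0, lr_scale, 0], which would also explain the "WARNING: LR is 0" lines at the final epoch):

import numpy as np

# Reconstruction (an assumption, not necessarily the repo's exact code) of the
# schedule implied by the lr column: linear warmup to lr_scale at pivot_epoch,
# then linear decay to 0 at num_epochs.
def lr_at_epoch(epoch, lr_scale=0.4, pivot_epoch=5, num_epochs=24):
    return np.interp(epoch, [0, pivot_epoch, num_epochs], [0.0, lr_scale, 0.0])

print(lr_at_epoch(1), lr_at_epoch(6), lr_at_epoch(24))  # -> 0.08, 0.3789..., 0.0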

PersonaChat (GPT2)

Hello, sorry to bother you again. I want to reproduce the experiment shown in the attached figure; could you tell me what the run parameters are?
[attached image: GPT-2 PersonaChat results figure]

Query for the type of malicious attack

Dear authors,
I am studying this paper and its source code. Could you tell me what kinds of attack the letters "A", "B", "C", and "D" (in MAL_ATTACK_TYPES in utils.py) represent?
Thank you!

Issue with fetchpgd

Hi, sorry to bother you again.
I want to run only fetchpgd on CommEfficient-attacks, to test the accuracy difference between the SparseFed and FetchSGD methods, but I encounter a problem.
My hyperparameters are --dataset_dir data/cifar10 --tensorboard --dataset_name CIFAR10 --model ResNet9 --mode fetchpgd --k 10000 --num_blocks 1 --num_rows 1 --num_cols 325000 --num_clients 200 --num_workers 10 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9
The k, num_rows, and num_cols values I use are the same as for FetchSGD.
But I got:

File "CommEfficient-attacks\CommEfficient-attacks\CommEfficient\fed_worker.py", line 177, in worker_loop
    sum_g += g
RuntimeError: The size of tensor a (500000) must match the size of tensor b (6568640) at non-singleton dimension 1
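
To illustrate what I think is going wrong (a toy reproduction of my own, not the repo's code): the accumulator seems to be sketch-sized while the incoming gradient is full-length (6,568,640 matches the ResNet9 gradient size printed by the repo), so the element-wise add fails:

import torch

sketch_numel = 500_000    # size of tensor a in my error
grad_numel = 6_568_640    # size of tensor b (the full ResNet9 gradient)

sum_g = torch.zeros(1, sketch_numel)  # sketch-sized accumulator
g = torch.randn(1, grad_numel)        # full-length gradient from a worker
# sum_g += g  # RuntimeError: The size of tensor a (500000) must match the
#             # size of tensor b (6568640) at non-singleton dimension 1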
Could you please help me fix it? Thanks a lot!

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Hello,

I'm trying to run cv_train.py with the command-line arguments "--dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1" under PyTorch 1.8.0, but I am met with the following error:

File "D:\CommEfficient\CommEfficient\cv_train.py", line 405, in
lr_scheduler = LambdaLR(opt, lr_lambda=lambda_step)

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 203, in init
super(LambdaLR, self).init(optimizer, last_epoch, verbose)

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 77, in init
self.step()

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 152, in step
values = self.get_lr()

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 250, in get_lr
return [base_lr * lmbda(self.last_epoch)

File` "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 250, in
return [base_lr * lmbda(self.last_epoch)

File "D:\Dewen\GitHub\CommEfficient-master\CommEfficient\cv_train.py", line 404, in
lambda_step = lambda step: lr_schedule(step / spe)

File "D:\Dewen\GitHub\CommEfficient-master\CommEfficient\utils.py", line 28, in call
return np.interp([t], self.knots, self.vals)[0]

File "<array_function internals>", line 180, in interp
File "D:\anaconda3\envs\test\lib\site-packages\numpy\lib\function_base.py", line 1570, in interp
return interp_func(x, xp, fp, left, right)

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Have you encountered this error? I've tried running the code on both Linux and Windows and hit the same issue on both. Am I missing a command-line argument?
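
For what it's worth, the same TypeError can be reproduced in isolation when one of the interpolation values is None, which makes NumPy build an object-dtype array; so my guess is that some schedule knot or value is ending up unset (a hypothesis, not a confirmed diagnosis):

import numpy as np

# Minimal reproduction: a None among the values gives the array dtype('O'),
# and np.interp then fails with exactly this TypeError.
knots = [0, 5, 24]
vals = [0.0, None, 0.0]
np.interp([1.0], knots, vals)
# TypeError: Cannot cast array data from dtype('O') to dtype('float64')
# according to the rule 'safe'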

Reproduce the results in the paper

Hi, I tried to reproduce the experiment results in the paper using the command below, but the logs do not look correct. Could you share the command lines you used for the paper? I am really interested in your work and would like to explore sketching techniques further.

python cv_train.py --dataset_name CIFAR10 --iid --num_workers 2 --lr_scale 0.4 --local_momentum=0.0 --num_devices 2 --num_clients 2

MY PID: 31280
5315 port in use, trying next...
Namespace(checkpoint_path='./checkpoint', dataset_dir='./dataset', dataset_name='CIFAR10', device='cuda', do_batchnorm=False, do_checkpoint=False, do_dp=False, do_finetune=False, do_iid=True, do_test=False, do_topk_down=False, dp_mode='worker', error_type='none', eval_before_start=False, fedavg_batch_size=-1, fedavg_lr_decay=1, finetune_path='./finetune', finetuned_from=None, k=50000, l2_norm_clip=1.0, lm_coef=1.0, local_batch_size=8, local_momentum=0.0, lr_scale=0.4, max_grad_norm=None, max_history=2, mc_coef=1.0, microbatch_size=-1, mode='sketch', model='ResNet9', model_checkpoint='gpt2', nan_threshold=999, noise_multiplier=0.0, num_blocks=20, num_candidates=2, num_clients=2, num_cols=500000, num_devices=2, num_epochs=24, num_fedavg_epochs=1, num_results_train=2, num_results_val=2, num_rows=5, num_workers=2, personality_permutations=1, pivot_epoch=5, port=5646, seed=21, share_ps_gpu=False, train_dataloader_workers=0, use_tensorboard=False, val_dataloader_workers=0, valid_batch_size=8, virtual_momentum=0, weight_decay=0.0005)
50000 625
Using BatchNorm: False
Finished initializing in 11.00 seconds
miniconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
CommEfficient/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1055.)
grad_vec.add_(args.weight_decay / args.num_workers, weights)
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 655.4752 2.3025 0.1009 2.3025 0.1014 0 59606 679.6477
2 0.1600 649.9156 2.3025 0.1008 2.3025 0.1014 0 59606 1343.1710
3 0.2400 649.3290 2.3025 0.1011 2.3025 0.1014 0 59606 2006.0574

Variable velocity

Does the variable velocity stand for momentum? There is no "velocity" in the paper.

If I set args.local_momentum = 0 with mode = sketch, will local_velocity always be None?
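
For context, my understanding is that "velocity" is the conventional name for the momentum buffer in SGD; a minimal sketch (standard SGD with momentum, not the authors' code):

import torch

# v accumulates gradients like a physical velocity; with momentum = 0
# there is nothing to track, so the buffer can stay None.
def momentum_step(w, grad, v, lr=0.1, momentum=0.9):
    if momentum == 0:
        return w - lr * grad, None
    v = momentum * v + grad if v is not None else grad.clone()
    return w - lr * v, v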

Could the authors provide more commands to reproduce all of the experiments in the paper?

Question about the CIFAR10 experiment command

Hello, authors. When I run "python cv_train.py", train_loss and train_acc are always NaN.
Here is my command:

python cv_train.py
--dataset_dir /home/data/cifar
--dataset_name CIFAR10
--num_results_train 1
--train_dataloader_workers 4
--val_dataloader_workers 4
--num_devices 2
--error_type virtual
--lr_scale 0.3
--num_workers 4
--num_clients 10000

Have I missed some important hyperparameters? Could you please provide the exact set of commands you used? It would be very helpful if you shared them.

Issue with sketch mode

Hi,
Sorry to bother you, but I have some problems deploying sketch mode.
I get 0 gradient updates and 0 downloaded bytes on the second iteration. Could you please tell me what the problem is?
My hyperparameters are --dataset_dir data/cifar10 --tensorboard --local_batch_size 50 --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0

I think the problem may be in sketch.accumulateVec(grad), but I don't know how to modify it.
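
For context, here is a toy Count Sketch accumulation I put together while debugging (just an illustration of what accumulateVec does conceptually; it is not the CSVec library the repo actually uses):

import torch

class ToyCountSketch:
    # Each coordinate hashes to one bucket per row with a random +/-1 sign;
    # accumulating a vector is a signed scatter-add into each row.
    def __init__(self, d, num_rows, num_cols, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.buckets = torch.randint(num_cols, (num_rows, d), generator=g)
        self.signs = torch.randint(2, (num_rows, d), generator=g) * 2 - 1
        self.table = torch.zeros(num_rows, num_cols)

    def accumulate_vec(self, vec):
        for r in range(self.table.size(0)):
            self.table[r].scatter_add_(0, self.buckets[r], self.signs[r] * vec)

sketch = ToyCountSketch(d=1000, num_rows=5, num_cols=200)
sketch.accumulate_vec(torch.randn(1000))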

Thanks a lot.
